The State and Future of Cloud-Native Model Serving
Description
KServe is a cloud-native open-source project for serving production ML models, built on CNCF projects such as Knative and Istio. This talk provides an update on KServe's progress toward 1.0, covering recent developments like ModelMesh and InferenceGraph, and lays out its future roadmap.

The talk delves into the Kubernetes design patterns KServe uses to achieve its core ML inference capabilities, its design philosophy, and how it integrates with the CNCF ecosystem. The discussion highlights the InferenceService interface, which encapsulates networking, lifecycle, and server configuration, enabling seamless integration of serverless capabilities with model servers such as TensorFlow Serving, TorchServe, and Triton on CPUs and GPUs.

It then explores advanced scenarios, demonstrating how to quickly set up KServe for production-ready deployments with scalability, security, observability, and autoscaling acceleration, leveraging CNCF projects such as Knative, Istio, SPIFFE/SPIRE, OpenTelemetry, and Fluid. Finally, the presentation touches on the challenges of and solutions for serving large language models (LLMs) such as BloombergGPT, including distributed inference techniques and performance optimizations.
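To give a flavor of the InferenceService interface described above, the following is a minimal sketch using the KServe Python SDK to deploy a scikit-learn model; the service name, namespace, and storage URI are illustrative placeholders, not examples taken from the talk.

    from kubernetes import client
    from kserve import (
        KServeClient,
        V1beta1InferenceService,
        V1beta1InferenceServiceSpec,
        V1beta1PredictorSpec,
        V1beta1SKLearnSpec,
        constants,
    )

    # An InferenceService is the single interface described in the abstract:
    # the user declares the model server and model location, while the KServe
    # controller wires up networking, lifecycle, and (on Knative) autoscaling.
    isvc = V1beta1InferenceService(
        api_version=constants.KSERVE_V1BETA1,
        kind=constants.KSERVE_KIND,
        metadata=client.V1ObjectMeta(
            name="sklearn-iris",          # hypothetical service name
            namespace="kserve-test",      # hypothetical namespace
        ),
        spec=V1beta1InferenceServiceSpec(
            predictor=V1beta1PredictorSpec(
                sklearn=V1beta1SKLearnSpec(
                    # Publicly available sample model from the KServe docs
                    storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"
                )
            )
        ),
    )

    # Submit the InferenceService to the cluster configured in kubeconfig.
    KServeClient().create(isvc)

Swapping the predictor spec (e.g., to a TorchServe or Triton runtime) changes the model server without touching the networking or lifecycle configuration, which is the encapsulation the abstract refers to.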