Lessons learned from scaling large language models in production
May 16, 2024
41 min
Free
ray-serve
large-language-models
llm
rag
mlops
gpu
performance-optimization
inference
scaling
python
fastapi
kubernetes
vm
vector-database
Description
Open source models have made running your own LLM accessible to many people. It's pretty straightforward to set up a model like Mistral alongside a vector database and build your own RAG application. But making it scale to high-traffic demands is another story. LLM inference itself is slow, and GPUs are expensive, so we can't simply throw hardware at the problem. Once you add things like guardrails to your application, latencies compound. In this talk, Matt Squire will share lessons learned from his experience building and running LLMs for customers at scale. Using real code examples, he'll cover performance profiling, getting the most out of GPUs, and interactions with guardrails.
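As a rough illustration of the kind of setup the description starts from (retrieve context from a vector store, then prompt an LLM), here is a minimal, self-contained sketch. The embed() placeholder and the tiny in-memory index are assumptions for illustration only, not the speaker's code or any particular vector database's API.

```python
# Minimal RAG retrieval sketch: embed a query, rank a few stored chunks by
# cosine similarity, and assemble a prompt for an LLM. Everything here is a
# stand-in for the real components (embedding model, vector database, LLM).

import math


def embed(text: str) -> list[float]:
    # Placeholder embedding: normalised character-frequency vector.
    # A real system would call an embedding model here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-normalised, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


# Tiny in-memory "vector database": (chunk, embedding) pairs.
CHUNKS = [
    "Mistral is an open-weight large language model.",
    "RAG retrieves relevant context before generation.",
    "GPU memory limits how many requests can be batched together.",
]
INDEX = [(chunk, embed(chunk)) for chunk in CHUNKS]


def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank stored chunks by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str) -> str:
    # The prompt that would be sent to the LLM for generation.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    print(build_prompt("How does RAG use a vector database?"))
```

The hard part the talk focuses on is not this retrieval loop itself but what happens under load: slow generation, expensive GPUs, and added latency from guardrails layered on top.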