How to Run Your Own LLMs, From Silicon to Service

December 07, 2024 · 31 min · Free

Description

Charles Frye, AI Engineer at Modal Labs, walks through the complete stack for running your own LLM inference service: hardware options (CPUs, GPUs, TPUs, LPUs), model choices (Qwen, LLaMA), inference servers (TensorRT-LLM, vLLM, SGLang), and observability tools (the OTel stack, LangSmith, W&B Weave, Braintrust). He explains why teams might run their own inference service rather than call a hosted API, citing control, security, data governance, and cost savings, especially for bursty workloads or when a fine-tuned or distilled model suffices. The talk also covers picking appropriate hardware, building robust evaluation frameworks, selecting and quantizing models, and why an inference server matters for optimized performance, and it argues that as LLMs and the serving stack mature, self-hosted inference becomes increasingly practical. Frye closes with a brief demo of Modal's serverless compute platform for GPU access and pointers to other tools and concepts for building efficient LLM inference pipelines.
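
To make the serving-stack pieces above concrete, here is a minimal sketch of self-hosting one of the mentioned models behind an inference server. It assumes vLLM installed on a machine with a CUDA GPU and uses Qwen2.5-7B-Instruct as an illustrative model; the specific commands, flags, and model name are assumptions for the example, not prescriptions from the talk.

    # Start vLLM's OpenAI-compatible server in a shell
    # (illustrative; requires a CUDA GPU and `pip install vllm`):
    #   vllm serve Qwen/Qwen2.5-7B-Instruct --max-model-len 8192
    #
    # Then query the local endpoint with the standard OpenAI client:
    from openai import OpenAI

    # vLLM listens on port 8000 by default; the API key is unused locally.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": "When does self-hosting LLM inference pay off?"}],
        max_tokens=256,
    )
    print(response.choices[0].message.content)

Because SGLang also exposes an OpenAI-compatible endpoint, swapping inference servers mostly changes the launch command rather than the client code; TensorRT-LLM is typically deployed behind a separate serving frontend.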