How Do You Scale to Billions of Fine-Tuned LLMs?
May 15, 2024
5 min
Free
cuda
batching
llm
large-language-models
fine-tuning
lora
inference
scalability
parameter-efficient-fine-tuning
gpu
mlops
ai
Description
James Dborin of Titan ML discusses Batched LoRA Inference, a method that enables scaling to billions of personalized, fine-tuned LLMs without prohibitive compute costs. The talk addresses the challenges of deploying numerous specialized LLMs and explains how parameter-efficient fine-tuning techniques, combined with batching strategies, allow many fine-tuned models to run in parallel on a single GPU. Because each LoRA adapter is tiny relative to the shared base model, requests targeting different adapters can be served in one batch, significantly reducing cost and improving efficiency, akin to a CDN for LLMs.
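The talk itself does not include code, but the core idea can be sketched briefly: the expensive base-model matmul is shared across the batch, while each request's low-rank LoRA update is applied with cheap batched matmuls. The following minimal PyTorch sketch is an illustration under assumed names and shapes (batched_lora_linear, lora_A, lora_B, and the rank r are all hypothetical), not Titan ML's implementation:

```python
import torch

def batched_lora_linear(x, base_weight, lora_A, lora_B, scaling=1.0):
    """Apply a shared base linear layer plus a per-request LoRA update.

    x:           (batch, d_in)     one hidden state per request
    base_weight: (d_out, d_in)     shared base-model weight
    lora_A:      (batch, r, d_in)  each request's LoRA A matrix
    lora_B:      (batch, d_out, r) each request's LoRA B matrix
    """
    # Shared base projection, computed once for the whole batch.
    base_out = x @ base_weight.T                    # (batch, d_out)
    # Per-request low-rank update via batched matmuls:
    # delta_i = B_i @ (A_i @ x_i), costing O(r * (d_in + d_out)) per request.
    delta = torch.bmm(lora_A, x.unsqueeze(-1))      # (batch, r, 1)
    delta = torch.bmm(lora_B, delta).squeeze(-1)    # (batch, d_out)
    return base_out + scaling * delta

# Example: four requests, each routed to a different fine-tune.
batch, d_in, d_out, r = 4, 1024, 1024, 8
x = torch.randn(batch, d_in)
W = torch.randn(d_out, d_in)
A = torch.randn(batch, r, d_in)
B = torch.zeros(batch, d_out, r)  # standard LoRA init: B = 0
y = batched_lora_linear(x, W, A, B)  # (4, 1024)
```

Since each adapter adds only r * (d_in + d_out) parameters per layer, thousands of adapters fit in GPU memory alongside one base model, which is what makes serving many fine-tunes from a single GPU feasible.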