Running Multiple Models on the Same GPU, on Spot Instances

May 16, 2024 33 min Free

MLOps World - MLOps World & Generative AI World 2024

ml-inference spot-instances gpu-fractionalization gpu cost-optimization generative-ai llm cloud-computing aws gcp azure mlops

Description

Oscar Rovira, Co-founder of Mystic AI, discusses two key cost optimization strategies for running ML inference in the cloud: GPU fractionalization and the use of Spot Instances. He explains what GPU fractionalization is, its benefits and limitations, and the value of using Spot Instances along with their potential challenges. The talk includes examples of how combining these techniques can increase throughput and reduce costs for Generative AI applications.
