Tech Talks

How Do You Scale to Billions of Fine-Tuned LLMs? (5 min)
  Event: MLOps World & Generative AI World 2024
  Speaker: James Dbiorin
  Tags: cuda, batching, llm, large-language-models, fine-tuning, lora, inference, scalability, parameter-efficient-fine-tuning, gpu, mlops, ai
LLMs From Dream to Deployed (29 min)
  Event: MLOps World & Generative AI World 2024
  Speaker: Josh Goldstein
  Tags: chatbots, seldon, llm, large-language-models, machine-learning, mlops, deployment, retrieval-augmented-generation, rag, kubernetes, openai, hugging-face, gpu
Running Multiple Models on the Same GPU, on Spot Instances (33 min)
  Event: MLOps World & Generative AI World 2024
  Speaker: Oscar Rovira
  Tags: ml-inference, spot-instances, gpu-fractionalization, gpu, cost-optimization, generative-ai, llm, cloud-computing, aws, gcp, azure, mlops
From Idea to Production: AI Infra for Scaling LLM Apps (39 min)
  Event: MLOps World & Generative AI World 2024
  Speaker: Guy Eshet
  Tags: llm, ai, ai-infrastructure, llm-ops, prompt-engineering, model-deployment, gpu, data-pipelines, rag, cost-optimization, generative-ai, llm-applications
Lessons Learned from Scaling Large Language Models in Production (41 min)
  Event: MLOps World & Generative AI World 2024
  Speaker: Matt Squire
  Tags: ray-serve, large-language-models, llm, rag, mlops, gpu, performance-optimization, inference, scaling, python, fastapi, kubernetes, vm, vector-database
From ML Repository to ML Production Pipeline (36 min)
  Event: MLOps World & Generative AI World 2024
  Speakers: Jakub Witkowski, Dariusz Adamczyk
  Tags: production-pipelines, ml-repository, mlops, machine-learning, devops, docker, kubernetes, ci-cd, kubeflow, data-science, gpu, automation
Leverage Kubernetes To Optimize the Utilization of Your AI Accelerators (23 min)
  Event: MLOps World & Generative AI World 2024
  Speaker: Nathan Beach
  Tags: accelerators, kubernetes, kubernetes-engine, ai, gpu, optimization, training, inference, workloads, resource-utilization, cloud-computing
Memory Optimizations for Machine Learning (32 min)
  Event: MLOps World & Generative AI World 2024
  Speaker: Tejas Chopra
  Tags: model-pruning, neural-networks, cpu, data-quantization, machine-learning, llm, memory-optimization, quantization, inference, deep-learning, transformer-models, gpu
How to Run Your Own LLMs, From Silicon to Service (31 min)
  Event: MLOps World & Generative AI World 2024
  Speaker: Charles Frye
  Tags: llms, large-language-models, mlops, machine-learning-operations, inference, gpu, quantization, tensorrt-llm, vllm, modal-labs, model-serving, ai-engineering
Large Language Model Training and Serving at LinkedIn (24 min)
  Event: MLOps World & Generative AI World 2024
  Speaker: Dre Olgiati
  Tags: llm, large-language-models, ai, machine-learning, mlops, training, gpu, kubernetes, python, tensorflow, pytorch, kernels, optimization, memory-management, transformer
Streamlining AI Deployments (8 min)
  Event: MLOps World & Generative AI World 2024
  Speaker: Vasilis Vagias
  Tags: ai, llm, mlops, deployment, optimization, inference, compiler, pytorch, docker, gpu, api
Python Meets Heterogeneous Computing (59 min)
  Event: PyCon US 2023
  Speakers: William Cunningham, Santosh Kumar Radha
  Tags: python, heterogeneous-computing, distributed-computing, gpu, quantum-computing, hpc, workflow-orchestration, performance-optimization, cloud-hpc, open-source-tools, data-science, machine-learning

© 2025 Tech Talks. All rights reserved.
