Evaluating LLMs and RAG Pipelines at Scale
Description
This talk introduces Valor, an open-source evaluation service designed to address the challenges of evaluating Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) pipelines at scale. The presentation covers how Valor supports rigorous, real-world testing of these systems, handling challenges such as unstructured outputs and parameter variation while meeting the need for production-ready evaluation. It highlights key benefits such as queryable and discoverable metrics, cached inferences that avoid rerunning expensive model computations, and robust lineage tracking.

The talk also covers the specifics of LLMOps evaluation, contrasting traditional ML evaluation with the new paradigms introduced by LLMs and RAG, and shows how Valor can be integrated into an existing LLMOps tech stack. The discussion includes sample code demonstrating how to evaluate a RAG pipeline with Valor, emphasizing its flexibility, its rich metadata support for identifying performance biases, and its ability to track LLM pipeline settings.
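To give a flavor of the workflow the demo walks through, the sketch below shows how a RAG pipeline might be evaluated against a Valor-style service: register reference questions and answers with metadata, register the pipeline and its settings, cache each prediction once, then request metrics. The client import and every method name on it (create_dataset, add_groundtruth, create_model, add_prediction, evaluate_text_generation) are illustrative assumptions, not the verbatim Valor API; consult the Valor documentation for the actual interface.

# Minimal sketch of evaluating a RAG pipeline with a Valor-style service.
# NOTE: the client import and the method names used on it are illustrative
# assumptions, not the verbatim Valor API.

from valor_style_client import ValorClient  # hypothetical client wrapper

# A tiny evaluation set: questions, reference answers, and metadata that can
# later be queried to look for performance biases (e.g. by topic).
EVAL_SET = [
    {"question": "What does RAG stand for?",
     "answer": "Retrieval-Augmented Generation",
     "topic": "definitions"},
    {"question": "Why cache LLM inferences during evaluation?",
     "answer": "To avoid re-running expensive model calls on every evaluation.",
     "topic": "llmops"},
]


def run_rag_pipeline(question: str) -> tuple[str, list[str]]:
    """Stand-in for the real RAG pipeline: returns (generated answer, retrieved contexts)."""
    contexts = ["...retrieved passage..."]
    return "...generated answer...", contexts


client = ValorClient("http://localhost:8000")

# 1. Register ground truths once; metadata rides along with each datum.
dataset = client.create_dataset("rag-eval-questions")
for item in EVAL_SET:
    dataset.add_groundtruth(
        query=item["question"],
        reference_answer=item["answer"],
        metadata={"topic": item["topic"]},
    )
dataset.finalize()

# 2. Register the pipeline as a model, tagging the settings to track
#    (LLM, retriever, top_k) so runs with different parameters stay comparable.
model = client.create_model(
    "rag-v1",
    metadata={"llm": "gpt-4o-mini", "retriever": "bm25", "top_k": 5},
)

# 3. Add predictions; the service caches them so later evaluations do not
#    need to re-call the LLM.
for item in EVAL_SET:
    generated, contexts = run_rag_pipeline(item["question"])
    model.add_prediction(
        dataset,
        query=item["question"],
        answer=generated,
        retrieved_contexts=contexts,
    )

# 4. Run the evaluation; metrics become queryable and discoverable server-side.
evaluation = model.evaluate_text_generation(
    dataset,
    metrics=["answer_relevance", "context_precision"],
)
print(evaluation.metrics)

Because ground truths, predictions, and pipeline settings are all stored with metadata, the same cached inferences can be sliced later (for example, metrics by topic) without rerunning the pipeline.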