Measuring the Minds of Machines: Evaluating Generative AI Systems

December 07, 2024 · 30 min · Free

Description

Jineet Doshi, Staff Data Scientist/AI Lead at Intuit, discusses the critical challenges in evaluating Generative AI systems, particularly Large Language Models (LLMs), and the various approaches to doing so. He highlights why traditional evaluation methods fall short for LLMs, given their open-ended outputs and broad capabilities. The talk explores techniques ranging from NLP-based metrics and semantic similarity to benchmarks, human labelers, and the increasingly popular 'LLM-as-a-judge' paradigm. Doshi also touches on evaluating LLMs for safety and security, as well as evaluating more complex systems such as Retrieval-Augmented Generation (RAG) and agents, emphasizing the need for holistic and scalable evaluation strategies.
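
The description only names these techniques; as a rough illustration (not drawn from the talk itself), the sketch below shows the general shape of an 'LLM-as-a-judge' check in Python. The judge_llm function, the rubric wording, and the 1-5 score scale are all assumptions introduced here for illustration, not details from the talk.

```python
# Minimal sketch of the "LLM-as-a-judge" paradigm: a judge model scores
# another model's answer against a rubric. Illustrative only.

def judge_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to an LLM API of your choice."""
    raise NotImplementedError("Wire this to your LLM provider.")

def score_answer(question: str, answer: str) -> int:
    """Ask the judge model to rate an answer from 1 (poor) to 5 (excellent)."""
    rubric = (
        "You are an impartial evaluator. Rate the ANSWER to the QUESTION "
        "for correctness, completeness, and clarity on a 1-5 scale. "
        "Reply with a single integer only.\n\n"
        f"QUESTION: {question}\n\nANSWER: {answer}\n\nSCORE:"
    )
    reply = judge_llm(rubric)
    return int(reply.strip())  # assumes the judge follows the output format

if __name__ == "__main__":
    q = "What does RAG stand for in the context of LLMs?"
    a = "Retrieval-Augmented Generation."
    print(score_answer(q, a))
```

In practice such judge scores are typically aggregated over an evaluation set and spot-checked against human labels, which connects the automated and human-labeler approaches the description mentions.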