Evaluating LLM-Judge Evaluations: Best Practices
December 09, 2024
30 min
Free
llm
llm-evaluation
generative-ai
ai-collusion
evaluation-metrics
machine-learning
mlops
prompt-engineering
large-language-models
ai-bias
Description
In this talk, Aishwarya Naresh Reganti, Applied Scientist at Amazon, discusses the challenges of evaluating LLM-based judges and presents best practices to mitigate the "AI Collusion Problem." The presentation covers why traditional metrics struggle with generative AI, introduces LLM judges as a solution, and explores common biases such as shared knowledge bias, superficial feature prioritization, self-enhancement bias, and verbosity bias. She proposes a five-step framework for calibrating LLM judges, emphasizing subject matter experts, concise metrics, dataset freezing, blind judging, and iterative refinement. The talk also touches on advanced techniques such as Chain of Thought prompting and pairwise comparisons.
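For illustration only, here is a minimal Python sketch of what blind, pairwise LLM judging with Chain of Thought-style reasoning might look like in practice. The prompt wording, the `call_judge` callable, and the verdict parsing are assumptions for this sketch, not code from the talk.

```python
import random
from typing import Callable

# Assumed prompt template: asks the judge to reason first, then emit a fixed-format verdict.
JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}

Answer A:
{answer_a}

Answer B:
{answer_b}

Judge only factual accuracy and relevance; ignore length, style, and formatting.
Think step by step, then end with exactly one line: "Winner: A" or "Winner: B".
"""

def pairwise_blind_judge(
    question: str,
    answer_1: str,
    answer_2: str,
    call_judge: Callable[[str], str],  # your LLM client, wrapped as prompt -> response text
) -> str:
    """Return 'answer_1' or 'answer_2' using an anonymized, order-randomized comparison."""
    # Randomize which answer sits in slot A so a positional preference cannot leak in,
    # and never reveal which system produced which answer (blind judging).
    flipped = random.random() < 0.5
    a, b = (answer_2, answer_1) if flipped else (answer_1, answer_2)

    verdict = call_judge(JUDGE_PROMPT.format(question=question, answer_a=a, answer_b=b))

    # Naive parsing of the final "Winner: A/B" line; a production judge needs stricter checks.
    judge_picked_a = verdict.strip().splitlines()[-1].strip().endswith("A")

    # Map the judge's A/B verdict back to the original, unflipped answers.
    if judge_picked_a:
        return "answer_2" if flipped else "answer_1"
    return "answer_1" if flipped else "answer_2"
```

A common extension of this pattern is to run each pair twice with the order swapped and keep only consistent verdicts, which gives a simple signal of how position-biased the judge is before trusting its scores.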