Evaluating LLM-Judge Evaluations: Best Practices

December 09, 2024 · 30 min · Free

Description

In this talk, Aishwarya Naresh Reganti, Applied Scientist at Amazon, discusses the challenges of evaluating LLM-based judges and presents best practices for mitigating the "AI Collusion Problem." The presentation covers why traditional metrics struggle with generative AI, introduces LLM judges as a solution, and explores common biases such as shared knowledge bias, superficial feature prioritization, self-enhancement bias, and verbosity bias. She proposes a five-step framework for calibrating LLM judges that emphasizes subject matter experts, concise metrics, dataset freezing, blind judging, and iterative refinement. The talk also touches on advanced techniques such as Chain of Thought prompting and pairwise comparisons.
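
As a rough illustration of the pairwise-comparison idea mentioned above, the sketch below asks a judge model to compare two candidate answers in both presentation orders and keeps only verdicts that agree across orders, a common way to dampen position and verbosity effects. This is not the speaker's implementation: it assumes the OpenAI Python SDK, and the model name, prompt wording, and function names (`judge_once`, `pairwise_judge`) are illustrative.

```python
# Minimal pairwise LLM-judge sketch (assumes the OpenAI Python SDK is installed
# and OPENAI_API_KEY is set; model name and prompt text are placeholders).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are an impartial judge. Compare the two responses to the question "
    "and reply with exactly 'A', 'B', or 'TIE'.\n\n"
    "Question: {question}\n\nResponse A:\n{a}\n\nResponse B:\n{b}"
)

def judge_once(question: str, a: str, b: str, model: str = "gpt-4o-mini") -> str:
    """Ask the judge model for a single A/B/TIE verdict."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, a=a, b=b)}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip().upper()

def pairwise_judge(question: str, ans1: str, ans2: str) -> str:
    """Judge in both orders to reduce position bias; accept only consistent verdicts."""
    first = judge_once(question, ans1, ans2)    # ans1 presented as A
    second = judge_once(question, ans2, ans1)   # ans1 presented as B
    if first == "A" and second == "B":
        return "ans1"
    if first == "B" and second == "A":
        return "ans2"
    return "TIE"  # inconsistent or tied verdicts are treated as a tie
```

In practice, the same order-swapping check can be folded into the calibration loop the talk describes: run the judge over a frozen dataset, compare its verdicts against subject matter expert labels, and iterate on the prompt and metrics until agreement is acceptable.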