
Demystifying LLM Benchmarks -- How To Compare Large Language Models?

At the recent launch of its Large Language Model (LLM) Gemini, Google claimed that "Gemini beats human-level performance on MMLU," but you might be wondering:

  • What are the different evaluation criteria: Massive Multitask Language Understanding (MMLU), General Language Understanding Evaluation (GLUE), HumanEval?
  • How can we know which model is good at coding tasks?
  • What is considered "human-level performance"?
  • How do they actually get the results for these benchmarks? (A rough sketch follows this list.)
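
As a rough picture of what "getting the results" means for a multiple-choice benchmark like MMLU: the model is prompted with each question and its answer options, the letter it picks is compared against the answer key, and the reported score is the fraction it gets right. The sketch below is a minimal, hypothetical illustration; the questions and the model_answer stub are made up for the example and are not real MMLU data or any official evaluation harness.

```python
# Illustrative sketch of multiple-choice benchmark scoring (e.g., MMLU-style).
# The questions and model_answer() stub are hypothetical placeholders.

questions = [
    {"question": "Which gas makes up most of Earth's atmosphere?",
     "choices": ["A) Oxygen", "B) Nitrogen", "C) Carbon dioxide", "D) Argon"],
     "answer": "B"},
    {"question": "What is the derivative of x^2?",
     "choices": ["A) x", "B) 2x", "C) x^2", "D) 2"],
     "answer": "B"},
]

def model_answer(question: str, choices: list[str]) -> str:
    """Placeholder for a call to the model under evaluation.

    In practice this would prompt the LLM with the question and its
    choices, then parse out the letter the model selects.
    """
    return "B"  # stubbed so the sketch runs end to end

# Accuracy = fraction of questions where the model's letter matches the key.
correct = sum(
    model_answer(q["question"], q["choices"]) == q["answer"] for q in questions
)
accuracy = correct / len(questions)
print(f"Benchmark accuracy: {accuracy:.1%}")
```

Coding benchmarks like HumanEval work differently: instead of matching a letter, the model's generated code is run against hidden unit tests, and the score reflects how often the code passes.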

In this talk, Sai breaks down the benchmarks you need to know so you can better judge papers that claim to beat them.

The next paper reading session is on January 8 at 5pm: https://www.jointaro.com/event/paper-reading-group-4-mistral-7b/

Your host:

Sai Shreyas Bhavanasi has worked in computer vision and reinforcement learning and published two first-author papers. He has also held roles as a Machine Learning Engineer and a Data Analyst.