MMLU-Pro
An extended version of MMLU that eliminates trivial and noisy questions, expands answer choices from 4 to 10, and enriches the dataset with reasoning-focused questions (not just knowledge recall), dropping model accuracy by 16-33% and restoring the benchmark's discriminative power.
The dataset extends MMLU by: (1) filtering out trivial and noisy questions and consolidating the set with questions from external sources; (2) expanding the options from 4 to 10 per question; (3) adding questions that require multi-step reasoning. Models are evaluated both with direct (zero-shot) answering and with chain-of-thought (CoT) prompting; results show that CoT is more effective on MMLU-Pro than on the original MMLU.
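A minimal sketch of this evaluation recipe, assuming the official TIGER-Lab/MMLU-Pro dataset on Hugging Face and its field names (`question`, `options`, `answer`); `ask_model` is a hypothetical callable wrapping whichever LLM is being evaluated:

```python
# Minimal sketch of a 10-option MMLU-Pro evaluation loop.
# Assumes the "TIGER-Lab/MMLU-Pro" Hugging Face dataset and the field
# names question / options / answer; `ask_model` is a hypothetical
# stand-in for whatever LLM API you actually call.
import re
import string
from datasets import load_dataset

def build_prompt(example: dict) -> str:
    """Format a question with up to 10 lettered options plus a CoT cue."""
    lines = [example["question"]]
    for letter, option in zip(string.ascii_uppercase, example["options"]):
        lines.append(f"{letter}. {option}")
    lines.append('Let\'s think step by step, then finish with "The answer is (X)".')
    return "\n".join(lines)

def extract_choice(completion: str) -> str | None:
    """Pull the final lettered answer (A-J) out of a CoT completion."""
    match = re.search(r"answer is \(?([A-J])\)?", completion)
    return match.group(1) if match else None

def evaluate(ask_model, split: str = "test", limit: int = 100) -> float:
    ds = load_dataset("TIGER-Lab/MMLU-Pro", split=split).select(range(limit))
    correct = 0
    for ex in ds:
        pred = extract_choice(ask_model(build_prompt(ex)))
        correct += pred == ex["answer"]  # "answer" holds the gold letter, e.g. "B"
    return correct / len(ds)
```

The `"answer is (X)"` regex mirrors the common practice of asking the model to end its chain of thought with an explicit letter choice; any robust extraction scheme would do.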
Saturation of the original MMLU by frontier models (scores above 85-90%) and its sensitivity to prompt variations, which made it nearly impossible to distinguish capabilities among top models.
Common pitfalls
10 choices increase token cost in few-shot prompting (severity: LOW)
A prompt with 10 answer choices is longer, which increases evaluation cost in few-shot settings with long worked examples.
Mitigation: use zero-shot CoT or a reduced few-shot setup (1-3 examples).
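A back-of-the-envelope illustration of how the prompt grows with shot count, using OpenAI's tiktoken tokenizer; the question, options, and worked example below are made-up placeholders, not items from the benchmark:

```python
# Rough token cost of few-shot prompting with 10 answer choices.
# Uses OpenAI's tiktoken tokenizer; texts and shot counts are
# illustrative assumptions, not measurements from the paper.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def prompt_tokens(question: str, options: list[str], shots: list[str]) -> int:
    body = "\n".join(shots + [question] + options)
    return len(enc.encode(body))

question = "Which of the following best explains ...?"
options = [f"{c}. Some plausible distractor text here." for c in "ABCDEFGHIJ"]
shot = "Q: ...\nA: Let's think step by step ... The answer is (C)."
for k in (0, 1, 3, 5):
    print(k, "shots:", prompt_tokens(question, options, [shot] * k), "tokens")
```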
Reference implementations
GENESIS · Source paper
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (published June 2024, NeurIPS 2024 Spotlight)
breakthrough · Wang et al. publish the enhanced MMLU with 10 answer choices and reasoning questions; model scores drop by 16-33%.
Text-based benchmark independent of evaluation hardware.
EXTENDS
MMLU
MMLU (Massive Multitask Language Understanding) is a benchmark proposed by Hendrycks et al. in 2021, covering 57 domains ranging from elementary mathematics, US history, and computer science to law, medicine, and ethics. The dataset contains over 14,000 multiple-choice questions drawn from academic and professional exams. MMLU became the de facto standard for measuring general knowledge and reasoning capabilities of LLMs; it revealed that early models barely exceeded random chance accuracy (25%), while 2023-2024 models achieve above 85-90%, leading to harder successors (MMLU-Pro, GPQA).
Commonly used with
MMLU
GPQA
GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a benchmark developed by Rein et al. (2023) containing 448 multiple-choice questions written by domain experts who hold or are pursuing PhDs in biology, physics, and chemistry. Questions are designed to be "Google-proof": highly skilled non-expert validators reached only 34% accuracy after 30 minutes of unrestricted web search, while domain experts reached 65% (74% after discounting clear retrospective mistakes). The strongest GPT-4 configuration achieved 39% in the original paper. GPQA is used as a measure of model capability for scalable oversight tasks, in settings where AI may surpass the skills of human supervisors.
References
| Title | Publisher | Type |
|---|---|---|
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | arXiv | scientific article |
| MMLU-Pro dataset on Hugging Face | Hugging Face | repository |