MMLU-Pro
An extended version of MMLU that eliminates trivial and noisy questions, expands answer choices from 4 to 10, and enriches the dataset with reasoning-focused questions (not just knowledge recall), dropping model accuracy by 16-33% and restoring the benchmark's discriminative power.
The dataset extends MMLU by: (1) filtering out trivial and noisy questions and consolidating the set with questions from external sources; (2) expanding the options from 4 to 10 per question; (3) adding questions that require multi-step reasoning. Models are evaluated both with direct (zero-shot) answering and with chain-of-thought (CoT) prompting; results show that CoT is more effective on MMLU-Pro than on the original MMLU.
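A minimal sketch of this evaluation recipe, assuming the official TIGER-Lab/MMLU-Pro dataset on Hugging Face and its field names (`question`, `options`, `answer`); `ask_model` is a hypothetical callable wrapping whichever LLM is being evaluated:

```python
# Minimal sketch of a 10-option MMLU-Pro evaluation loop.
# Assumes the "TIGER-Lab/MMLU-Pro" Hugging Face dataset and the field
# names question / options / answer; `ask_model` is a hypothetical
# stand-in for whatever LLM API you actually call.
import re
import string
from datasets import load_dataset

def build_prompt(example: dict) -> str:
    """Format a question with up to 10 lettered options plus a CoT cue."""
    lines = [example["question"]]
    for letter, option in zip(string.ascii_uppercase, example["options"]):
        lines.append(f"{letter}. {option}")
    lines.append('Let\'s think step by step, then finish with "The answer is (X)".')
    return "\n".join(lines)

def extract_choice(completion: str) -> str | None:
    """Pull the final lettered answer (A-J) out of a CoT completion."""
    match = re.search(r"answer is \(?([A-J])\)?", completion)
    return match.group(1) if match else None

def evaluate(ask_model, split: str = "test", limit: int = 100) -> float:
    ds = load_dataset("TIGER-Lab/MMLU-Pro", split=split).select(range(limit))
    correct = 0
    for ex in ds:
        pred = extract_choice(ask_model(build_prompt(ex)))
        correct += pred == ex["answer"]  # "answer" holds the gold letter, e.g. "B"
    return correct / len(ds)
```

The `"answer is (X)"` regex mirrors the common practice of asking the model to end its chain of thought with an explicit letter choice; any robust extraction scheme would do.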
Saturation of the original MMLU by frontier models (scores above 85-90%) and its sensitivity to prompt variations, which made it nearly impossible to distinguish capabilities among top models.
Common pitfalls
10 choices increase token cost in few-shot prompting (severity: LOW)
A prompt with 10 answer choices is longer, which increases evaluation cost in few-shot settings with long worked examples.
Mitigation: use zero-shot CoT or a reduced few-shot setup (1-3 examples).
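A back-of-the-envelope illustration of how the prompt grows with shot count, using OpenAI's tiktoken tokenizer; the question, options, and worked example below are made-up placeholders, not items from the benchmark:

```python
# Rough token cost of few-shot prompting with 10 answer choices.
# Uses OpenAI's tiktoken tokenizer; texts and shot counts are
# illustrative assumptions, not measurements from the paper.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def prompt_tokens(question: str, options: list[str], shots: list[str]) -> int:
    body = "\n".join(shots + [question] + options)
    return len(enc.encode(body))

question = "Which of the following best explains ...?"
options = [f"{c}. Some plausible distractor text here." for c in "ABCDEFGHIJ"]
shot = "Q: ...\nA: Let's think step by step ... The answer is (C)."
for k in (0, 1, 3, 5):
    print(k, "shots:", prompt_tokens(question, options, [shot] * k), "tokens")
```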
Reference implementations
GENESIS · Source paper
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (published June 2024, NeurIPS 2024 Spotlight)
breakthrough · Wang et al. publish the enhanced MMLU with 10 answer choices and reasoning questions; model scores drop by 16-33%.
Text-based benchmark independent of evaluation hardware.
EXTENDS
MMLU
MMLU (Massive Multitask Language Understanding) is a benchmark proposed by Hendrycks et al. in 2021, covering 57 domains ranging from elementary mathematics, US history, and computer science to law, medicine, and ethics. The dataset contains over 14,000 multiple-choice questions drawn from academic and professional exams. MMLU became the de facto standard for measuring general knowledge and reasoning capabilities of LLMs; it revealed that early models barely exceeded random chance accuracy (25%), while 2023-2024 models achieve above 85-90%, leading to harder successors (MMLU-Pro, GPQA).
Commonly used with
MMLU
GPQA
GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a benchmark developed by Rein et al. (2023) containing 448 multiple-choice questions written by domain experts who hold or are pursuing PhDs in biology, physics, and chemistry. Questions are designed to be "Google-proof": highly skilled non-expert validators reached only 34% accuracy after 30 minutes of unrestricted web search, while domain experts reached 65% (74% after discounting clear retrospective mistakes). The strongest GPT-4 configuration achieved 39% in the original paper. GPQA is used as a measure of model capability for scalable oversight tasks, in settings where AI may surpass the skills of human supervisors.
References
| Title | Publisher | Type |
|---|---|---|
| MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark | arXiv | scientific article |
| MMLU-Pro dataset on Hugging Face | Hugging Face | repository |