
Massive Multitask Language Understanding

First benchmark spanning 57 academic and professional domains, revealing that language models fail broadly on tasks requiring wide factual world knowledge despite impressive narrow-task performance.

Category
Abstraction level
Operation level
LLM evaluation · general knowledge comparison · natural language understanding assessment · AI research

The benchmark consists of multiple-choice questions (four options each) grouped into 57 thematic tasks. Models are evaluated in a zero-shot or few-shot setting: given a question and its options, the model must select the answer (A/B/C/D). Results are reported as the percentage of correct answers per task and as a weighted average across all tasks.
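As a concrete illustration, here is a minimal scoring sketch in Python. It assumes predictions have already been collected as letter choices; the record layout and field names are illustrative, not part of the original benchmark code.

```python
from collections import defaultdict

def score_mmlu(records):
    """Compute per-task accuracy and an overall average weighted by task size.

    `records` is assumed to be an iterable of dicts with illustrative keys:
    {"task": "high_school_physics", "prediction": "B", "answer": "B"}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        correct[r["task"]] += int(r["prediction"] == r["answer"])

    per_task = {t: correct[t] / total[t] for t in total}
    # The weighted average is a micro-average over all questions, so tasks
    # with more questions contribute proportionally more to the final score.
    weighted = sum(correct.values()) / sum(total.values())
    return per_task, weighted

# Example usage with toy records:
records = [
    {"task": "astronomy", "prediction": "A", "answer": "A"},
    {"task": "astronomy", "prediction": "C", "answer": "B"},
    {"task": "law", "prediction": "D", "answer": "D"},
]
per_task, avg = score_mmlu(records)  # {'astronomy': 0.5, 'law': 1.0}, avg ≈ 0.667
```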

The absence of a comprehensive benchmark spanning a wide spectrum of academic and professional domains, which made it impossible to reliably compare general world knowledge and problem-solving ability across large language models.

Common pitfalls

Benchmark saturation
HIGH

Modern models exceed 85–90% on MMLU, so the benchmark can no longer differentiate top-tier models.

Use MMLU-Pro or GPQA for more challenging evaluations.

Training data contamination
HIGH

MMLU questions may have appeared in model training data, inflating scores.

Cross-reference with benchmarks using new, unpublished questions (e.g. FrontierMath).
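A common, if imperfect, complementary mitigation is to check for verbatim overlap between benchmark questions and the training corpus. A minimal n-gram overlap sketch follows; the 13-gram threshold is a widely used heuristic rather than a fixed standard, and the helper names are illustrative.

```python
def ngrams(text: str, n: int = 13) -> set[str]:
    """Return the set of word-level n-grams in a lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, training_docs, n: int = 13) -> bool:
    """Flag a benchmark question if any of its n-grams appears verbatim
    in a training document. Near-duplicate matching (e.g. after stripping
    punctuation) would catch more cases but is omitted for brevity."""
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(doc, n) for doc in training_docs)
```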

GENESIS · Source paper

Measuring Massive Multitask Language Understanding
2021 · ICLR 2021 · Dan Hendrycks, Collin Burns, Steven Basart et al.
2021

MMLU published (ICLR 2021)

breakthrough

Hendrycks et al. introduce the 57-task benchmark; the largest GPT-3 model scores roughly 44%, well above chance (25%) but far below expert level, while smaller models remain near random guessing.

2022

GPT-3.5 and PaLM exceed 70%

Large models begin to clearly exceed average (non-expert) human performance in some categories.

2023

GPT-4 reaches ~86%, MMLU loses discriminative power

breakthrough

Benchmark saturation leads to creation of MMLU-Pro and GPQA as successors.

Hardware agnostic · PRIMARY

The benchmark is hardware-agnostic – it evaluates model outputs on text questions without GPU/TPU requirements on the evaluation side.

Commonly used with

BIG-Bench

BIG-Bench (Beyond the Imitation Game Benchmark) is a large-scale evaluation benchmark for large language models, published in 2022 by Srivastava et al. (the BIG-bench collaboration: 450+ authors from 132 institutions). The suite contains 204 tasks (214+ with later additions) covering a broad spectrum of capabilities: logical reasoning, mathematics, programming, common sense, linguistics, ethics, medicine, biology, mythology, wordplay, ASCII tasks, planning, theory of mind, and many more. Each task was contributed by the community as a hard task, one that contemporary models could not solve. The benchmark is released under Apache 2.0 as a GitHub repository with a standardized task format (JSON) and an evaluation harness (multiple choice, generative scoring, programmatic).

The BIG-Bench Hard (BBH) subset – 23 tasks on which models scored worse than humans – became the canonical reasoning test for GPT-4, Claude, Gemini, Llama 3, and later models. Srivastava et al. also formalized definitions of emergent abilities: capabilities whose accuracy grows non-linearly with scale (a "phase transition"), and BIG-Bench was central to the 2022–2023 emergence debate (Wei et al.; Schaeffer et al. as a critical response). The benchmark influenced later projects such as HELM (Stanford CRFM), AGIEval, GPQA, and the robotics-focused Open-X-Embodiment. In 2025, BIG-Bench Hard remains in use as a reference reasoning suite despite rapid saturation – frontier LLMs (GPT-5, Gemini 3, Claude Opus 4) reach 95–98% accuracy on BBH.
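To illustrate the standardized JSON task format, here is a minimal loader/scorer sketch. The field names (`examples`, `input`, `target_scores`) follow the common multiple-choice layout in the BIG-Bench repository, but treat the exact schema as an assumption and consult the repository before relying on it.

```python
import json

def load_task(path: str) -> dict:
    """Load one BIG-Bench-style JSON task file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def score_multiple_choice(task: dict, predict) -> float:
    """Score a multiple-choice task.

    `predict` is any callable taking (input_text, candidate_answers) and
    returning the chosen candidate string. An example counts as correct if
    the chosen candidate has the highest target score.
    """
    correct = 0
    examples = task["examples"]  # assumed field name
    for ex in examples:
        scores = ex["target_scores"]  # assumed field name
        choices = list(scores)
        best = max(choices, key=lambda c: scores[c])
        if predict(ex["input"], choices) == best:
            correct += 1
    return correct / len(examples)
```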
