Massive Multitask Language Understanding
The first benchmark to span 57 academic and professional domains, revealing that the language models of its time performed poorly across tasks requiring wide-ranging factual world knowledge, despite impressive narrow-task performance.
The benchmark consists of multiple-choice questions (4 options each) grouped into 57 thematic tasks. Models are evaluated in zero-shot or few-shot settings: given a question, they must select the answer (A/B/C/D). Results are reported as the percentage of correct answers per task and as a weighted average across tasks.
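A minimal scoring sketch of this protocol is shown below. The `model.choose(prompt, options)` interface is hypothetical (not part of MMLU itself); real harnesses such as lm-evaluation-harness typically compare the log-likelihoods of the answer letters instead, but the per-task accuracy and weighted-average bookkeeping is the same.

```python
from collections import defaultdict

CHOICES = ["A", "B", "C", "D"]

def format_question(q):
    """Render one item in the usual MMLU prompt layout: question, options, 'Answer:'."""
    lines = [q["question"]]
    lines += [f"{label}. {text}" for label, text in zip(CHOICES, q["options"])]
    lines.append("Answer:")
    return "\n".join(lines)

def evaluate(model, questions_by_task):
    """Return per-task accuracy and the question-weighted average accuracy."""
    correct, total = defaultdict(int), defaultdict(int)
    for task, questions in questions_by_task.items():
        for q in questions:
            # Hypothetical API: returns one of "A".."D" for the given prompt.
            prediction = model.choose(format_question(q), CHOICES)
            correct[task] += int(prediction == q["answer"])
            total[task] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    weighted_avg = sum(correct.values()) / sum(total.values())
    return per_task, weighted_avg
```

The weighted average is simply total correct over total questions, so larger tasks contribute more, which matches how MMLU results are usually aggregated.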
MMLU addresses the absence of a comprehensive benchmark spanning a wide spectrum of academic and professional domains; before it, there was no reliable way to compare general world knowledge and problem-solving ability across large language models.
Common pitfalls
Benchmark saturation (HIGH)
Modern models exceed 85-90% on MMLU, making it insufficient to differentiate top-tier models.
Use MMLU-Pro or GPQA for more challenging evaluations.
Training data contamination (HIGH)
MMLU questions may have appeared in model training data, inflating scores.
Cross-reference with benchmarks using new, unpublished questions (e.g. FrontierMath).
Reference implementations
GENESIS · Source paper
Measuring Massive Multitask Language Understanding: MMLU published (ICLR 2021)
Breakthrough: Hendrycks et al. introduce the 57-task benchmark; the largest GPT-3 scores only ~44%, with smaller models near random chance.
GPT-3.5 and PaLM exceed 70%
Large models begin clearly exceeding average (non-expert) human performance in some categories.
GPT-4 reaches ~86%, MMLU loses discriminative power
Breakthrough: Benchmark saturation leads to the creation of MMLU-Pro and GPQA as successors.
The benchmark is hardware-agnostic: it evaluates model outputs on text questions, with no GPU/TPU requirements on the evaluation side.
Commonly used with
BIG-Bench
BIG-Bench (Beyond the Imitation Game Benchmark) is a large-scale evaluation benchmark for large language models, published in 2022 by Srivastava et al. (the BIG-bench collaboration, 450+ authors from 132 institutions). The suite contains 204 tasks (214+ with later additions) covering a broad spectrum of capabilities: logical reasoning, mathematics, programming, common sense, linguistics, ethics, medicine, biology, mythology, wordplay, ASCII-art recognition, planning, theory of mind, and many more. Each task was contributed by the community as a deliberately hard task, one that contemporary models could not solve.

The benchmark is released under Apache 2.0 as a GitHub repository with a standardized JSON task format and an evaluation harness (multiple choice, generative scoring, programmatic tasks). The BIG-Bench Hard (BBH) subset, 23 tasks on which models scored worse than humans, became the canonical reasoning test for GPT-4, Claude, Gemini, Llama 3, and later models. Srivastava et al. also formalized definitions of emergent abilities: capabilities whose accuracy grows non-linearly with scale (a "phase transition"). BIG-Bench was central to the 2022-2023 emergence debate (Wei et al.; Schaeffer et al. as the critical response).

The benchmark influenced later projects such as HELM (Stanford CRFM), AGIEval, GPQA, and the robotics-focused Open-X-Embodiment. As of 2025, BIG-Bench Hard remains in use as a reference reasoning suite despite rapid saturation: frontier LLMs (GPT-5, Gemini 3, Claude Opus 4) reach 95-98% accuracy on BBH.
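To make the "standardized JSON task format" concrete, here is a sketch of a BIG-Bench-style task definition and a simple multiple-choice grader. The field names (`examples`, `input`, `target_scores`, `multiple_choice_grade`) approximate the repository's documented schema but should be checked against the BIG-Bench repo; the `pick_option` callback stands in for an actual model.

```python
import json

# A toy task in a BIG-Bench-like JSON layout (field names are approximate).
task = {
    "name": "toy_arithmetic",
    "description": "Pick the correct sum.",
    "keywords": ["arithmetic", "multiple choice"],
    "metrics": ["multiple_choice_grade"],
    "examples": [
        {"input": "What is 2 + 2?", "target_scores": {"3": 0, "4": 1, "5": 0}},
        {"input": "What is 7 + 5?", "target_scores": {"11": 0, "12": 1, "13": 0}},
    ],
}

def multiple_choice_grade(examples, pick_option):
    """Average target score of the option the model picks for each example."""
    scores = []
    for ex in examples:
        options = list(ex["target_scores"])
        choice = pick_option(ex["input"], options)   # model callback: question, options -> option
        scores.append(ex["target_scores"].get(choice, 0))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(json.dumps(task, indent=2))
    # A trivial "model" that always picks the first option scores 0.0 on this toy task.
    print(multiple_choice_grade(task["examples"], lambda question, options: options[0]))
```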
| Title | Publisher | Type |
|---|---|---|
| Measuring Massive Multitask Language Understanding | arXiv | scientific article |
| MMLU Repository | GitHub | repository |