
Massive Multitask Language Understanding

First benchmark spanning 57 academic and professional domains, revealing that language models fail broadly on tasks requiring wide factual world knowledge despite impressive narrow-task performance.

Category
Abstraction level
Operation level
LLM evaluation · general knowledge comparison · natural language understanding assessment · AI research

The benchmark consists of multiple-choice questions (four options each) grouped into 57 thematic tasks. Models are evaluated in a zero-shot or few-shot setting: given a question and its options, the model must select the answer (A/B/C/D). Results are reported as the percentage of correct answers per task and as a weighted average across all tasks.
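As a concrete illustration, here is a minimal scoring sketch in Python. It assumes predictions have already been collected as letter choices; the record layout and field names are illustrative, not part of the original benchmark code.

```python
from collections import defaultdict

def score_mmlu(records):
    """Compute per-task accuracy and an overall average weighted by task size.

    `records` is assumed to be an iterable of dicts with illustrative keys:
    {"task": "high_school_physics", "prediction": "B", "answer": "B"}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        correct[r["task"]] += int(r["prediction"] == r["answer"])

    per_task = {t: correct[t] / total[t] for t in total}
    # The weighted average is a micro-average over all questions, so tasks
    # with more questions contribute proportionally more to the final score.
    weighted = sum(correct.values()) / sum(total.values())
    return per_task, weighted

# Example usage with toy records:
records = [
    {"task": "astronomy", "prediction": "A", "answer": "A"},
    {"task": "astronomy", "prediction": "C", "answer": "B"},
    {"task": "law", "prediction": "D", "answer": "D"},
]
per_task, avg = score_mmlu(records)  # {'astronomy': 0.5, 'law': 1.0}, avg ≈ 0.667
```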

The absence of a comprehensive benchmark spanning a wide spectrum of academic and professional domains, which made it impossible to reliably compare general world knowledge and problem-solving ability across large language models.

Common pitfalls

Benchmark saturation
HIGH

Modern models exceed 85–90% on MMLU, so the benchmark can no longer differentiate top-tier models.

Use MMLU-Pro or GPQA for more challenging evaluations.

Training data contamination
HIGH

MMLU questions may have appeared in model training data, inflating scores.

Cross-reference with benchmarks using new, unpublished questions (e.g. FrontierMath).
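A common, if imperfect, complementary mitigation is to check for verbatim overlap between benchmark questions and the training corpus. A minimal n-gram overlap sketch follows; the 13-gram threshold is a widely used heuristic rather than a fixed standard, and the helper names are illustrative.

```python
def ngrams(text: str, n: int = 13) -> set[str]:
    """Return the set of word-level n-grams in a lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, training_docs, n: int = 13) -> bool:
    """Flag a benchmark question if any of its n-grams appears verbatim
    in a training document. Near-duplicate matching (e.g. after stripping
    punctuation) would catch more cases but is omitted for brevity."""
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(doc, n) for doc in training_docs)
```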

GENESIS · Source paper

Measuring Massive Multitask Language Understanding
2021 · ICLR 2021 · Dan Hendrycks, Collin Burns, Steven Basart et al.
2021

MMLU published (ICLR 2021)

breakthrough

Hendrycks et al. introduce the 57-task benchmark; the largest GPT-3 model scores roughly 44%, well above chance (25%) but far below expert level, while smaller models remain near random guessing.

2022

GPT-3.5 and PaLM exceed 70%

Large models begin to clearly exceed average (non-expert) human performance in some categories.

2023

GPT-4 reaches ~86%, MMLU loses discriminative power

breakthrough

Benchmark saturation leads to creation of MMLU-Pro and GPQA as successors.

Hardware agnostic · PRIMARY

The benchmark is hardware-agnostic – it evaluates model outputs on text questions without GPU/TPU requirements on the evaluation side.

Commonly used with

BIG-Bench

BIG-Bench (Beyond the Imitation Game Benchmark) is a large-scale evaluation benchmark for large language models, published in 2022 by Srivastava et al. (the BIG-bench collaboration: 450+ authors from 132 institutions). The suite contains 204 tasks (214+ with later additions) covering a broad spectrum of capabilities: logical reasoning, mathematics, programming, common sense, linguistics, ethics, medicine, biology, mythology, wordplay, ASCII tasks, planning, theory of mind, and many more. Each task was contributed by the community as a hard task, one that contemporary models could not solve. The benchmark is released under Apache 2.0 as a GitHub repository with a standardized task format (JSON) and an evaluation harness (multiple choice, generative scoring, programmatic).

The BIG-Bench Hard (BBH) subset – 23 tasks on which models scored worse than humans – became the canonical reasoning test for GPT-4, Claude, Gemini, Llama 3, and later models. Srivastava et al. also formalized definitions of emergent abilities: capabilities whose accuracy grows non-linearly with scale (a "phase transition"), and BIG-Bench was central to the 2022–2023 emergence debate (Wei et al.; Schaeffer et al. as a critical response). The benchmark influenced later projects such as HELM (Stanford CRFM), AGIEval, GPQA, and the robotics-focused Open-X-Embodiment. In 2025, BIG-Bench Hard remains in use as a reference reasoning suite despite rapid saturation – frontier LLMs (GPT-5, Gemini 3, Claude Opus 4) reach 95–98% accuracy on BBH.
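To illustrate the standardized JSON task format, here is a minimal loader/scorer sketch. The field names (`examples`, `input`, `target_scores`) follow the common multiple-choice layout in the BIG-Bench repository, but treat the exact schema as an assumption and consult the repository before relying on it.

```python
import json

def load_task(path: str) -> dict:
    """Load one BIG-Bench-style JSON task file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def score_multiple_choice(task: dict, predict) -> float:
    """Score a multiple-choice task.

    `predict` is any callable taking (input_text, candidate_answers) and
    returning the chosen candidate string. An example counts as correct if
    the chosen candidate has the highest target score.
    """
    correct = 0
    examples = task["examples"]  # assumed field name
    for ex in examples:
        scores = ex["target_scores"]  # assumed field name
        choices = list(scores)
        best = max(choices, key=lambda c: scores[c])
        if predict(ex["input"], choices) == best:
            correct += 1
    return correct / len(examples)
```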
