Task suite
First crowdsourced LLM evaluation benchmark assembled by 450+ authors from 132 institutions, containing 204+ tasks designed to be beyond the capabilities of current models and to measure emergent abilities at scale.
1. Task crowdsourcing: researchers from 132 institutions contribute tasks in a standardized JSON format (multiple choice, generative, or programmatic). Each task includes a description, examples, metrics, and ground truth.
2. Validation: the central team verifies that the task is hard for current models (GPT-2 and GPT-3 baselines) and has clear evaluation criteria.
3. Distribution: tasks are published in a GitHub repository (Apache 2.0) together with a library for running benchmarks via model APIs.
4. Evaluation: the model is prompted with each task (zero-shot or few-shot); output is scored against ground truth using the task's metrics (accuracy, ROUGE, BLEU, exact match).
5. Aggregation: results are published on the BIG-Bench leaderboard, broken down by task category and analyzed for emergence as a function of parameter scale.
6. BBH (BIG-Bench Hard): a 23-task subset on which earlier models did not beat the average human rater and where CoT prompting clearly outperforms direct prompting; it has become the canonical reasoning suite.
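To make the task format and scoring step concrete, here is a minimal sketch (not the official bigbench library) of loading a generative JSON task and computing zero-shot exact-match accuracy. The field names ("description", "examples", "input", "target") follow the published JSON schema as commonly described, and query_model is a hypothetical placeholder for whatever model API is being evaluated; consult the repository for the authoritative schema.

```python
"""Minimal sketch of scoring one generative BIG-Bench-style JSON task.
`query_model` is a hypothetical placeholder for a model API call."""
import json


def query_model(prompt: str) -> str:
    # Placeholder: call a model API (OpenAI, Anthropic, HuggingFace, ...)
    # and return the generated text.
    raise NotImplementedError


def exact_match_accuracy(task_path: str) -> float:
    with open(task_path) as f:
        task = json.load(f)

    correct = 0
    for example in task["examples"]:
        prompt = task.get("description", "") + "\n\n" + example["input"]
        prediction = query_model(prompt).strip()
        # "target" may be a single string or a list of acceptable answers.
        targets = example["target"]
        if isinstance(targets, str):
            targets = [targets]
        correct += any(prediction == t.strip() for t in targets)
    return correct / len(task["examples"])
```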
Pre-2022 LLM benchmarks (GLUE, SuperGLUE) saturated quickly as models grew and did not test a broad spectrum of capabilities. The community lacked a benchmark sufficiently hard, diverse, and open to track progress across multiple model generations. BIG-Bench addressed this by crowdsourcing 200+ tasks specifically chosen to be hard, spanning domains from mathematics to theory of mind.
Main evaluation task collection
204+ tasks in a standardized JSON format, each with metadata (author, category, metrics, prompt template, ground truth).
Curated subset for reasoning evaluation
23 tasks where standard prompting underperforms humans; CoT prompting significantly improves performance. The canonical reasoning test (Suzgun et al. 2022).
Library for running evaluations
Python framework integrating with model APIs (OpenAI, Anthropic, HuggingFace), supporting multiple-choice, generative, and programmatic scoring (a multiple-choice scoring sketch follows this list).
Cost-efficient subset for fast evaluation
24 tasks optimized for low evaluation cost (small example counts) while preserving the diversity of full BIG-Bench.
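Many tasks in the collection are multiple choice: instead of a free-form target, each example maps answer options to scores. The sketch below is a simplified illustration (not the library's API) of scoring such tasks by picking the option the model finds most likely; choice_logprob is a hypothetical helper standing in for a model API's log-probability facility.

```python
"""Sketch of multiple-choice scoring for BIG-Bench-style examples.
`choice_logprob` is a hypothetical stand-in for a model API call returning
log P(continuation | prompt); "target_scores" maps options to 1 or 0."""


def choice_logprob(prompt: str, continuation: str) -> float:
    # Placeholder: score the continuation under the model given the prompt.
    raise NotImplementedError


def multiple_choice_accuracy(examples: list[dict]) -> float:
    correct = 0
    for ex in examples:
        options = ex["target_scores"]  # e.g. {"yes": 1, "no": 0}
        # Pick the option the model considers most likely as a continuation.
        picked = max(options, key=lambda opt: choice_logprob(ex["input"], opt))
        correct += options[picked] == max(options.values())
    return correct / len(examples)
```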
Fully parallel
Each task and each example in the benchmark is independent, so they can be evaluated in parallel on any number of devices. The bottleneck is model API rate limits, not the benchmark itself.
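Because examples are independent, API-based evaluation parallelizes with nothing more than a thread pool; the sketch below assumes the same hypothetical query_model placeholder as above, with max_workers tuned to the provider's rate limits.

```python
"""Sketch: embarrassingly parallel generation over independent examples."""
from concurrent.futures import ThreadPoolExecutor


def query_model(prompt: str) -> str:
    raise NotImplementedError  # hypothetical placeholder for a model API call


def generate_all(prompts: list[str], max_workers: int = 8) -> list[str]:
    # The benchmark imposes no ordering constraints between tasks or examples;
    # max_workers should be tuned to the provider's rate limits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(query_model, prompts))
```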
Evaluation subset
Choice: full (204+ tasks), BBH (23 reasoning tasks), Lite (24 cost-efficient tasks), or a custom category-based subset.
Prompting strategy
Direct prompting vs. Chain-of-Thought. BBH shows significant differences: CoT improves results by 10–30 pp on most tasks (see the prompt-construction sketch after this list).
Number of shots
Zero-shot, 1-shot, few-shot (3–8). Most BIG-Bench benchmarks use zero-shot or 3-shot as the standard.
Metric type
Per task: exact match, multiple choice accuracy, ROUGE, BLEU, BLEURT, programmatic check, custom.
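The sketch below shows how the prompting knobs above combine: direct vs. chain-of-thought, zero-shot vs. few-shot. It is an illustration of common practice rather than the benchmark's own code (BBH ships three fixed CoT exemplars per task); the helper name and example question are invented for illustration.

```python
"""Illustrative prompt construction: direct vs. CoT, zero-shot vs. few-shot."""


def build_prompt(question: str,
                 exemplars: list[tuple[str, str]] | None = None,
                 chain_of_thought: bool = False) -> str:
    parts = []
    # Few-shot demonstrations, if any; for few-shot CoT the answer string
    # already contains the worked reasoning chain.
    for q, a in exemplars or []:
        parts.append(f"Q: {q}\nA: {a}\n")
    parts.append(f"Q: {question}\nA:")
    if chain_of_thought and not exemplars:
        # Zero-shot CoT trigger phrase (Kojima et al., 2022).
        parts[-1] += " Let's think step by step."
    return "\n".join(parts)


# Direct zero-shot:
print(build_prompt("Is the bracket sequence ( ( ) ( ) ) balanced?"))
# Zero-shot CoT:
print(build_prompt("Is the bracket sequence ( ( ) ( ) ) balanced?",
                   chain_of_thought=True))
```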
The BIG-Bench repository has been publicly available on GitHub since 2022. Models trained after 2022 may have tasks in their pretraining corpus, artificially inflating scores.
Decontamination pipeline on the pretraining corpus (Brown et al.-style 13-gram match). Evaluate on fresh tasks (held-out, created after training).
Each task has its own metric (accuracy, ROUGE, BLEU, custom). An arithmetic mean is misleading: some tasks range 0–1, others 0–100.
Apply metric normalization (calibrated score, per-task z-score) or report per category/subset; see the sketch after this list.
GPT-5, Gemini 3, Claude Opus 4 reach 95–98% average accuracy on BBH. The benchmark loses its ability to discriminate top-tier models.
Use BBH only as a sanity check; for differentiating frontier models use GPQA, MMLU-Pro, FrontierMath, ARC-AGI.
Schaeffer et al. (2023) showed that some "emergence jumps" on BIG-Bench arise from discrete metrics (accuracy); under continuous metrics (cross-entropy), behavior is smooth.
Report both accuracy and continuous metrics (negative log-likelihood). Be cautious when claiming phase transitions.
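Two of the mitigations above lend themselves to short sketches: a Brown et al.-style 13-gram contamination check and per-task z-score aggregation. Both are simplified illustrations under stated assumptions (a prebuilt set of pretraining-corpus 13-grams; scores for at least two models on the same task set), not production decontamination or leaderboard code.

```python
"""Sketches: (1) flag an example if any 13-gram of its text appears in a
prebuilt index of pretraining-corpus 13-grams; (2) aggregate heterogeneous
per-task scores via per-task z-scores so 0-1 and 0-100 scales mix fairly."""
from statistics import mean, stdev


def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(example_text: str, corpus_ngrams: set) -> bool:
    return not ngrams(example_text).isdisjoint(corpus_ngrams)


def zscore_aggregate(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores: {model: {task: raw_score}} -> {model: mean per-task z-score}.
    Assumes at least two models and an identical task set for each model."""
    models = list(scores)
    tasks = scores[models[0]]
    z = {m: [] for m in models}
    for task in tasks:
        col = [scores[m][task] for m in models]
        mu, sigma = mean(col), stdev(col) or 1.0
        for m in models:
            z[m].append((scores[m][task] - mu) / sigma)
    return {m: mean(v) for m, v in z.items()}
```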
Python · Google + BIG-bench collaboration
Python · Mirac Suzgun et al.
Python · EleutherAI
Python · Stanford CRFM
GENESIS · Source paper
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
BIG-bench collaboration kicks off
Google announces an open call to crowdsource LLM evaluation tasks. Goal: hard, diverse, open tasks.
BIG-Bench released (204 tasks, 442 authors)
Breakthrough · First public benchmark release as a GitHub repository. Evaluation of GPT-3, PaLM 540B, and several open-weight models.
BIG-Bench Hard (BBH): 23 reasoning tasks (Suzgun et al.)
Breakthrough · Curated subset of tasks where CoT prompting yields a clear gain over direct prompting. BBH becomes the canonical reasoning test.
GPT-4 achieves breakthrough results on BBH
Breakthrough · OpenAI reports that GPT-4 with CoT exceeds the human baseline on most BBH tasks, the first model generation to do so.
Schaeffer et al. β critique of emergence as a metric artifact
Stanford NLP shows that some emergence jumps on BIG-Bench disappear after switching metrics from accuracy to continuous ones (e.g. cross-entropy).
BBH starts to saturate: frontier models hit 90%+ accuracy
Claude 3.5, Gemini 1.5 Pro, GPT-4o reach 90%+ average accuracy on BBH; demand grows for harder benchmarks (GPQA, MMLU-Pro).
BBH as a reference test for reasoning models
Reasoning models (o1, DeepSeek-R1, Gemini 2.5 Deep Think, Claude Opus 4 Thinking) use BBH as a standard reference alongside GPQA and AIME.
BIG-Bench is an evaluation benchmark; it requires no specific hardware. It runs wherever the model runs (GPU, TPU, CPU, remote API).
MMLU (Massive Multitask Language Understanding) is a benchmark proposed by Hendrycks et al. in 2021, covering 57 domains ranging from elementary mathematics, US history, and computer science to law, medicine, and ethics. The dataset contains over 14,000 multiple-choice questions drawn from academic and professional exams. MMLU became the de facto standard for measuring general knowledge and reasoning capabilities of LLMs; it revealed that early models barely exceeded random chance accuracy (25%), while 2023-2024 models achieve above 85-90%, leading to harder successors (MMLU-Pro, GPQA).
HELM (Holistic Evaluation of Language Models) is an evaluation framework developed by the Stanford Center for Research on Foundation Models (CRFM) and published in 2022. Rather than ranking models by a single accuracy metric, HELM evaluates 30 prominent LLMs across 42 scenarios (16 core + 26 targeted) using 7 metrics: accuracy, calibration, robustness, fairness, bias, toxicity, and computational efficiency. The project serves as a "living benchmark," regularly extended with new models and scenarios. HELM publicly releases all raw model prompts and completions, establishing a transparency standard for AI evaluation.
GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a benchmark developed by Rein et al. (2023) containing 448 multiple-choice questions written by domain experts who hold or are pursuing PhDs in biology, physics, and chemistry. Questions are designed to be "Google-proof": highly skilled non-expert validators reached only 34% accuracy after 30 minutes of unrestricted web search, while domain experts reached 65% (74% discounting clear retrospective mistakes). The strongest GPT-4 configuration achieved 39% in the original paper. GPQA is used as a measure of model capabilities for scalable oversight tasks, where AI may surpass the skills of its human supervisors.
AGIEval is a benchmark developed by Zhong et al. (Microsoft Research, 2023) specifically designed to evaluate foundation models in the context of official human qualification exams: college entrance exams, law school admission tests, math competitions, and bar exams. The benchmark covers bilingual tasks (English and Chinese) and multimodal content (text + mathematical formulas). GPT-4 surpassed average human performance on SAT, LSAT, and math competitions, reaching 95% accuracy on SAT Math and 92.5% on the Chinese national college entrance English exam. At the same time, GPT-4 showed difficulty with tasks requiring complex reasoning or specialized domain knowledge.
MMLU-Pro is an enhanced benchmark introduced by Wang et al. (2024), developed in response to saturation of the original MMLU by modern models. Key changes from MMLU: (1) answer choices expanded from 4 to 10, reducing random guessing effectiveness; (2) trivial and noisy questions removed; (3) reasoning-focused multi-step questions added, where CoT outperforms direct answering (unlike original MMLU). MMLU-Pro causes a 16-33% drop in model accuracy compared to MMLU and reduces score sensitivity to prompt variations from 4-5% to 2%. Accepted at NeurIPS 2024 (Spotlight).
FrontierMath is a benchmark developed by Glazer et al. (2024) containing hundreds of original, exceptionally challenging mathematics problems created and vetted by expert mathematicians. Questions cover most major branches of modern mathematics, from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher; the hardest questions may take several days. FrontierMath uses new, unpublished problems and automated answer verification (checking computations via Python interpreter), minimizing the risk of data contamination. State-of-the-art AI models solve under 2% of problems.
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark designed by François Chollet (Google) and published in 2019 in the paper "On the Measure of Intelligence". The benchmark consists of visual tasks requiring the discovery of a transformation rule from a few input-output examples and its application to a new case. Tasks rely solely on "core knowledge priors": fundamental cognitive abilities shared by all humans (space, shape, number, motion). Humans achieve ~85% without training; early AI models barely exceeded 0-5%. ARC Prize (public competition 2024/2025) led to breakthroughs: in 2024 the first systems achieved ~55-85% on the private test set. The benchmark evaluates "skill-acquisition efficiency", Chollet's measure of intelligence, rather than accumulated knowledge.
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
Chain-of-Thought (CoT) Reasoning is a prompting technique introduced by Wei et al. (2022) in which a large language model is induced to generate a series of intermediate natural-language reasoning steps as part of its output, prior to producing a final answer. The technique was shown to significantly improve LLM performance on arithmetic, commonsense, and symbolic reasoning benchmarks where standard few-shot prompting yields flat or poor results. In the original formulation (few-shot CoT), a small number of exemplar question-answer pairs are included in the prompt, where each answer consists of a chain of thought followed by the final answer. The model learns from these demonstrations to produce its own reasoning chains. A subsequent zero-shot variant (Kojima et al., 2022) showed that appending the phrase 'Let's think step by step' to a question is sufficient to elicit reasoning chains from large models without any exemplars. CoT is an emergent property: empirical results in the originating paper show that reasoning ability via CoT prompting appears only in models above a certain parameter threshold (approximately 100B parameters for the models tested in 2022), with smaller models not benefiting or performing worse. This relationship has been revisited by subsequent work as smaller models have been fine-tuned on CoT data. Key extensions include Self-Consistency CoT (Wang et al., 2022), which samples multiple reasoning paths and selects the most frequent final answer; Tree of Thoughts (Yao et al., 2023), which frames reasoning as search over a tree of intermediate thoughts; and native reasoning models such as OpenAI o1 (2024) and DeepSeek-R1 (2025), which internalize extended reasoning through reinforcement learning on process reward signals rather than relying on prompting.
In-Context Learning (ICL) is the ability of large language models to perform a new task from a handful of examples (called demonstrations or shots) given directly in the prompt, without modifying model weights. The concept was formalized by Brown et al. (2020) in the GPT-3 paper "Language Models are Few-Shot Learners" as an emergent capability of models at ≥175B-parameter scale. In ICL, the prompt contains k (input, output) pairs demonstrating the task, followed by a new query input. Conditioned on these examples, the model produces output following the demonstration pattern. The number of examples k defines variants: zero-shot (k=0, natural-language task description only), one-shot (k=1), and few-shot (k=2–32, typically 4–8). Brown et al. showed that GPT-3 175B achieves competitive performance against fine-tuned models on many NLP tasks using few-shot prompting alone. The underlying mechanism of ICL remains an active research topic. Main hypotheses: (1) ICL implements implicit gradient descent in attention activation space (Akyürek et al. 2022, von Oswald et al. 2023); (2) models perform pattern matching over distributions of patterns seen during pretraining (Xie et al. 2022, Bayesian inference framework); (3) ICL relies on induction heads, attention structures forming during pretraining (Olsson et al. 2022, Anthropic). Empirically, demonstration quality, ordering, and even labels significantly affect performance (Min et al. 2022). ICL is the foundation of a broader family of prompt-engineering techniques: Chain-of-Thought (Wei et al. 2022) extends ICL with reasoning chains in demonstrations, instruction tuning (FLAN, T0) strengthens zero-shot ICL, and Retrieval-Augmented Generation dynamically selects demonstrations from a knowledge base. ICL became the dominant paradigm for using LLMs from 2022–2024, before being supplemented by instruction-tuned models requiring fewer or no examples.
Instruction Tuning (also called instruction fine-tuning or supervised fine-tuning, SFT) is a post-pretraining technique for language models. A pretrained model is fine-tuned on a curated dataset of examples, where each example consists of a natural language instruction describing a task, an optional input context, and the expected output. The training objective is standard supervised learning: cross-entropy loss over the target output tokens, with loss masked on the instruction/input portions. The key finding, established by Wei et al. (2021) in the FLAN paper, is that training on a sufficiently large and diverse set of instruction-formatted tasks improves zero-shot generalization to unseen task types. This generalization scales with the number of task clusters and the model size. Instruction Tuning is distinct from RLHF (Reinforcement Learning from Human Feedback): it uses only supervised learning on demonstration data, without a reward model or RL optimization. In practice, instruction tuning is often the first stage in a post-training pipeline, followed optionally by RLHF or direct preference optimization (DPO). Common dataset formats include the Alpaca three-field format (instruction, input, output) and the multi-turn conversation format used in chat models.
Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017) and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini. The standard RLHF pipeline for LLMs consists of three sequential stages:
1. Supervised Fine-Tuning (SFT): a pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.
2. Reward Model Training: human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically using a Bradley-Terry model as the preference objective: loss = -E[log σ(r_φ(x, y_w) - r_φ(x, y_l))], where y_w is the preferred and y_l the rejected response.
3. RL Fine-Tuning via PPO: the SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL-divergence penalty that prevents the policy from drifting too far from π_SFT: objective(x, y) = r_φ(x, y) - β · KL(π_θ(y|x) || π_SFT(y|x)).
The KL penalty with coefficient β is critical to prevent reward hacking. During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory. A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections, generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.
Emergent abilities of large language models is an observation, formalized by Wei et al. (2022), that certain LLM capabilities (such as multi-step reasoning, zero-shot instruction following, modular arithmetic, or answering questions in low-resource languages) do not appear gradually with scale but emerge discontinuously, only after crossing a threshold of parameter count, training data, or compute FLOPs. Below the threshold, performance is random or near-zero; above it, performance jumps abruptly to substantially better-than-random. The phenomenon has been documented across more than 130 tasks from the BIG-Bench benchmark and other suites (MMLU, TruthfulQA). Canonical examples include: Chain-of-Thought reasoning (~100B-parameter threshold for PaLM/GPT-3), InstructGPT-style instruction following, modular arithmetic, International Phonetic Alphabet transliteration, and multi-step question answering. In 2023, Schaeffer, Miranda, and Koyejo (NeurIPS 2023, "Are Emergent Abilities of Large Language Models a Mirage?") challenged emergence as a real fundamental phenomenon. They showed that non-linear or discontinuous evaluation metrics (e.g. exact-match accuracy) artificially create the appearance of a jump; replacing them with continuous metrics (token edit distance, log-likelihood) reveals a smooth, predictable scaling curve. This critique is now central to the debate: some abilities are emergent in a metric-dependent sense, while others (e.g. inductive reasoning) appear to show genuine phase discontinuities. The concept has critical practical significance: if emergence is real, certain abilities cannot be predicted or trained at smaller scale, forcing organizations to train large models "blindly." If emergence is a metric artifact, then scaling laws (Hoffmann et al., Chinchilla) are sufficient to predict the behavior of larger models.
A reasoning model (also: large reasoning model, LRM, reasoning language model, RLM) is a type of large language model that has been specifically post-trained to solve complex multi-step problems by explicitly generating intermediate reasoning steps before committing to a final response. Unlike standard LLMs that generate a direct response in a single forward pass, reasoning models allocate additional computation at inference time (a property known as test-time compute scaling) by producing a long internal chain of thought (CoT). The reasoning trace typically includes steps such as problem decomposition, hypothesis generation, self-verification, reflection, and correction of errors. The defining characteristics of reasoning models are: (1) post-training via large-scale reinforcement learning (RL) using reward signals based on final answer correctness (and sometimes intermediate step quality via process reward models); (2) the emergence of extended, often hidden, reasoning traces that precede the final answer; (3) a consistent empirical relationship between the length or computational budget allocated to the reasoning trace and final answer quality (test-time scaling law); (4) superior performance on verifiable tasks requiring multi-step logic, such as mathematics, competitive programming, and scientific reasoning. The term 'reasoning model' was introduced as a product category by OpenAI in September 2024 with the release of the o1-preview model. OpenAI described o1 as trained via a large-scale RL algorithm teaching the model to use chain of thought productively. The approach does not rely on explicit tree search algorithms; instead, implicit search emerges via RL-trained CoT generation. In January 2025, DeepSeek published the first detailed open technical description of this class of models in the DeepSeek-R1 paper (arXiv:2501.12948), demonstrating that reasoning capabilities can be incentivized via pure RL without supervised fine-tuning, using Group Relative Policy Optimization (GRPO) as the RL framework. Reasoning models typically employ the same base Transformer decoder architecture as standard LLMs, with the key difference residing entirely in the post-training pipeline: RL replaces or augments standard RLHF/SFT, and reward signals are grounded in verifiable outcomes. The resulting models generate substantially longer token sequences during inference (reasoning tokens), which are often hidden from end users but incur real compute costs. Performance consistently improves with both more training-time RL compute and more inference-time thinking budget.
Scaling Laws are empirical regularities discovered by Kaplan et al. (2020) at OpenAI, describing how the performance of language models changes predictably with model size (parameter count N), dataset size (D), and compute budget (C). Cross-entropy loss scales as power laws with each of these three variables across many orders of magnitude. The study showed that architectural configuration (depth, width) has minimal impact at fixed N and C, that larger models are significantly more sample-efficient, and that optimally efficient training requires very large models on a relatively modest amount of data with early stopping. Hoffmann et al. (Chinchilla, 2022) refined these laws, showing that earlier models (including GPT-3) were massively undertrained and that optimal N and D should scale equally.
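As a back-of-the-envelope illustration of the Chinchilla result (my own example, not from the BIG-Bench material): the widely cited rule of thumb is roughly 20 training tokens per parameter for compute-optimal training, with the constant being approximate.

```python
"""Illustration of the approximate Chinchilla rule of thumb
(Hoffmann et al., 2022): ~20 training tokens per model parameter."""


def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param


print(f"{chinchilla_optimal_tokens(70e9):.1e}")  # ~1.4e+12 tokens for a 70B model
```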
Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data (next tokens in text, masked words, future video frames, future robot states) without human labels. This unlocks virtually unlimited raw data (web crawls, code, books, YouTube video, robot telemetry). The result is a set of weights encoding "world knowledge": dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).
| Title | Publisher | Type |
|---|---|---|
| Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (Srivastava et al. 2022) | BIG-bench collaboration / arXiv | scientific article |
| Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them (Suzgun et al. 2022) | arXiv | scientific article |
| BIG-bench: official GitHub repository | Google | repository |
| Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al. 2023) | Stanford NLP / arXiv | scientific article |
| TMLR publication of BIG-Bench (peer-reviewed version) | Transactions on Machine Learning Research | scientific article |