Task suite
First crowdsourced LLM evaluation benchmark assembled by 450+ authors from 132 institutions, containing 204+ tasks designed to be beyond the capabilities of current models and to measure emergent abilities at scale.
1. Task crowdsourcing: researchers from 132 institutions contribute tasks in a standardized JSON format (multiple choice, generative, or programmatic). Each task includes a description, examples, metrics, and ground truth.
2. Validation: the central team verifies that the task is hard for current models (GPT-2 and GPT-3 baselines) and has clear evaluation criteria.
3. Distribution: tasks are published in a GitHub repository (Apache 2.0) together with a library for running benchmarks via model APIs.
4. Evaluation: the model is prompted with each task (zero-shot or few-shot); output is scored against ground truth using the task's metrics (accuracy, ROUGE, BLEU, exact match).
5. Aggregation: results are published on the BIG-Bench leaderboard, broken down by task category and analyzed for emergence as a function of parameter scale.
6. BBH (BIG-Bench Hard): a 23-task subset on which earlier models did not beat the average human rater and where CoT prompting clearly outperforms direct prompting; it has become the canonical reasoning suite.
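To make the task format and scoring step concrete, here is a minimal sketch (not the official bigbench library) of loading a generative JSON task and computing zero-shot exact-match accuracy. The field names ("description", "examples", "input", "target") follow the published JSON schema as commonly described, and query_model is a hypothetical placeholder for whatever model API is being evaluated; consult the repository for the authoritative schema.

```python
"""Minimal sketch of scoring one generative BIG-Bench-style JSON task.
`query_model` is a hypothetical placeholder for a model API call."""
import json


def query_model(prompt: str) -> str:
    # Placeholder: call a model API (OpenAI, Anthropic, HuggingFace, ...)
    # and return the generated text.
    raise NotImplementedError


def exact_match_accuracy(task_path: str) -> float:
    with open(task_path) as f:
        task = json.load(f)

    correct = 0
    for example in task["examples"]:
        prompt = task.get("description", "") + "\n\n" + example["input"]
        prediction = query_model(prompt).strip()
        # "target" may be a single string or a list of acceptable answers.
        targets = example["target"]
        if isinstance(targets, str):
            targets = [targets]
        correct += any(prediction == t.strip() for t in targets)
    return correct / len(task["examples"])
```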
Pre-2022 LLM benchmarks (GLUE, SuperGLUE) saturated quickly as models grew and did not test a broad spectrum of capabilities. The community lacked a benchmark sufficiently hard, diverse, and open to track progress across multiple model generations. BIG-Bench addressed this by crowdsourcing 200+ tasks specifically chosen to be hard, spanning domains from mathematics to theory of mind.
Main evaluation task collection
204+ tasks in a standardized JSON format, each with metadata (author, category, metrics, prompt template, ground truth).
Curated subset for reasoning evaluation
23 tasks where standard prompting underperforms humans; CoT prompting significantly improves performance. The canonical reasoning test (Suzgun et al. 2022).
Library for running evaluations
Python framework integrating with model APIs (OpenAI, Anthropic, HuggingFace), supporting multiple-choice, generative, and programmatic scoring (a multiple-choice scoring sketch follows this list).
Cost-efficient subset for fast evaluation
24 tasks optimized for low evaluation cost (small example counts) while preserving the diversity of full BIG-Bench.
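Many tasks in the collection are multiple choice: instead of a free-form target, each example maps answer options to scores. The sketch below is a simplified illustration (not the library's API) of scoring such tasks by picking the option the model finds most likely; choice_logprob is a hypothetical helper standing in for a model API's log-probability facility.

```python
"""Sketch of multiple-choice scoring for BIG-Bench-style examples.
`choice_logprob` is a hypothetical stand-in for a model API call returning
log P(continuation | prompt); "target_scores" maps options to 1 or 0."""


def choice_logprob(prompt: str, continuation: str) -> float:
    # Placeholder: score the continuation under the model given the prompt.
    raise NotImplementedError


def multiple_choice_accuracy(examples: list[dict]) -> float:
    correct = 0
    for ex in examples:
        options = ex["target_scores"]  # e.g. {"yes": 1, "no": 0}
        # Pick the option the model considers most likely as a continuation.
        picked = max(options, key=lambda opt: choice_logprob(ex["input"], opt))
        correct += options[picked] == max(options.values())
    return correct / len(examples)
```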
Fully parallel
Each task and each example in the benchmark is independent, so they can be evaluated in parallel on any number of devices. The bottleneck is model API rate limits, not the benchmark itself.
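Because examples are independent, API-based evaluation parallelizes with nothing more than a thread pool; the sketch below assumes the same hypothetical query_model placeholder as above, with max_workers tuned to the provider's rate limits.

```python
"""Sketch: embarrassingly parallel generation over independent examples."""
from concurrent.futures import ThreadPoolExecutor


def query_model(prompt: str) -> str:
    raise NotImplementedError  # hypothetical placeholder for a model API call


def generate_all(prompts: list[str], max_workers: int = 8) -> list[str]:
    # The benchmark imposes no ordering constraints between tasks or examples;
    # max_workers should be tuned to the provider's rate limits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(query_model, prompts))
```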
Evaluation subset
Choice: full (204+ tasks), BBH (23 reasoning tasks), Lite (24 cost-efficient tasks), or a custom category-based subset.
Prompting strategy
Direct prompting vs. Chain-of-Thought. BBH shows significant differences: CoT improves results by 10–30 pp on most tasks (see the prompt-construction sketch after this list).
Number of shots
Zero-shot, 1-shot, few-shot (3–8). Most BIG-Bench benchmarks use zero-shot or 3-shot as the standard.
Metric type
Per task: exact match, multiple choice accuracy, ROUGE, BLEU, BLEURT, programmatic check, custom.
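The sketch below shows how the prompting knobs above combine: direct vs. chain-of-thought, zero-shot vs. few-shot. It is an illustration of common practice rather than the benchmark's own code (BBH ships three fixed CoT exemplars per task); the helper name and example question are invented for illustration.

```python
"""Illustrative prompt construction: direct vs. CoT, zero-shot vs. few-shot."""


def build_prompt(question: str,
                 exemplars: list[tuple[str, str]] | None = None,
                 chain_of_thought: bool = False) -> str:
    parts = []
    # Few-shot demonstrations, if any; for few-shot CoT the answer string
    # already contains the worked reasoning chain.
    for q, a in exemplars or []:
        parts.append(f"Q: {q}\nA: {a}\n")
    parts.append(f"Q: {question}\nA:")
    if chain_of_thought and not exemplars:
        # Zero-shot CoT trigger phrase (Kojima et al., 2022).
        parts[-1] += " Let's think step by step."
    return "\n".join(parts)


# Direct zero-shot:
print(build_prompt("Is the bracket sequence ( ( ) ( ) ) balanced?"))
# Zero-shot CoT:
print(build_prompt("Is the bracket sequence ( ( ) ( ) ) balanced?",
                   chain_of_thought=True))
```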
The BIG-Bench repository has been publicly available on GitHub since 2022. Models trained after 2022 may have tasks in their pretraining corpus, artificially inflating scores.
Decontamination pipeline on the pretraining corpus (Brown et al.-style 13-gram match). Evaluate on fresh tasks (held-out, created after training).
Each task has its own metric (accuracy, ROUGE, BLEU, custom). An arithmetic mean is misleading: some tasks range 0–1, others 0–100.
Apply metric normalization (calibrated score, per-task z-score) or report per category/subset; see the sketch after this list.
GPT-5, Gemini 3, Claude Opus 4 reach 95–98% average accuracy on BBH. The benchmark loses its ability to discriminate top-tier models.
Use BBH only as a sanity check; for differentiating frontier models use GPQA, MMLU-Pro, FrontierMath, ARC-AGI.
Schaeffer et al. (2023) showed that some "emergence jumps" on BIG-Bench arise from discrete metrics (accuracy); under continuous metrics (cross-entropy), behavior is smooth.
Report both accuracy and continuous metrics (negative log-likelihood). Be cautious when claiming phase transitions.
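Two of the mitigations above lend themselves to short sketches: a Brown et al.-style 13-gram contamination check and per-task z-score aggregation. Both are simplified illustrations under stated assumptions (a prebuilt set of pretraining-corpus 13-grams; scores for at least two models on the same task set), not production decontamination or leaderboard code.

```python
"""Sketches: (1) flag an example if any 13-gram of its text appears in a
prebuilt index of pretraining-corpus 13-grams; (2) aggregate heterogeneous
per-task scores via per-task z-scores so 0-1 and 0-100 scales mix fairly."""
from statistics import mean, stdev


def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(example_text: str, corpus_ngrams: set) -> bool:
    return not ngrams(example_text).isdisjoint(corpus_ngrams)


def zscore_aggregate(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """scores: {model: {task: raw_score}} -> {model: mean per-task z-score}.
    Assumes at least two models and an identical task set for each model."""
    models = list(scores)
    tasks = scores[models[0]]
    z = {m: [] for m in models}
    for task in tasks:
        col = [scores[m][task] for m in models]
        mu, sigma = mean(col), stdev(col) or 1.0
        for m in models:
            z[m].append((scores[m][task] - mu) / sigma)
    return {m: mean(v) for m, v in z.items()}
```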
Python · Google + BIG-bench collaboration
Python · Mirac Suzgun et al.
Python · EleutherAI
Python · Stanford CRFM
GENESIS · Source paper
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
BIG-bench collaboration kicks off
Google announces an open call to crowdsource LLM evaluation tasks. Goal: hard, diverse, open tasks.
BIG-Bench released (204 tasks, 442 authors)
Breakthrough · First public benchmark release as a GitHub repository. Evaluation of GPT-3, PaLM 540B, and several open-weight models.
BIG-Bench Hard (BBH): 23 reasoning tasks (Suzgun et al.)
Breakthrough · Curated subset of tasks where CoT prompting yields a clear gain over direct prompting. BBH becomes the canonical reasoning test.
GPT-4 achieves breakthrough results on BBH
Breakthrough · OpenAI reports that GPT-4 with CoT exceeds the human baseline on most BBH tasks, the first model generation to do so.
Schaeffer et al. β critique of emergence as a metric artifact
Stanford NLP shows that some emergence jumps on BIG-Bench disappear after switching metrics from accuracy to continuous ones (e.g. cross-entropy).
BBH starts to saturate: frontier models hit 90%+ accuracy
Claude 3.5, Gemini 1.5 Pro, GPT-4o reach 90%+ average accuracy on BBH; demand grows for harder benchmarks (GPQA, MMLU-Pro).
BBH as a reference test for reasoning models
Reasoning models (o1, DeepSeek-R1, Gemini 2.5 Deep Think, Claude Opus 4 Thinking) use BBH as a standard reference alongside GPQA and AIME.
BIG-Bench is an evaluation benchmark; it requires no specific hardware. It runs wherever the model runs (GPU, TPU, CPU, remote API).
MMLU (Massive Multitask Language Understanding) is a benchmark proposed by Hendrycks et al. in 2021, covering 57 domains ranging from elementary mathematics, US history, and computer science to law, medicine, and ethics. The dataset contains over 14,000 multiple-choice questions drawn from academic and professional exams. MMLU became the de facto standard for measuring general knowledge and reasoning capabilities of LLMs; it revealed that early models barely exceeded random chance accuracy (25%), while 2023-2024 models achieve above 85-90%, leading to harder successors (MMLU-Pro, GPQA).
HELM (Holistic Evaluation of Language Models) is an evaluation framework developed by the Stanford Center for Research on Foundation Models (CRFM) and published in 2022. Rather than ranking models by a single accuracy metric, HELM evaluates 30 prominent LLMs across 42 scenarios (16 core + 26 targeted) using 7 metrics: accuracy, calibration, robustness, fairness, bias, toxicity, and computational efficiency. The project serves as a "living benchmark," regularly extended with new models and scenarios. HELM publicly releases all raw model prompts and completions, establishing a transparency standard for AI evaluation.
GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a benchmark developed by Rein et al. (2023) containing 448 multiple-choice questions written by domain experts who hold or are pursuing PhDs in biology, physics, and chemistry. Questions are designed to be "Google-proof": highly skilled non-expert validators reached only 34% accuracy after 30 minutes of unrestricted web search, while domain experts reached 65% (74% discounting clear retrospective mistakes). The strongest GPT-4 configuration achieved 39% in the original paper. GPQA is used as a measure of model capabilities for scalable oversight tasks, where AI may surpass the skills of its human supervisors.
AGIEval is a benchmark developed by Zhong et al. (Microsoft Research, 2023) specifically designed to evaluate foundation models in the context of official human qualification exams: college entrance exams, law school admission tests, math competitions, and bar exams. The benchmark covers bilingual tasks (English and Chinese) and multimodal content (text + mathematical formulas). GPT-4 surpassed average human performance on SAT, LSAT, and math competitions, reaching 95% accuracy on SAT Math and 92.5% on the Chinese national college entrance English exam. At the same time, GPT-4 showed difficulty with tasks requiring complex reasoning or specialized domain knowledge.
MMLU-Pro is an enhanced benchmark introduced by Wang et al. (2024), developed in response to saturation of the original MMLU by modern models. Key changes from MMLU: (1) answer choices expanded from 4 to 10, reducing random guessing effectiveness; (2) trivial and noisy questions removed; (3) reasoning-focused multi-step questions added, where CoT outperforms direct answering (unlike original MMLU). MMLU-Pro causes a 16-33% drop in model accuracy compared to MMLU and reduces score sensitivity to prompt variations from 4-5% to 2%. Accepted at NeurIPS 2024 (Spotlight).
FrontierMath is a benchmark developed by Glazer et al. (2024) containing hundreds of original, exceptionally challenging mathematics problems created and vetted by expert mathematicians. Questions cover most major branches of modern mathematics, from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher; the hardest questions may take several days. FrontierMath uses new, unpublished problems and automated answer verification (checking computations via Python interpreter), minimizing the risk of data contamination. State-of-the-art AI models solve under 2% of problems.
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark designed by François Chollet (Google) and published in 2019 in the paper "On the Measure of Intelligence". The benchmark consists of visual tasks requiring the discovery of a transformation rule from a few input-output examples and its application to a new case. Tasks rely solely on "core knowledge priors": fundamental cognitive abilities shared by all humans (space, shape, number, motion). Humans achieve ~85% without training; early AI models barely exceeded 0-5%. ARC Prize (public competition 2024/2025) led to breakthroughs: in 2024 the first systems achieved ~55-85% on the private test set. The benchmark evaluates "skill-acquisition efficiency", Chollet's measure of intelligence, rather than accumulated knowledge.
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language-cognitive tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
Chain-of-Thought (CoT) Reasoning is a prompting technique introduced by Wei et al. (2022) in which a large language model is induced to generate a series of intermediate natural-language reasoning steps as part of its output, prior to producing a final answer. The technique was shown to significantly improve LLM performance on arithmetic, commonsense, and symbolic reasoning benchmarks where standard few-shot prompting yields flat or poor results. In the original formulation (few-shot CoT), a small number of exemplar question-answer pairs are included in the prompt, where each answer consists of a chain of thought followed by the final answer. The model learns from these demonstrations to produce its own reasoning chains. A subsequent zero-shot variant (Kojima et al., 2022) showed that appending the phrase 'Let's think step by step' to a question is sufficient to elicit reasoning chains from large models without any exemplars. CoT is an emergent property: empirical results in the originating paper show that reasoning ability via CoT prompting appears only in models above a certain parameter threshold (approximately 100B parameters for the models tested in 2022), with smaller models not benefiting or performing worse. This relationship has been revisited by subsequent work as smaller models have been fine-tuned on CoT data. Key extensions include Self-Consistency CoT (Wang et al., 2022), which samples multiple reasoning paths and selects the most frequent final answer; Tree of Thoughts (Yao et al., 2023), which frames reasoning as search over a tree of intermediate thoughts; and native reasoning models such as OpenAI o1 (2024) and DeepSeek-R1 (2025), which internalize extended reasoning through reinforcement learning on process reward signals rather than relying on prompting.
In-Context Learning (ICL) is the ability of large language models to perform a new task from a handful of examples (called demonstrations or shots) given directly in the prompt, without modifying model weights. The concept was formalized by Brown et al. (2020) in the GPT-3 paper "Language Models are Few-Shot Learners" as an emergent capability of models at ≥175B-parameter scale. In ICL, the prompt contains k (input, output) pairs demonstrating the task, followed by a new query input. Conditioned on these examples, the model produces output following the demonstration pattern. The number of examples k defines variants: zero-shot (k=0, natural-language task description only), one-shot (k=1), and few-shot (k=2–32, typically 4–8). Brown et al. showed that GPT-3 175B achieves competitive performance against fine-tuned models on many NLP tasks using few-shot prompting alone. The underlying mechanism of ICL remains an active research topic. Main hypotheses: (1) ICL implements implicit gradient descent in attention activation space (Akyürek et al. 2022, von Oswald et al. 2023); (2) models perform pattern matching over distributions of patterns seen during pretraining (Xie et al. 2022, Bayesian inference framework); (3) ICL relies on induction heads, attention structures forming during pretraining (Olsson et al. 2022, Anthropic). Empirically, demonstration quality, ordering, and even labels significantly affect performance (Min et al. 2022). ICL is the foundation of a broader family of prompt-engineering techniques: Chain-of-Thought (Wei et al. 2022) extends ICL with reasoning chains in demonstrations, instruction tuning (FLAN, T0) strengthens zero-shot ICL, and Retrieval-Augmented Generation dynamically selects demonstrations from a knowledge base. ICL became the dominant paradigm for using LLMs from 2022–2024, before being supplemented by instruction-tuned models requiring fewer or no examples.
Instruction Tuning (also called instruction fine-tuning or supervised fine-tuning, SFT) is a post-pretraining technique for language models. A pretrained model is fine-tuned on a curated dataset of examples, where each example consists of a natural language instruction describing a task, an optional input context, and the expected output. The training objective is standard supervised learning: cross-entropy loss over the target output tokens, with loss masked on the instruction/input portions. The key finding, established by Wei et al. (2021) in the FLAN paper, is that training on a sufficiently large and diverse set of instruction-formatted tasks improves zero-shot generalization to unseen task types. This generalization scales with the number of task clusters and the model size. Instruction Tuning is distinct from RLHF (Reinforcement Learning from Human Feedback): it uses only supervised learning on demonstration data, without a reward model or RL optimization. In practice, instruction tuning is often the first stage in a post-training pipeline, followed optionally by RLHF or direct preference optimization (DPO). Common dataset formats include the Alpaca three-field format (instruction, input, output) and the multi-turn conversation format used in chat models.
Reinforcement Learning from Human Feedback (RLHF) is a multi-stage training pipeline used to align language models and other AI systems with human preferences and intent. The approach was formally introduced for deep RL in Christiano et al. (2017) and scaled to large language models in Ouyang et al. (2022) (InstructGPT), where it became the primary alignment technique for systems such as ChatGPT, Claude, and Gemini. The standard RLHF pipeline for LLMs consists of three sequential stages:
1. Supervised Fine-Tuning (SFT): a pretrained language model is fine-tuned on a curated dataset of high-quality (prompt, response) pairs produced by human annotators, yielding a base aligned policy π_SFT.
2. Reward Model Training: human annotators compare pairs of model responses to the same prompt and express preferences (which response is better). These pairwise comparisons are used to train a scalar reward model r_φ(x, y), typically using a Bradley-Terry model as the preference objective: loss = -E[log σ(r_φ(x, y_w) - r_φ(x, y_l))], where y_w is the preferred and y_l the rejected response.
3. RL Fine-Tuning via PPO: the SFT-initialized policy π_θ is optimized with Proximal Policy Optimization (PPO) to maximize the reward from r_φ, subject to a KL-divergence penalty that prevents the policy from drifting too far from π_SFT: objective(x, y) = r_φ(x, y) - β · KL(π_θ(y|x) || π_SFT(y|x)).
The KL penalty with coefficient β is critical to prevent reward hacking. During PPO training, four models are needed simultaneously: the active policy, a frozen reference policy (π_SFT), the reward model, and a value/critic network. This makes RLHF computationally expensive, requiring substantial GPU memory. A key limitation is reward hacking: since the reward model is a proxy for human preferences trained on finite data, the policy can find ways to exploit its imperfections, generating outputs that score highly on the reward model but are degenerate or low-quality. The KL penalty is the primary mitigation mechanism. Direct Preference Optimization (DPO, Rafailov et al., 2023) was proposed as a mathematically equivalent simplification of RLHF that eliminates the explicit reward model and RL training loop, replacing them with a single supervised loss directly on preference pairs.
Emergent abilities of large language models is an observation, formalized by Wei et al. (2022), that certain LLM capabilities (such as multi-step reasoning, zero-shot instruction following, modular arithmetic, or answering questions in low-resource languages) do not appear gradually with scale but emerge discontinuously, only after crossing a threshold of parameter count, training data, or compute FLOPs. Below the threshold, performance is random or near-zero; above it, performance jumps abruptly to substantially better-than-random. The phenomenon has been documented across more than 130 tasks from the BIG-Bench benchmark and other suites (MMLU, TruthfulQA). Canonical examples include: Chain-of-Thought reasoning (~100B-parameter threshold for PaLM/GPT-3), InstructGPT-style instruction following, modular arithmetic, International Phonetic Alphabet transliteration, and multi-step question answering. In 2023, Schaeffer, Miranda, and Koyejo (NeurIPS 2023, "Are Emergent Abilities of Large Language Models a Mirage?") challenged emergence as a real fundamental phenomenon. They showed that non-linear or discontinuous evaluation metrics (e.g. exact-match accuracy) artificially create the appearance of a jump; replacing them with continuous metrics (token edit distance, log-likelihood) reveals a smooth, predictable scaling curve. This critique is now central to the debate: some abilities are emergent in a metric-dependent sense, while others (e.g. inductive reasoning) appear to show genuine phase discontinuities. The concept has critical practical significance: if emergence is real, certain abilities cannot be predicted or trained at smaller scale, forcing organizations to train large models "blindly." If emergence is a metric artifact, then scaling laws (Hoffmann et al., Chinchilla) are sufficient to predict the behavior of larger models.
A reasoning model (also: large reasoning model, LRM, reasoning language model, RLM) is a type of large language model that has been specifically post-trained to solve complex multi-step problems by explicitly generating intermediate reasoning steps before committing to a final response. Unlike standard LLMs that generate a direct response in a single forward pass, reasoning models allocate additional computation at inference time (a property known as test-time compute scaling) by producing a long internal chain of thought (CoT). The reasoning trace typically includes steps such as problem decomposition, hypothesis generation, self-verification, reflection, and correction of errors. The defining characteristics of reasoning models are: (1) post-training via large-scale reinforcement learning (RL) using reward signals based on final answer correctness (and sometimes intermediate step quality via process reward models); (2) the emergence of extended, often hidden, reasoning traces that precede the final answer; (3) a consistent empirical relationship between the length or computational budget allocated to the reasoning trace and final answer quality (test-time scaling law); (4) superior performance on verifiable tasks requiring multi-step logic, such as mathematics, competitive programming, and scientific reasoning. The term 'reasoning model' was introduced as a product category by OpenAI in September 2024 with the release of the o1-preview model. OpenAI described o1 as trained via a large-scale RL algorithm teaching the model to use chain of thought productively. The approach does not rely on explicit tree search algorithms; instead, implicit search emerges via RL-trained CoT generation. In January 2025, DeepSeek published the first detailed open technical description of this class of models in the DeepSeek-R1 paper (arXiv:2501.12948), demonstrating that reasoning capabilities can be incentivized via pure RL without supervised fine-tuning, using Group Relative Policy Optimization (GRPO) as the RL framework. Reasoning models typically employ the same base Transformer decoder architecture as standard LLMs, with the key difference residing entirely in the post-training pipeline: RL replaces or augments standard RLHF/SFT, and reward signals are grounded in verifiable outcomes. The resulting models generate substantially longer token sequences during inference (reasoning tokens), which are often hidden from end users but incur real compute costs. Performance consistently improves with both more training-time RL compute and more inference-time thinking budget.
Scaling Laws are empirical regularities discovered by Kaplan et al. (2020) at OpenAI, describing how the performance of language models changes predictably with model size (parameter count N), dataset size (D), and compute budget (C). Cross-entropy loss scales as power laws with each of these three variables across many orders of magnitude. The study showed that architectural configuration (depth, width) has minimal impact at fixed N and C, that larger models are significantly more sample-efficient, and that optimally efficient training requires very large models on a relatively modest amount of data with early stopping. Hoffmann et al. (Chinchilla, 2022) refined these laws, showing that earlier models (including GPT-3) were massively undertrained and that optimal N and D should scale equally.
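As a back-of-the-envelope illustration of the Chinchilla result (my own example, not from the BIG-Bench material): the widely cited rule of thumb is roughly 20 training tokens per parameter for compute-optimal training, with the constant being approximate.

```python
"""Illustration of the approximate Chinchilla rule of thumb
(Hoffmann et al., 2022): ~20 training tokens per model parameter."""


def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param


print(f"{chinchilla_optimal_tokens(70e9):.1e}")  # ~1.4e+12 tokens for a 70B model
```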
Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data (next tokens in text, masked words, future video frames, future robot states) without human labels. This unlocks virtually unlimited raw data (web crawls, code, books, YouTube video, robot telemetry). The result is a set of weights encoding "world knowledge": dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).
| Title | Publisher | Type |
|---|---|---|
| Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (Srivastava et al. 2022) | BIG-bench collaboration / arXiv | scientific article |
| Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them (Suzgun et al. 2022) | arXiv | scientific article |
| BIG-bench: official GitHub repository | Google | repository |
| Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al. 2023) | Stanford NLP / arXiv | scientific article |
| TMLR publication of BIG-Bench (peer-reviewed version) | Transactions on Machine Learning Research | scientific article |