Robots Atlas

Holistic Evaluation of Language Models

First multi-metric LLM evaluation framework simultaneously measuring 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) across 42 scenarios, revealing model trade-offs invisible in single-metric rankings.

Category
Abstraction level
Operation level
holistic LLM evaluation · model comparison · AI transparency research · safety and fairness assessment

HELM defines a taxonomy of scenarios (domain × task × metric) and selects a representative subset. Each of the 30 models is evaluated on the same prompts under standardized conditions. Results for 7 metrics are reported per scenario and aggregated into a model profile. The platform is hosted by Stanford CRFM with public access to raw data.
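The aggregation step can be sketched in a few lines of code. This is a minimal illustration, not HELM's actual schema or API: the result layout, metric names, and scores below are made up for the example, and each metric is assumed to be normalized so that higher is better.

```python
# Sketch: aggregate per-scenario metric scores into a model profile.
# Data layout, names, and scores are illustrative assumptions only.
from collections import defaultdict

# (model, scenario) -> {metric: score}, scores in [0, 1], higher is better
# (a metric like toxicity would be stored as 1 - toxicity rate).
results = {
    ("model-a", "mmlu"):      {"accuracy": 0.72, "calibration": 0.81},
    ("model-a", "summarize"): {"accuracy": 0.64, "calibration": 0.77},
    ("model-b", "mmlu"):      {"accuracy": 0.68, "calibration": 0.90},
    ("model-b", "summarize"): {"accuracy": 0.70, "calibration": 0.88},
}

def model_profile(results, model):
    """Mean of each metric over all scenarios the model was run on."""
    totals, counts = defaultdict(float), defaultdict(int)
    for (m, _scenario), metrics in results.items():
        if m != model:
            continue
        for metric, score in metrics.items():
            totals[metric] += score
            counts[metric] += 1
    return {metric: totals[metric] / counts[metric] for metric in totals}

print(model_profile(results, "model-a"))
```

Because every model sees the same scenarios and prompts, the resulting profiles are directly comparable, which is what makes the trade-offs (e.g. accuracy vs. toxicity) visible.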

Fragmentation and selectivity in LLM evaluation – models were compared on different datasets with different metrics, making fair comparisons impossible and hiding important trade-offs (e.g. high accuracy with high toxicity).

Common pitfalls

Computational cost of full evaluation
MEDIUM

Evaluating 30 models across 42 scenarios is computationally and financially expensive, limiting access to full evaluation.

Use the subset of 16 core scenarios and a single reference model.
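The scale of the saving is easy to see with back-of-envelope run counts (one run = one model evaluated on one scenario); the numbers come from the figures stated above, while the per-run cost itself varies by model and is not estimated here.

```python
# Back-of-envelope run counts for full vs. reduced HELM-style evaluation.
models, scenarios = 30, 42
full_runs = models * scenarios        # every model on every scenario
core_runs = 1 * 16                    # one reference model, 16 core scenarios
print(full_runs, core_runs, full_runs // core_runs)
```

The reduced setup cuts the number of evaluation runs by almost two orders of magnitude, at the cost of losing cross-model comparability.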

GENESIS · Source paper

Holistic Evaluation of Language Models
2022 (arXiv) · Transactions on Machine Learning Research (TMLR) 2023 · Percy Liang, Rishi Bommasani, Tony Lee et al.
2022

HELM published (arXiv + TMLR)

breakthrough

Percy Liang and 49 co-authors introduce the framework; 30 models evaluated across 42 scenarios.

2023

HELM published in TMLR, extended with new models

The v2 release extends the benchmark with 2023 models and new scenarios.

Hardware agnostic · PRIMARY

Evaluation framework independent of hardware architecture – evaluation runs via API or local model inference.

Commonly used with

MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark proposed by Hendrycks et al. in 2021, covering 57 domains ranging from elementary mathematics, US history, and computer science to law, medicine, and ethics. The dataset contains over 14,000 multiple-choice questions drawn from academic and professional exams. MMLU became the de facto standard for measuring general knowledge and reasoning in LLMs; it revealed that early models barely exceeded random-chance accuracy (25%), while 2023–2024 models score above 85–90%, prompting harder successors (MMLU-Pro, GPQA).

GO TO CONCEPT
BIG-Bench

BIG-Bench (Beyond the Imitation Game Benchmark) is a large-scale evaluation benchmark for large language models, published in 2022 by Srivastava et al. (the BIG-bench collaboration: 450+ authors from 132 institutions). The suite contains 204 tasks (214+ with later additions) covering a broad spectrum of capabilities: logical reasoning, mathematics, programming, common sense, linguistics, ethics, medicine, biology, mythology, wordplay, ASCII code, planning, theory of mind, and more. Each task was contributed by the community as a hard task, one that contemporary models could not solve.

The benchmark is released under Apache 2.0 as a GitHub repository with a standardized task format (JSON) and evaluation harness (multiple choice, generative scoring, programmatic). The BIG-Bench Hard (BBH) subset, 23 tasks on which models scored worse than humans, became the canonical reasoning test for GPT-4, Claude, Gemini, Llama 3, and later models. Srivastava et al. also formalized definitions of emergent abilities: capabilities whose accuracy grows non-linearly with scale (a "phase transition"). BIG-Bench was central to the 2022–2023 emergence debate (Wei et al.; Schaeffer et al. as a critical response).

The benchmark influenced later projects such as HELM (Stanford CRFM), AGIEval, GPQA, and the robotics-focused Open-X-Embodiment. In 2025, BIG-Bench Hard remains in use as a reference reasoning suite despite rapid saturation: frontier LLMs (GPT-5, Gemini 3, Claude Opus 4) reach 95–98% accuracy on BBH.
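The standardized JSON task format can be illustrated with a minimal mock task. The field names below approximate the published BIG-bench layout but should be treated as an assumption, and the task content itself is invented for the example:

```python
import json

# A minimal mock task in a BIG-bench-style JSON layout (field names are
# an approximation of the published format; the task itself is invented).
task = {
    "description": "Toy word-reversal task used only to illustrate the format.",
    "keywords": ["logical reasoning", "wordplay"],
    "metrics": ["exact_str_match"],
    "examples": [
        {"input": "Reverse the word: atlas", "target": "salta"},
        {"input": "Reverse the word: helm", "target": "mleh"},
    ],
}

# Round-trip through JSON, as a task file would be loaded by a harness.
loaded = json.loads(json.dumps(task, indent=2))
print(loaded["examples"][0]["target"])
```

Keeping every task in one declarative format is what lets a single harness score hundreds of community-contributed tasks with shared metrics.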

GO TO CONCEPT
HELM Platform – Stanford CRFM
Official website · Stanford CRFM