
GPQA: Graduate-Level Google-Proof Q&A Benchmark

First "Google-proof" PhD-level benchmark where even highly skilled non-expert validators only reach 34% accuracy after 30 minutes of unrestricted web search, testing deep specialist knowledge of AI models that cannot be found by simple web lookup.

Category
frontier AI evaluation · scalable oversight research · specialist knowledge testing · safety evaluation

Questions are written by domain experts and validated by both other experts and skilled non-experts. For each question, accuracy was measured for domain experts, for non-experts with unrestricted internet access, and for AI models. Format: multiple choice with 4 options. The benchmark has three subsets: GPQA Diamond (hardest, 198 questions), GPQA Main (the headline 448-question set), and GPQA Extended (all 546 questions).
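
A minimal sketch of this evaluation protocol, assuming a hypothetical `ask_model` call and illustrative record fields ("question", "options", "correct_index") rather than the authors' actual harness:

```python
# Minimal sketch of a 4-option multiple-choice evaluation loop.
# `ask_model` is a hypothetical stand-in for an LLM call; each record is
# assumed to carry "question", "options" (list of 4), and "correct_index".
import random

LETTERS = "ABCD"

def format_prompt(q: dict) -> str:
    opts = "\n".join(f"{LETTERS[i]}) {o}" for i, o in enumerate(q["options"]))
    return f"{q['question']}\n{opts}\nAnswer with a single letter."

def evaluate(dataset: list[dict], ask_model) -> float:
    correct = 0
    for q in dataset:
        order = random.sample(range(4), 4)  # shuffle options against position bias
        shuffled = {**q, "options": [q["options"][i] for i in order]}
        reply = ask_model(format_prompt(shuffled)).strip().upper()[:1]
        if reply == LETTERS[order.index(q["correct_index"])]:
            correct += 1
    return correct / len(dataset)
```

Shuffling the answer options is optional but guards against position bias in how the answer key was written.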

Fills the absence of a benchmark that evaluates deep specialist knowledge at the PhD level and whose difficulty AI models cannot "bypass" through information lookup, a property crucial for scalable oversight research.

Common pitfalls

Small dataset size (448 questions)
MEDIUM

Small dataset size can cause high variance in results between runs.

Run multiple trials and report confidence intervals.
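
To see the scale of the problem, a 95% Wilson interval around a single-run accuracy on n = 448 questions spans roughly nine percentage points (wider still on the 198-question Diamond subset). A sketch; the 39% input is GPT-4's reported score from the timeline below:

```python
# 95% Wilson confidence interval for an accuracy measured once on n questions.
import math

def wilson_ci(acc: float, n: int, z: float = 1.96) -> tuple[float, float]:
    center = (acc + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * math.sqrt(acc * (1 - acc) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_ci(0.39, 448)            # GPT-4's ~39% on the 448-question set
print(f"95% CI: {lo:.1%} to {hi:.1%}")   # roughly 34.6% to 43.6%
```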

Critical subset distinction
HIGH

Scores on GPQA Diamond vs Extended differ substantially; reporting a score without specifying the subset is misleading.

Always report the subset name alongside the score.
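
One lightweight convention that makes the subset hard to omit is to carry it in the result record itself; a sketch (the class and fields are illustrative, not a standard API):

```python
# Bake the subset into the score record so it cannot be reported without it.
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkScore:
    benchmark: str
    subset: str        # e.g. "diamond", "main", "extended"
    accuracy: float

    def __str__(self) -> str:
        return f"{self.benchmark}-{self.subset}: {self.accuracy:.1%}"

print(BenchmarkScore("GPQA", "diamond", 0.42))  # placeholder value, not a real score
```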

GENESIS · Source paper

GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland et al. (arXiv, 2023)
2023

GPQA published (arXiv, November 2023)

breakthrough

Rein et al. introduce 448 PhD-level questions; GPT-4 achieves 39%, non-experts 34%.

2024

GPQA Diamond becomes standard frontier AI benchmark

GPT-4o, Claude 3 Opus, and Gemini Ultra report GPQA Diamond scores as a frontier capabilities measure.

Hardware agnostic · PRIMARY

Text-based benchmark independent of evaluation hardware.

Commonly used with

MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark proposed by Hendrycks et al. in 2021, covering 57 domains ranging from elementary mathematics, US history, and computer science to law, medicine, and ethics. The dataset contains over 14,000 multiple-choice questions drawn from academic and professional exams. MMLU became the de facto standard for measuring the general knowledge and reasoning capabilities of LLMs; early models barely exceeded random-chance accuracy (25%), while 2023-2024 models score in the 85-90% range, prompting harder successors (MMLU-Pro, GPQA).

MMLU-Pro

MMLU-Pro is an enhanced benchmark introduced by Wang et al. (2024), developed in response to the saturation of the original MMLU by modern models. Key changes from MMLU: (1) answer choices expanded from 4 to 10, reducing the effectiveness of random guessing; (2) trivial and noisy questions removed; (3) reasoning-focused multi-step questions added, on which chain-of-thought prompting outperforms direct answering (unlike on the original MMLU). MMLU-Pro causes a 16-33% drop in model accuracy relative to MMLU and reduces score sensitivity to prompt variations from 4-5% to 2%. Accepted at NeurIPS 2024 (Spotlight).
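
Point (1) is easy to quantify: widening 4 options to 10 drops a uniform guesser's expected score from 25% to 10% and narrows the spread it can reach by luck. A sanity-check sketch (the question count n is an illustrative number, not either benchmark's actual size):

```python
# Random-guess baseline for k answer options, with the 95% spread of a
# uniform guesser's score over n independent questions (binomial).
import math

n = 1000                                   # illustrative question count
for name, k in [("MMLU (4 options)", 4), ("MMLU-Pro (10 options)", 10)]:
    p = 1 / k
    spread = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"{name}: baseline {p:.0%} +/- {spread:.1%}")
# MMLU (4 options): baseline 25% +/- 2.7%
# MMLU-Pro (10 options): baseline 10% +/- 1.9%
```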
