
GPQA: Graduate-Level Google-Proof Q&A Benchmark

First "Google-proof" PhD-level benchmark where even highly skilled non-expert validators only reach 34% accuracy after 30 minutes of unrestricted web search, testing deep specialist knowledge of AI models that cannot be found by simple web lookup.

Category
frontier AI evaluation · scalable oversight research · specialist knowledge testing · safety evaluation

Questions are written by domain experts and validated by both other experts and skilled non-experts. For each question, accuracy was measured for domain experts, for non-experts with unrestricted internet access, and for AI models. Format: multiple choice with 4 options. The benchmark has three subsets: GPQA Diamond (hardest, 198 questions), GPQA Main (the headline 448-question set), and GPQA Extended (all 546 questions).
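
A minimal sketch of this evaluation protocol, assuming a hypothetical `ask_model` call and illustrative record fields ("question", "options", "correct_index") rather than the authors' actual harness:

```python
# Minimal sketch of a 4-option multiple-choice evaluation loop.
# `ask_model` is a hypothetical stand-in for an LLM call; each record is
# assumed to carry "question", "options" (list of 4), and "correct_index".
import random

LETTERS = "ABCD"

def format_prompt(q: dict) -> str:
    opts = "\n".join(f"{LETTERS[i]}) {o}" for i, o in enumerate(q["options"]))
    return f"{q['question']}\n{opts}\nAnswer with a single letter."

def evaluate(dataset: list[dict], ask_model) -> float:
    correct = 0
    for q in dataset:
        order = random.sample(range(4), 4)  # shuffle options against position bias
        shuffled = {**q, "options": [q["options"][i] for i in order]}
        reply = ask_model(format_prompt(shuffled)).strip().upper()[:1]
        if reply == LETTERS[order.index(q["correct_index"])]:
            correct += 1
    return correct / len(dataset)
```

Shuffling the answer options is optional but guards against position bias in how the answer key was written.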

Fills the absence of a benchmark that evaluates deep specialist knowledge at the PhD level and whose difficulty AI models cannot "bypass" through information lookup, a property crucial for scalable oversight research.

Common pitfalls

Small dataset size (448 questions)
MEDIUM

Small dataset size can cause high variance in results between runs.

Run multiple trials and report confidence intervals.
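
To see the scale of the problem, a 95% Wilson interval around a single-run accuracy on n = 448 questions spans roughly nine percentage points (wider still on the 198-question Diamond subset). A sketch; the 39% input is GPT-4's reported score from the timeline below:

```python
# 95% Wilson confidence interval for an accuracy measured once on n questions.
import math

def wilson_ci(acc: float, n: int, z: float = 1.96) -> tuple[float, float]:
    center = (acc + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * math.sqrt(acc * (1 - acc) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_ci(0.39, 448)            # GPT-4's ~39% on the 448-question set
print(f"95% CI: {lo:.1%} to {hi:.1%}")   # roughly 34.6% to 43.6%
```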

Critical subset distinction
HIGH

Scores on GPQA Diamond vs Extended differ substantially; reporting a score without specifying the subset is misleading.

Always report the subset name alongside the score.
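
One lightweight convention that makes the subset hard to omit is to carry it in the result record itself; a sketch (the class and fields are illustrative, not a standard API):

```python
# Bake the subset into the score record so it cannot be reported without it.
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkScore:
    benchmark: str
    subset: str        # e.g. "diamond", "main", "extended"
    accuracy: float

    def __str__(self) -> str:
        return f"{self.benchmark}-{self.subset}: {self.accuracy:.1%}"

print(BenchmarkScore("GPQA", "diamond", 0.42))  # placeholder value, not a real score
```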

GENESIS · Source paper

GPQA: A Graduate-Level Google-Proof Q&A Benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland et al. (arXiv, 2023)
2023

GPQA published (arXiv, November 2023)

breakthrough

Rein et al. introduce 448 PhD-level questions; GPT-4 achieves 39%, non-experts 34%.

2024

GPQA Diamond becomes standard frontier AI benchmark

GPT-4o, Claude 3 Opus, and Gemini Ultra report GPQA Diamond scores as a frontier capabilities measure.

Hardware agnostic · PRIMARY

Text-based benchmark independent of evaluation hardware.

Commonly used with

MMLU

MMLU (Massive Multitask Language Understanding) is a benchmark proposed by Hendrycks et al. in 2021, covering 57 domains ranging from elementary mathematics, US history, and computer science to law, medicine, and ethics. The dataset contains over 14,000 multiple-choice questions drawn from academic and professional exams. MMLU became the de facto standard for measuring the general knowledge and reasoning capabilities of LLMs; early models barely exceeded random-chance accuracy (25%), while 2023-2024 models score in the 85-90% range, prompting harder successors (MMLU-Pro, GPQA).

MMLU-Pro

MMLU-Pro is an enhanced benchmark introduced by Wang et al. (2024), developed in response to the saturation of the original MMLU by modern models. Key changes from MMLU: (1) answer choices expanded from 4 to 10, reducing the effectiveness of random guessing; (2) trivial and noisy questions removed; (3) reasoning-focused multi-step questions added, on which chain-of-thought prompting outperforms direct answering (unlike on the original MMLU). MMLU-Pro causes a 16-33% drop in model accuracy relative to MMLU and reduces score sensitivity to prompt variations from 4-5% to 2%. Accepted at NeurIPS 2024 (Spotlight).
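
Point (1) is easy to quantify: widening 4 options to 10 drops a uniform guesser's expected score from 25% to 10% and narrows the spread it can reach by luck. A sanity-check sketch (the question count n is an illustrative number, not either benchmark's actual size):

```python
# Random-guess baseline for k answer options, with the 95% spread of a
# uniform guesser's score over n independent questions (binomial).
import math

n = 1000                                   # illustrative question count
for name, k in [("MMLU (4 options)", 4), ("MMLU-Pro (10 options)", 10)]:
    p = 1 / k
    spread = 1.96 * math.sqrt(p * (1 - p) / n)
    print(f"{name}: baseline {p:.0%} +/- {spread:.1%}")
# MMLU (4 options): baseline 25% +/- 2.7%
# MMLU-Pro (10 options): baseline 10% +/- 1.9%
```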
