FrontierMath
Expert-level mathematics benchmark of original, unpublished problems created by research mathematicians; current frontier AI solves under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community.
Expert mathematicians create original problems outside the scope of existing published materials. Each problem has a verifiable answer (a number, formula, or other mathematical object), and results are checked automatically using a Python/Mathematica interpreter. Questions are withheld from public AI models until after those models have submitted their answers.
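As a concrete illustration of this style of automated checking, here is a minimal Python sketch that compares a model's submitted answer against an exact symbolic reference using SymPy. All names here (`Problem`, `check_submission`, the toy problem) are hypothetical illustrations, not the benchmark's actual harness, which is not public.

```python
# Minimal sketch of automated answer verification, in the spirit of
# FrontierMath's interpreter-based checking. All names are illustrative.
from dataclasses import dataclass

import sympy as sp


@dataclass
class Problem:
    statement: str
    reference_answer: sp.Expr  # exact symbolic ground truth


def check_submission(problem: Problem, submitted: str) -> bool:
    """Parse the model's answer string and compare it exactly to the reference."""
    try:
        candidate = sp.sympify(submitted)
    except (sp.SympifyError, TypeError):
        return False  # unparseable output counts as a failure
    # simplify(candidate - reference) == 0 tests exact symbolic equality
    return sp.simplify(candidate - problem.reference_answer) == 0


# Toy problem whose verifiable answer is a closed-form number
toy = Problem(statement="Evaluate sum_{n>=1} 1/n^2.",
              reference_answer=sp.pi**2 / 6)
print(check_submission(toy, "pi**2/6"))  # True: exact match
print(check_submission(toy, "1.6449"))   # False: inexact decimal is rejected
```

Exact symbolic comparison, rather than string or floating-point matching, is what makes answers machine-verifiable regardless of how the model formats its result.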
Addresses the saturation of existing mathematics benchmarks (e.g., MATH, AMC) by frontier models, and the absence of a reliable measure of the distance between AI capabilities and those of contemporary research mathematicians.
Common pitfalls
Dataset not fully public (MEDIUM)
FrontierMath does not release questions publicly to prevent contamination, requiring controlled access for evaluation.
Contact the authors to obtain evaluation access.
Reference implementations
GENESIS · Source paper
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI (arXiv, November 2024)
Breakthrough: Glazer et al. from Epoch AI introduce the research-mathematics benchmark; frontier AI solves <2% of problems.
The benchmark is hardware-independent; answers are verified via a Python interpreter.
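Under the same illustrative assumptions, a hedged sketch of how an aggregate solve rate such as the reported <2% could be computed from verifier outcomes (reusing the hypothetical `check_submission` and `toy` from the sketch above; `solve_rate` is likewise illustrative, not the benchmark's actual scoring code):

```python
# Hypothetical scoring loop: run each submitted answer through the
# verifier and report the fraction of problems solved.
def solve_rate(problems, answers) -> float:
    solved = sum(check_submission(p, a) for p, a in zip(problems, answers))
    return solved / len(problems)


# e.g. 1 correct answer out of 60 problems -> ~1.7%, under the 2% mark
print(f"{solve_rate([toy] * 60, ['pi**2/6'] + ['0'] * 59):.1%}")  # 1.7%
```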
| Title | Publisher | Type |
|---|---|---|
| FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI | arXiv | scientific article |
| FrontierMath (Epoch AI official page) | Epoch AI | official website |