The Benchmark — MATH — Stochastic Sandbox

MATH

Evaluates whether language models can solve competition-style mathematics problems requiring multi-step reasoning and symbolic manipulation.

What it measures

MATH tests a model’s ability to solve problems drawn from high school mathematics competitions across seven subjects: prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. Unlike benchmarks that ask multiple-choice questions about math concepts, MATH requires models to generate a complete step-by-step solution and arrive at the correct final answer, which may be numeric or symbolic. This tests both conceptual understanding and the ability to execute long reasoning chains without arithmetic errors.

Why it was created

Released in 2021 by Dan Hendrycks and collaborators at UC Berkeley and the University of Chicago, MATH was designed to address a gap: existing math benchmarks were either multiple-choice (like MMLU’s math subset) or relatively easy. The creators wanted a benchmark that reflected actual mathematical problem-solving as it appears in competitions like the AMC and AIME, where reasoning depth matters as much as knowing formulas. It became a standard measure of reasoning capability once larger models began posting non-trivial scores on it.

How it works

MATH contains 12,500 problems sourced from high school math competitions, each tagged with a subject and a difficulty level from 1 to 5. Every problem comes with a full step-by-step reference solution, and models must generate a solution ending with a final numerical or symbolic answer (the reference solutions mark it with \boxed{...}). Scoring is exact match on that final answer: it is either correct or wrong, and no partial credit is awarded for correct intermediate steps. Because answers are expected in a specific form (for example, a fully simplified fraction), evaluation is strict about formatting. The dataset is split into 7,500 training and 5,000 test problems.
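The scoring procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the official grader: it assumes the model is prompted to wrap its final answer in \boxed{...} (as the dataset's reference solutions do), and the function names and the deliberately light normalization rules are hypothetical.

```python
import re
from typing import Optional


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward, tracking brace depth so nested braces such as
    # \frac{1}{2} stay inside the extracted answer.
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def normalize(answer: str) -> str:
    """Hypothetical normalization: removes cosmetic differences only."""
    answer = answer.replace("\\dfrac", "\\frac").replace("$", "")
    return re.sub(r"\s+", "", answer)


def is_correct(model_solution: str, reference_answer: str) -> bool:
    """Exact match on the normalized final answer; no partial credit."""
    predicted = extract_boxed(model_solution)
    if predicted is None:
        return False
    return normalize(predicted) == normalize(reference_answer)


def accuracy(pairs) -> float:
    """pairs: iterable of (model_solution, reference_answer) tuples."""
    pairs = list(pairs)
    return sum(is_correct(s, r) for s, r in pairs) / len(pairs)


examples = [
    # Same answer, cosmetic difference only: counted as correct.
    ("... so the probability is $\\boxed{\\dfrac{1}{2}}$.", "\\frac{1}{2}"),
    # Right value, different form: counted as wrong under strict matching.
    ("... so the probability is $\\boxed{0.5}$.", "\\frac{1}{2}"),
    # Right value, but no boxed final answer: counted as wrong.
    ("... so the probability is 1/2.", "\\frac{1}{2}"),
]
print(f"accuracy = {accuracy(examples):.3f}")
```

Only the first of the three example solutions is credited, even though all three reach the right value. Real evaluation harnesses usually apply more normalization than this sketch, but the all-or-nothing structure is the same.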

What scores mean in practice

Human baseline: In the original paper’s small human study, a computer science PhD student who did not especially like mathematics scored about 40%, while a three-time IMO gold medalist scored 90%.

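Current SOTA (as of early 2025): Reasoning-focused models such as OpenAI’s o1 are reported at roughly 95-96% on MATH or its 500-problem MATH-500 subset. General-purpose frontier models score lower: GPT-4o is reported at around 76% and Claude 3.5 Sonnet at roughly 71-78% depending on the version. Open models like Llama 3.1 405B reach about 74%. Reported numbers vary with prompting, number of attempts, and whether the full test set or a subset is used.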

Context: When MATH launched in 2021, GPT-3 scored ~5%, and the best fine-tuned model in the original paper reached only about 7%. By 2023, GPT-4 was reported at roughly 52%, a jump widely read as evidence of improving multi-step reasoning.

A score above 90% is now table stakes for frontier models marketed on “reasoning ability.” A score below 70% suggests limited multi-step mathematical reasoning.

Known limitations

Exact-match scoring is brittle: A model that reasons perfectly but makes a single arithmetic error or formats the answer incorrectly gets zero credit. This doesn’t reflect real mathematical understanding as finely as partial credit would, and may penalize certain solution approaches unfairly.
High saturation and potential contamination: With top reported scores now exceeding 96%, the benchmark is losing discriminative power. Because the problems and solutions are publicly available online, training data for frontier models may also have included MATH problems or close variants, making it unclear how much performance reflects genuine reasoning vs. memorization.
Narrow scope despite breadth: The benchmark tests competition mathematics, which is a specific problem-solving style emphasizing closed-form answers. It doesn’t test applied math, statistical reasoning, modeling real-world systems, or mathematical intuition — domains where model reasoning is weaker and less studied.
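To make the saturation point concrete, the sketch below computes a 95% Wilson score interval for a measured accuracy. It is a rough illustration under simplifying assumptions: each problem is treated as an independent trial, the function name is hypothetical, and the 500-problem case stands in for any smaller evaluation subset.

```python
import math


def wilson_interval(accuracy: float, n_problems: int, z: float = 1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    z2 = z * z
    denom = 1 + z2 / n_problems
    center = (accuracy + z2 / (2 * n_problems)) / denom
    half_width = (
        z
        * math.sqrt(
            accuracy * (1 - accuracy) / n_problems
            + z2 / (4 * n_problems**2)
        )
        / denom
    )
    return center - half_width, center + half_width


# Compare the full 5,000-problem test set with a 500-problem subset,
# both at a measured accuracy of 96%.
for n in (5000, 500):
    low, high = wilson_interval(0.96, n)
    print(f"n={n}: 96% measured, 95% interval {low:.3f} to {high:.3f}")
```

On the full test set the interval is about one percentage point wide, but on a 500-problem subset it spans more than three points. With only four points of headroom left above 96%, sampling noise alone can cover much of the gap between top models.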

When to trust it (and when not to)

Trust it when: You’re comparing two models directly under the same evaluation setup, especially if both score below 90%. It’s a useful signal that a model can chain reasoning steps without losing the thread, and it’s most informative for models in the 20-75% range, where the benchmark still has the most resolution.
Don’t trust it as: A standalone measure of mathematical reasoning or “general intelligence.” High MATH scores don’t predict real-world problem-solving, applied math, or performance on novel problem types. For frontier models, the score tells you less than it did two years ago: once models score above 90%, you need harder benchmarks (like FrontierMath or AIME-level problems) to differentiate them.
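When two models are scored on the same problems, one reasonable way to check whether a gap is more than noise is a paired test on per-problem outcomes. The sketch below uses McNemar's test without continuity correction. It is an illustrative choice rather than a method this benchmark prescribes, and the function name and toy data are hypothetical.

```python
def mcnemar_statistic(correct_a, correct_b) -> float:
    """McNemar chi-square statistic for two models on the same problems.

    correct_a, correct_b: equal-length sequences of booleans, one entry
    per problem, True where that model's final answer was correct.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same problems")
    # Only problems where the two models disagree carry information.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    if only_a + only_b == 0:
        return 0.0
    return (only_a - only_b) ** 2 / (only_a + only_b)


# Toy data: model A solves two problems that model B misses.
model_a = [True, True, False, True]
model_b = [True, False, False, False]
stat = mcnemar_statistic(model_a, model_b)

# 3.84 is the chi-square critical value for p < 0.05 with 1 degree of freedom.
print(f"statistic = {stat:.2f}, significant at 5%: {stat > 3.84}")
```

The statistic depends only on the problems where the models disagree. Two models at 95% and 96% that mostly miss the same problems may be statistically indistinguishable, which is one reason to move to harder benchmarks near the ceiling.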