
The Benchmark — GSM8K

A plain-English explainer of one AI evaluation benchmark: what it measures, how it works, and when to trust it.

One LLM benchmark, explained for people who build with models.

GSM8K

A benchmark of roughly 8,500 grade-school math word problems designed to test arithmetic reasoning and multi-step problem-solving in language models.

What it measures

GSM8K tests whether models can solve elementary arithmetic problems that require reading comprehension, planning, and sequential calculation. The problems span addition, subtraction, multiplication, division, and percentages, typically requiring 2–8 reasoning steps. It’s fundamentally testing whether a model can parse a real-world scenario, identify the relevant numbers, choose the right operations in sequence, and arrive at a numerical answer.
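To make that concrete, here is the short reasoning chain a typical three-step problem demands, written out as code. The problem is invented for illustration; real GSM8K items are phrased as natural-language paragraphs, as shown in the next section.

```python
# Invented GSM8K-style problem: "A shirt costs $25. It is discounted by
# 20%, and then a $3 coupon is applied. What is the final price?"
price = 25.0
price -= price * 0.20  # step 1: take 20% off -> 20.0
price -= 3.0           # step 2: subtract the $3 coupon -> 17.0
print(int(price))      # step 3: report the final answer -> 17
```

The benchmark's difficulty comes less from the arithmetic itself than from extracting this little program from prose.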

Why it was created

In 2021, OpenAI researchers noticed that while large language models were improving at many tasks, their performance on grade-school math was surprisingly weak and inconsistent. Existing benchmarks like MATH focused on competition-level problems; GSM8K filled a gap by testing basic arithmetic reasoning at a difficulty level most humans master by middle school. The goal was a simple, interpretable benchmark for measuring genuine quantitative reasoning without mathematical notation or symbolic manipulation.

How it works

GSM8K contains 8,792 problems, split into 7,473 training examples and 1,319 test examples. Each problem is a short paragraph written in natural language (e.g., “Marcus has 25 donuts. He eats 5 and gives 8 to his friend. How many are left?”). Models generate free-form text responses rather than selecting from multiple-choice options. Scoring is exact match on the final numerical answer after normalization: the number the model produces must equal the gold answer. Most evaluations report accuracy as the percentage of problems answered correctly, as in the sketch below.
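Gold answers in the released dataset interleave worked steps with <<...>> calculator annotations and end with a line of the form #### 12. Harnesses differ in the details, but a minimal scoring sketch looks like the following; the regex and normalization choices here are illustrative assumptions, not the official evaluation script.

```python
import re

def extract_gold(answer: str) -> str:
    # GSM8K gold answers end with a line like "#### 12".
    return answer.split("####")[-1].strip().replace(",", "")

def extract_prediction(text: str) -> str | None:
    # Common heuristic: treat the last number in the model's free-form
    # response as its final answer.
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(model_output: str, gold_answer: str) -> bool:
    pred = extract_prediction(model_output)
    if pred is None:
        return False
    try:
        # Compare numerically so "12" and "12.0" count as the same answer.
        return float(pred) == float(extract_gold(gold_answer))
    except ValueError:
        return pred == extract_gold(gold_answer)

# The model gets full credit as long as the final number matches the gold.
gold = "He has 25 - 5 = <<25-5=20>>20, then 20 - 8 = <<20-8=12>>12.\n#### 12"
print(is_correct("After eating and sharing, Marcus has 12 donuts left.", gold))  # True
```

Accuracy is then just the mean of is_correct over the 1,319 test problems.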

What scores mean in practice

  • Human performance: 96% (humans typically solve these with ease).

  • Current leading models (as of late 2024): GPT-4 scores ~92%, Claude 3.5 Sonnet ~90%, Llama 3.1 70B ~87%.

  • Two years earlier (late 2022): SOTA was around 80% (GPT-3.5-Turbo).

  • Practical scale: a score above 85% suggests solid arithmetic reasoning, while below 70% the model makes frequent mistakes on multi-step problems. The gap between human and model performance (4–8 percentage points) is meaningful but narrower than on harder benchmarks like MATH.

Known limitations

  • Contamination risk is serious. These problems are straightforward and widely circulated in training data. Many models may have seen identical or near-identical problems during pretraining, inflating scores. OpenAI partially mitigated this with careful data collection, but some leakage is suspected; a simple overlap check is sketched after this list.

  • No reasoning transparency required. Scoring by exact match only on the final answer means models get full credit even if their reasoning is flawed or incoherent. A model could guess the right answer or use wrong math that happens to land on the correct number.

  • Limited to arithmetic. The benchmark doesn’t test algebra, geometry, or calculus reasoning. A model could score high on GSM8K while failing at more abstract mathematical thinking, giving a false sense of mathematical capability.
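As flagged in the first limitation, here is a minimal sketch of one common contamination heuristic: flag a test problem whose long n-grams appear verbatim in the pretraining corpus. The 13-word window follows the style of published overlap analyses, but the function names and whitespace tokenization are illustrative assumptions; real analyses handle tokenization, casing, and near-duplicates far more carefully.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    # Split on whitespace (a crude tokenizer) and collect all n-word windows.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_problem: str, corpus_ngrams: set[tuple[str, ...]]) -> bool:
    # Flag the problem if any of its 13-grams occurs verbatim in the
    # pretraining corpus (corpus_ngrams would be precomputed the same way).
    return bool(ngrams(test_problem) & corpus_ngrams)
```

A problem that trips this check may have been memorized rather than reasoned about, which is why reported scores without an overlap analysis deserve skepticism.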

When to trust it (and when not to)

  • Trust it as a basic sanity check. If a model scores below 75% on GSM8K, it genuinely struggles with simple multi-step arithmetic. High scores (85%+) are a reasonable floor signal, though they don’t prove the model won’t fail on harder math or novel problem structures.

  • Don’t use it alone to compare models on reasoning. GSM8K is easy enough that most modern LLMs cluster in the 80–92% range, making fine distinctions unreliable; the sketch below shows how much sampling noise the 1,319-item test set leaves. It’s better paired with harder benchmarks like MATH or AIME for real differentiation. And treat reported scores skeptically if the evaluating team hasn’t shown a detailed contamination analysis.
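To see why those fine distinctions are unreliable, consider plain binomial sampling noise on a 1,319-item test set. A minimal sketch, assuming a normal approximation (the 88% accuracy value is illustrative):

```python
import math

def ci_halfwidth(accuracy: float, n: int = 1319, z: float = 1.96) -> float:
    # Half-width of a 95% normal-approximation confidence interval
    # on an accuracy measured over n independent problems.
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

print(f"{ci_halfwidth(0.88):.3f}")  # ~0.018, i.e. about +/-1.8 points
```

A 2-point gap between two models is close to this noise floor before you even account for prompt sensitivity or contamination, which is why GSM8K works better as a floor check than as a leaderboard tiebreaker.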