The Benchmark — SimpleQA — Stochastic Sandbox

SimpleQA

Measures whether language models can answer factual questions correctly and know when they don’t know something.

What it measures

SimpleQA tests factual recall and calibration: can a model answer short, fact-seeking questions accurately, and, critically, does it decline to answer when it doesn’t know? The benchmark includes short-answer questions on people, places, dates, and other concrete facts that have verifiable correct answers. It’s designed to catch both wrong answers and overconfident hallucinations.

Why it was created

Released by OpenAI in October 2024, SimpleQA addresses a practical problem: existing benchmarks often reward confident-sounding answers regardless of accuracy, and they don’t penalize models for making up facts with false confidence. As LLMs moved into production systems where users rely on factual accuracy, the gap between “sounds plausible” and “actually correct” became critical to measure.

How it works

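SimpleQA contains 4,326 short, fact-seeking questions, each with a single verifiable answer (names, dates, numbers, locations). The questions were written by human trainers, collected adversarially so that they are hard for frontier models, and kept only when a second trainer independently arrived at the same answer. Each response is graded as correct (it contains the reference answer and does not contradict it), incorrect (it contradicts the reference answer), or not attempted (the model declines or gives no definite answer). In the published setup, grading is done by a prompted model rather than by exact string matching. The benchmark reports overall accuracy (% correct), the rate of unanswered questions, and accuracy among answered questions (“correct given attempted”). As a single summary number it also reports an F-score, the harmonic mean of overall accuracy and accuracy among answered questions. A wrong answer lowers accuracy among answered questions, whereas a declined question does not, so wrong answers are penalized more than refusals.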
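The arithmetic behind these metrics can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation code: the function name `summarize_grades` and the grade strings are assumptions made for this example. It computes overall accuracy, refusal rate, and accuracy among answered questions, plus the F-score that the SimpleQA paper reports (the harmonic mean of the two accuracy figures).

```python
from collections import Counter

# Grade labels are an assumption for this sketch; they mirror the three
# outcomes a SimpleQA response can receive.
VALID_GRADES = {"correct", "incorrect", "not_attempted"}


def summarize_grades(grades):
    """Turn per-question grades into aggregate metrics (hypothetical helper)."""
    counts = Counter(grades)
    unknown = set(counts) - VALID_GRADES
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("grades must not be empty")

    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    accuracy = correct / total
    refusal_rate = counts["not_attempted"] / total
    # Defined as 0.0 when the model attempted nothing, to avoid dividing by zero.
    accuracy_when_answered = correct / attempted if attempted else 0.0
    # Harmonic mean of overall accuracy and accuracy among answered questions.
    denom = accuracy + accuracy_when_answered
    f_score = 2 * accuracy * accuracy_when_answered / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "refusal_rate": refusal_rate,
        "accuracy_when_answered": accuracy_when_answered,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Made-up grades for illustration only, not real benchmark results.
    example = ["correct"] * 40 + ["incorrect"] * 40 + ["not_attempted"] * 20
    for name, value in summarize_grades(example).items():
        print(f"{name}: {value:.3f}")
```

Keeping the three outcomes separate, rather than collapsing them into a single accuracy number, is what lets the benchmark distinguish a model that guesses from one that declines.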

What scores mean in practice

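In OpenAI’s original report (October 2024), GPT-4o answered roughly 38% of SimpleQA questions correctly and Claude 3.5 Sonnet roughly 29%; published figures for Gemini 2.0 Flash are also well below 50%. For context, this doesn’t mean models are wrong about most facts: the questions were deliberately collected to be hard for frontier models, so the benchmark probes obscure, long-tail knowledge rather than everyday lookups. Raw accuracy also hides behavior. In the same report, Claude 3.5 Sonnet declined roughly a third of the questions, which lowers its overall accuracy but also reduces its wrong answers. When comparing models, read overall accuracy alongside the refusal rate and accuracy among answered questions, and verify manually wherever a wrong answer is costly. Smaller models score far lower: in the same report, GPT-4o mini answered fewer than 10% of questions correctly.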

Known limitations

Question simplicity: SimpleQA intentionally uses short, unambiguous questions with a single verifiable answer. It doesn’t test reasoning, synthesis, or questions where facts are contested or context-dependent. A model could score well here while still confidently hallucinating in more complex scenarios.
Refusal gaming: Models can artificially inflate accuracy among answered questions by declining whenever they are unsure. Refusals still lower overall accuracy, but that penalty may not fully capture real-world costs: in some applications, a refused answer is worse than a confident wrong one.
Limited domain coverage: The questions span general-knowledge topics such as science and technology, politics, art, geography, and sports. Specialized domains like cutting-edge research or recent news may not be represented proportionally, limiting how well this predicts performance in expert domains.
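The refusal-gaming limitation is easiest to see with numbers. The sketch below compares two hypothetical answering policies on the same 100 questions. All counts are invented for illustration, and the helper name `rates` is an assumption of this example rather than part of any benchmark tooling.

```python
def rates(correct, incorrect, not_attempted):
    """Compute headline rates from outcome counts (hypothetical helper)."""
    total = correct + incorrect + not_attempted
    if total == 0:
        raise ValueError("at least one question is required")
    attempted = correct + incorrect
    return {
        "overall_accuracy": correct / total,
        "refusal_rate": not_attempted / total,
        # Defined as 0.0 when nothing was attempted.
        "accuracy_when_answered": correct / attempted if attempted else 0.0,
    }


if __name__ == "__main__":
    # Invented scenario: the model is sure of 40 answers and unsure of 60.
    # Policy A answers everything and gets 10 of the 60 guesses right.
    always_answer = rates(correct=50, incorrect=50, not_attempted=0)
    # Policy B declines all 60 questions it is unsure about.
    selective = rates(correct=40, incorrect=0, not_attempted=60)

    for label, result in (("always answer", always_answer), ("selective", selective)):
        summary = ", ".join(f"{k}={v:.2f}" for k, v in result.items())
        print(f"{label}: {summary}")
```

In this made-up case the selective policy raises accuracy among answered questions from 50% to 100% while overall accuracy falls from 50% to 40%. Which policy is better depends on whether a refusal or a wrong answer costs more in the target application.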

When to trust it (and when not to)

Trust it for: Evaluating whether a model is suitable for FAQ systems, knowledge bases, or customer support where factual accuracy on common questions matters. It’s a useful sanity check that your model isn’t confidently making things up on basic facts.
Don’t rely on it alone for: Assessing reasoning, multimodal understanding, specialized expertise, or long-form generation. A high SimpleQA score doesn’t guarantee a model won’t hallucinate in more complex tasks. Also don’t use it as your only signal for real-time performance—facts become stale, and the benchmark’s ground truth is fixed.