The Benchmark — GPQA (Graduate-Level Google-Proof Q&A)
A plain-English explainer of one AI evaluation benchmark: what it measures, how it works, and when to trust it.
GPQA (Graduate-Level Google-Proof Q&A)
Evaluates whether models can answer extremely difficult, counterintuitive multiple-choice questions that even domain experts struggle with and that Google searches don’t easily resolve.
What it measures
GPQA tests deep domain knowledge in biology, physics, and chemistry at the graduate level. It consists of 448 multiple-choice questions where the correct answer is often counterintuitive—the kind of question where a reasonable person might confidently pick the wrong option. The benchmark deliberately includes questions where naive web search doesn’t surface the answer directly.
Why it was created
GPQA was introduced in late 2023 by researchers at NYU, Cohere, and Anthropic (Rein et al.) to address a critical problem: existing benchmarks like MMLU were becoming saturated, with frontier models such as GPT-4 and Claude 3.5 Sonnet clustered in the high 80s, making it hard to tell them apart. The creators wanted a benchmark where even humans with PhDs in the relevant fields still make mistakes, so that model performance would reflect genuine understanding rather than pattern matching or retrieval.
How it works
GPQA contains 448 questions across biology, physics, and chemistry, all held out for evaluation (there is no training split); a harder, more rigorously validated 198-question "Diamond" subset is the one most frontier-model results report. Each question is multiple choice with four options. The dataset was built by recruiting PhD-level domain experts to write questions, having other experts verify the answers, and then confirming the questions are genuinely hard to look up: skilled non-experts given unrestricted web access and roughly half an hour per question answered only about a third correctly. Scoring is straightforward: the percentage of correct answers.
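As a concrete illustration, the scoring loop is nothing more exotic than accuracy over four-way multiple choice. The sketch below uses made-up placeholder items and a stand-in `ask_model` function; the item format shown is an assumption for illustration, not the official GPQA schema.

```python
import random

# Toy stand-ins for GPQA items; the real questions are far harder and are
# not reproduced here. The question/options/answer-letter format is an
# assumed illustration, not the official release schema.
SAMPLE_QUESTIONS = [
    {"question": "Placeholder question 1", "options": ["w", "x", "y", "z"], "answer": "C"},
    {"question": "Placeholder question 2", "options": ["w", "x", "y", "z"], "answer": "A"},
]

def ask_model(question: str, options: list[str]) -> str:
    """Stand-in for a real model call; here it just guesses a letter,
    which over many questions lands near the 25% chance floor."""
    return random.choice("ABCD")

def accuracy(questions) -> float:
    """GPQA scoring is plain accuracy: the fraction answered correctly."""
    correct = sum(
        ask_model(q["question"], q["options"]) == q["answer"] for q in questions
    )
    return correct / len(questions)

if __name__ == "__main__":
    print(f"Accuracy on the toy set: {accuracy(SAMPLE_QUESTIONS):.1%}")
```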
What scores mean in practice
Human performance by PhD-level experts in the relevant field averages around 65% correct (roughly 74% once clear mistakes identified in hindsight are discounted), which is the critical reference point. Leading models as of late 2024 score roughly as follows on the Diamond subset:
- GPT-4o: ~50-53%
- Claude 3.5 Sonnet: ~59-65% (later 2024 versions at the higher end)
- o1 (reasoning model): ~75-78%
For perspective, GPT-3.5 (the late-2022 frontier) scores barely above the 25% random-guess floor, around 28-30%, and even GPT-4 managed under 40% in the original paper. The benchmark has proven genuinely difficult to saturate: even top models still miss a meaningful fraction of questions.
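One way to read these numbers is against the 25% random-guess floor rather than against zero. The helper below is a back-of-the-envelope rescaling added here for illustration, not a metric defined by GPQA itself; the example scores fed to it are the rough figures quoted above.

```python
def chance_adjusted(accuracy: float, n_options: int = 4) -> float:
    """Rescale accuracy so 0.0 means random guessing and 1.0 means perfect.
    With four options the guessing floor is 1/4 = 25%."""
    floor = 1 / n_options
    return (accuracy - floor) / (1 - floor)

# Rough late-2024 figures from the list above, used purely as example inputs.
for name, acc in [("PhD experts", 0.65), ("GPT-4o", 0.53), ("o1", 0.78)]:
    print(f"{name:12s} raw {acc:.0%} -> chance-adjusted {chance_adjusted(acc):.0%}")
```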
Known limitations
- Expert labeling isn't infallible: even PhD-level experts sometimes disagree on the "correct" answer in ambiguous cases. The benchmark relies on expert consensus, but edge cases remain, and some questions carry contextual nuance that even their authors didn't fully capture.
- Potential contamination from pretraining: the questions are newly written, but the underlying graduate-level facts appear in textbooks and papers that were likely in model training data, so it's unclear how much performance reflects genuine reasoning versus memorized facts. The creators' "Google-proof" screening (checking that skilled non-experts with web access still fail) helps but isn't a complete solution.
- Very small coverage: 448 questions across only three scientific disciplines means the benchmark may not generalize to other knowledge-intensive domains such as medicine, law, or engineering, and the small question pool also leaves a few percentage points of sampling noise in any single score (see the sketch after this list). Performance on GPQA doesn't necessarily predict performance on similarly difficult questions in other fields.
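On the sampling-noise point, the snippet below applies a standard normal-approximation 95% confidence interval to a score measured on the 198-question Diamond subset. This is not part of GPQA's methodology, just a quick way to see that score gaps of a few points may not be meaningful.

```python
import math

def accuracy_ci(accuracy: float, n_questions: int, z: float = 1.96):
    """95% normal-approximation confidence interval for an accuracy
    measured on n_questions independent items."""
    margin = z * math.sqrt(accuracy * (1 - accuracy) / n_questions)
    return accuracy - margin, accuracy + margin

low, high = accuracy_ci(0.60, 198)  # e.g. a 60% score on GPQA Diamond
print(f"60% on 198 questions -> 95% CI roughly {low:.1%} to {high:.1%}")
```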
When to trust it (and when not to)
- Trust it for: comparing frontier models on hard, domain-specific scientific reasoning. If you care whether a model can handle graduate-level science questions that require genuine knowledge rather than surface-level pattern matching, GPQA is a strong signal. It's one of the few benchmarks where even human experts fall well short of a perfect score.
- Don't trust it for: general reasoning ability or practical problem-solving. A model's GPQA score says little about whether it can debug code, write clear documentation, or handle real-world tasks. It's a narrow, specific test optimized to be hard, which is partly why it's useful, but also why it shouldn't be used as a holistic capability measure.