The Benchmark — MMLU-Pro — Stochastic Sandbox

MMLU-Pro

A harder version of MMLU that filters out easy questions and expands each question from four answer options to as many as ten, using plausible distractors to test genuine reasoning rather than pattern matching.

What it measures

MMLU-Pro evaluates broad knowledge across 14 academic domains (chemistry, philosophy, law, biology, etc.) but with substantially harder questions designed to distinguish between models that memorize and models that reason. The benchmark contains multiple-choice questions where incorrect answers are carefully crafted to be plausible distractors rather than obvious wrong choices.

Why it was created

MMLU (released 2020) became so widely used that it saturated: by 2023-2024, leading models were scoring 85%+ on it, making it a poor differentiator. Researchers led by TIGER-Lab at the University of Waterloo released MMLU-Pro in mid-2024 to address this ceiling effect and distinguish models by their actual reasoning ability rather than their ability to recognize memorized patterns or choose obviously correct answers.

How it works

MMLU-Pro contains roughly 12,000 questions across 14 academic disciplines, consolidated from the 57 subjects of the original MMLU and supplemented with questions from other sources. The creators filtered out the questions they judged to be easiest (about 40% of the original pool), then expanded the answer options from 4 to as many as 10. Instead of random or obviously wrong distractors, they used an earlier model (GPT-4-Turbo) to generate plausible incorrect answers that require genuine reasoning to distinguish from the correct one, followed by human review. With 10 options, the random-guessing baseline drops from 25% to about 10%. Scoring is standard accuracy: the percentage of questions answered correctly, typically measured with few-shot chain-of-thought prompting.
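
The accuracy scoring described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's official evaluation script: the function names, the option letters, and the toy answer key are all hypothetical, and the number of options is left as a parameter.

```python
def accuracy(predictions, answers):
    """Fraction of questions whose predicted option letter matches the answer key.

    A prediction of None (for example, when no option letter could be
    extracted from the model output) simply counts as wrong.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return correct / len(answers)


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing among num_options choices."""
    return 1.0 / num_options


# Hypothetical toy data, not real benchmark items.
answers = ["A", "C", "J", "B", "F"]
predictions = ["A", "C", "D", "B", None]

score = accuracy(predictions, answers)
print(f"accuracy: {score:.1%}")
print(f"chance with 4 options: {chance_accuracy(4):.0%}")
print(f"chance with 10 options: {chance_accuracy(10):.0%}")
```

Counting unparseable outputs as wrong is one possible design choice; an evaluation harness may handle them differently, which is one reason reported scores for the same model can differ slightly between sources.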

What scores mean in practice

This is where MMLU-Pro becomes useful for practitioners. Approximate reported scores as of mid-2024 (exact figures vary with prompting and evaluation setup):

Human experts in their own domain: ~90% (the target ceiling)
GPT-4 Turbo (April 2024): ~64%
GPT-4o (May 2024): 72.6%
Claude 3.5 Sonnet (June 2024): ~76%
Llama 3.1 405B (July 2024): ~73%
GPT-4o mini: ~63%

For comparison, the same models score roughly 82-89% on original MMLU, and the MMLU-Pro authors report accuracy drops of 16% to 33% across the models they evaluated. As a rough guide, a score in the low-to-mid 70s on MMLU-Pro was frontier-level in mid-2024; 80%+ is genuinely strong; 85%+ is approaching expert performance on narrow domains.
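
One caveat when comparing a score on original MMLU with a score on MMLU-Pro is that the chance level depends on the number of answer options (4 in original MMLU, up to 10 in MMLU-Pro as released). The sketch below shows the arithmetic for a raw gap and a chance-adjusted gap. The function names and the two accuracies are hypothetical, chosen only to illustrate the calculation, and chance adjustment is a common normalization rather than part of the benchmark's official scoring.

```python
def chance_adjusted(acc, num_options):
    """Rescale accuracy so that random guessing maps to 0 and a perfect score to 1."""
    chance = 1.0 / num_options
    return (acc - chance) / (1.0 - chance)


def gap_in_points(acc_a, acc_b):
    """Difference between two accuracies, in percentage points."""
    return (acc_a - acc_b) * 100.0


# Hypothetical accuracies for a single model, for illustration only.
mmlu_acc = 0.88      # 4 options per question
mmlu_pro_acc = 0.73  # assumed 10 options per question

raw_gap = gap_in_points(mmlu_acc, mmlu_pro_acc)
adjusted_gap = gap_in_points(
    chance_adjusted(mmlu_acc, 4),
    chance_adjusted(mmlu_pro_acc, 10),
)

print(f"raw gap: {raw_gap:.1f} points")
print(f"chance-adjusted gap: {adjusted_gap:.1f} points")
```

In this toy case the adjusted gap is slightly smaller than the raw one, because the 4-option score gets more help from guessing than the 10-option score does.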

Known limitations

Still not reasoning-heavy enough: Multiple-choice format inherently constrains reasoning depth. Even with adversarial distractors, a model can succeed through elimination or pattern matching rather than step-by-step reasoning. Open-ended benchmarks like FrontierMath or AIME better test reasoning.
Domain imbalance and shallow coverage: 14 domains sound broad, but questions are still relatively shallow, typically a few sentences each. Question counts also vary considerably across domains, so the headline score weights some subjects more heavily than others. Specialized domains like theoretical physics or advanced mathematics get only surface-level probing. MMLU-Pro tests “breadth of knowledge,” not “depth of expertise.”
Adversarial options may not transfer: The distractors were generated by specific models; they may work less effectively as a challenge for models with different architectures or training approaches, reducing the benchmark’s stability over time.
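
To see how domain imbalance can affect a headline number, the sketch below compares the overall (micro-averaged) accuracy with the unweighted mean of per-domain accuracies (macro average). The function names, domain labels, and counts are hypothetical toy values, not taken from the benchmark.

```python
from collections import defaultdict


def per_domain_accuracy(records):
    """Map each domain to its accuracy, given (domain, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for domain, is_correct in records:
        totals[domain] += 1
        hits[domain] += int(is_correct)
    return {domain: hits[domain] / totals[domain] for domain in totals}


def micro_and_macro(records):
    """Return (micro, macro) accuracy.

    micro: every question counts equally, so large domains dominate.
    macro: every domain counts equally, regardless of its size.
    """
    if not records:
        raise ValueError("need at least one record")
    by_domain = per_domain_accuracy(records)
    micro = sum(int(is_correct) for _, is_correct in records) / len(records)
    macro = sum(by_domain.values()) / len(by_domain)
    return micro, macro


# Hypothetical toy results: a large domain and a small one.
records = (
    [("math", True)] * 6
    + [("math", False)] * 2
    + [("history", True)] * 2
)

micro, macro = micro_and_macro(records)
print(per_domain_accuracy(records))
print(f"micro: {micro:.3f}, macro: {macro:.3f}")
```

When the two averages diverge noticeably, the per-domain breakdown is usually more informative than either single number.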

When to trust it (and when not to)

Trust it for: Comparing models in the 75-90% range where MMLU-Pro still meaningfully differentiates. It’s a useful proxy for “does this model have broad academic knowledge without obvious gaps?” and works well for screening models in production pipelines.
Don’t trust it alone for: Claims about reasoning ability, specialized expertise, or real-world performance. Use it as a supporting signal alongside domain-specific benchmarks (MATH for math, SWE-bench for coding, AIME for competition math). A high MMLU-Pro score doesn’t mean a model can solve novel problems or apply knowledge creatively.
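
When comparing two models on MMLU-Pro, it helps to check whether the gap exceeds sampling noise. The sketch below uses a simple binomial standard error and an unpaired two-proportion z-check. The function names are hypothetical, and the question count of 12,000 is a round-number assumption close to the released test set size. Because both models answer the same questions, a paired test such as McNemar's would usually be more sensitive; this version is only a rough screen.

```python
import math


def accuracy_standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


def confidence_interval(acc, n, z=1.96):
    """Approximate 95% interval (normal approximation) for an accuracy."""
    half_width = z * accuracy_standard_error(acc, n)
    return acc - half_width, acc + half_width


def difference_is_clear(acc_a, acc_b, n, z=1.96):
    """True if the gap between two accuracies exceeds z standard errors.

    Treats the two scores as independent samples of size n, which ignores
    the fact that both models saw the same questions.
    """
    se_diff = math.sqrt(
        accuracy_standard_error(acc_a, n) ** 2
        + accuracy_standard_error(acc_b, n) ** 2
    )
    return abs(acc_a - acc_b) > z * se_diff


N_QUESTIONS = 12_000  # assumed round figure, not the exact test set size

low, high = confidence_interval(0.80, N_QUESTIONS)
print(f"80% accuracy, 95% interval: {low:.3f} to {high:.3f}")
print("80.0% vs 82.0% clear?", difference_is_clear(0.80, 0.82, N_QUESTIONS))
print("80.0% vs 80.5% clear?", difference_is_clear(0.800, 0.805, N_QUESTIONS))
```

Under these assumptions, a two-point gap stands out from sampling noise while a half-point gap does not, so small leaderboard differences deserve caution, especially on per-domain subsets where the question count is much smaller.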