The Benchmark — MMLU (Massive Multitask Language Understanding)

A plain-English explainer of one AI evaluation benchmark: what it measures, how it works, and when to trust it.

MMLU evaluates whether language models can answer multiple-choice questions across 57 academic and professional domains at varying difficulty levels.

What it measures

MMLU tests broad factual knowledge and reasoning across humanities, social sciences, STEM, and professional fields. The benchmark includes questions from elementary school through professional certification exams—so it spans both easy recall (“What is the capital of France?”) and harder reasoning tasks (organic chemistry, economics, law). It’s designed to measure general knowledge rather than specialized depth in any single domain.

Why it was created

MMLU was introduced by Hendrycks et al. in September 2020 to fill the need for a large-scale, multidisciplinary evaluation of language model understanding. Before MMLU, most benchmarks tested narrow skills (reading comprehension, commonsense reasoning, etc.). The creators wanted to measure whether models could handle the breadth of knowledge required for human academic and professional tasks. It became the de facto standard for evaluating LLMs because it was comprehensive and straightforward to score.

How it works

MMLU contains 15,908 multiple-choice questions across 57 subjects (math, history, biology, business, law, medicine, psychology, etc.). Each question has four options (A, B, C, D), and the model is scored on accuracy: one point for the correct answer, zero otherwise. Evaluation can be few-shot (the model sees 5 example Q&A pairs from the subject's development set before each test question) or zero-shot (no examples); reported scores typically use the 5-shot setting. The headline score is overall accuracy across all test questions (equivalently, an average over the 57 subjects weighted by question count), though researchers often report per-subject breakdowns as well.
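To make the mechanics concrete, here is a minimal sketch of how a 5-shot MMLU prompt is assembled and how accuracy is scored. The header wording follows the original evaluation code's convention; the question dictionaries and their field names (`question`, `options`, `answer`) are placeholder assumptions, not real MMLU data.

```python
# Sketch of MMLU-style 5-shot prompting and accuracy scoring.
# Question dicts are placeholders: {"question": str, "options": [4 strings], "answer": "A".."D"}.

CHOICES = ["A", "B", "C", "D"]

def format_question(q, include_answer=False):
    """Render one question in the standard MMLU layout."""
    lines = [q["question"]]
    lines += [f"{letter}. {option}" for letter, option in zip(CHOICES, q["options"])]
    lines.append("Answer:" + (f" {q['answer']}" if include_answer else ""))
    return "\n".join(lines)

def build_prompt(subject, few_shot_examples, test_question):
    """Few-shot prompt: instruction header, worked examples, then the test item."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_question(q, include_answer=True) for q in few_shot_examples)
    return header + shots + "\n\n" + format_question(test_question)

def score(predictions, gold):
    """Accuracy: one point per exact letter match, zero otherwise."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

In a real run, the prompt would end with the bare `Answer:` line and the model's next-token prediction (A/B/C/D) would be compared against the gold letter.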

What scores mean in practice

Human baseline: the benchmark's creators report that unspecialized human raters score about 34.5% on average, while estimated expert-level accuracy is roughly 89.8%. Current leading models: GPT-4 reported ~86% at launch (March 2023); Claude 3.5 Sonnet reports ~88%; open-weight leaders like Llama 3.1 (405B) report ~87%. Historical context: GPT-3.5 scored ~70%; GPT-4's ~86% a few months later represented a substantial leap. A score of 70% two years ago would have been competitive; today, anything below 80% for a frontier model signals limitations.

The spread matters: models often excel in some domains (near-ceiling on introductory computer science) while struggling in others (closer to 70% on, say, virology). A model scoring 85% overall might be weaker than expected in law or medicine.
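One way to surface that spread is to aggregate per-subject results two ways, micro (weighted by question count, the usual headline number) and macro (unweighted), and list the weakest subjects. A small sketch; the subject names mirror MMLU's naming style but every number below is invented for illustration:

```python
# Micro vs. macro averaging over per-subject results; all figures are made up.
# results maps subject -> (num_correct, num_questions).

def aggregate(results):
    """Return (micro accuracy, macro accuracy, three weakest subjects)."""
    total_correct = sum(c for c, n in results.values())
    total_questions = sum(n for c, n in results.values())
    micro = total_correct / total_questions                           # weighted by subject size
    macro = sum(c / n for c, n in results.values()) / len(results)    # each subject counts equally
    weakest = sorted(results, key=lambda s: results[s][0] / results[s][1])[:3]
    return micro, macro, weakest

results = {
    "high_school_computer_science": (99, 100),    # near-ceiling
    "college_mathematics": (70, 100),
    "virology": (52, 80),
    "professional_law": (1100, 1500),             # large subject dominates the micro average
}

micro, macro, weakest = aggregate(results)
```

Here the micro average (~0.74) is pulled toward the largest subject, while the macro average (~0.77) treats all subjects equally; neither single number reveals that virology sits at 65%.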

Known limitations

  • Saturation risk: Top models now exceed 85%, leaving less room to differentiate between very capable systems. MMLU-Pro and other variants exist partly because the original MMLU is becoming less useful for ranking leading models.
  • Multiple-choice bias: The four-option format is easier than open-ended questions. Models can exploit patterns in wrong answers without true understanding. Real-world knowledge tasks often require generating explanations, not selecting from options.
  • Dataset contamination: Evidence suggests some MMLU questions (or near-duplicates) appear in the training data of recent models, inflating scores. Model developers run decontamination checks, but the degree of overlap is hard to verify from the outside.
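A common way such contamination checks work is a crude n-gram overlap test: if long word sequences from a benchmark question appear verbatim in the training corpus, the item is flagged as suspect. A minimal sketch, assuming a 13-word window (a size used in some published decontamination checks); the corpus string here is a placeholder, not real training data:

```python
# Crude n-gram contamination check: flag benchmark questions whose word
# 13-grams appear verbatim in a training corpus. Corpus text is a placeholder.

def ngrams(text, n=13):
    """Set of word n-grams in text (lowercased); empty if text has fewer than n words."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question, corpus_ngrams, n=13):
    """True if any word n-gram of the question appears verbatim in the corpus."""
    return bool(ngrams(question, n) & corpus_ngrams)

# Placeholder training corpus; real checks stream over terabytes of text.
corpus = ("which enzyme catalyzes the first committed step of glycolysis "
          "in most mammalian tissues under normal physiological conditions")
corpus_ngrams = ngrams(corpus)
```

Real decontamination pipelines are fuzzier (normalizing punctuation, using character-level overlap, or hashing n-grams to fit in memory), but the core idea is this exact-match test.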

When to trust it (and when not to)

  • Trust it for: Broad-brush comparisons of general knowledge across model families (e.g., is Claude or Llama stronger overall?), and screening for major capability gaps. MMLU is still useful as a quick health check.
  • Don’t trust it alone for: Specialized evaluations (if you care about medical reasoning, use specialized medical QA benchmarks), measuring reasoning depth (the format doesn’t require it), or differentiating between current frontier models (the gap is too small and saturation is real).

For practical use: treat MMLU as one signal in a suite of benchmarks, not as the primary measure of model quality.