The Benchmark — SuperGLUE — Stochastic Sandbox

SuperGLUE

A collection of eight challenging natural language understanding tasks designed to evaluate whether language models can handle diverse reasoning and linguistic phenomena beyond surface-level pattern matching.

What it measures

SuperGLUE tests a range of NLU capabilities: textual entailment, question answering, coreference resolution, word sense disambiguation, and commonsense reasoning. The benchmark includes tasks like determining whether one sentence logically follows from another (RTE), identifying which noun a pronoun refers to (WSC), and answering reading-comprehension questions that require inference (MultiRC, ReCoRD). It is intentionally harder than its predecessor GLUE, which models had begun to saturate.

Why it was created

GLUE, released in 2018, had become too easy: by 2019, top models had surpassed the non-expert human baseline on the overall benchmark. SuperGLUE (published in 2019 by Wang et al., with authors from NYU, Facebook AI Research, the University of Washington, and DeepMind) was created to provide a more challenging benchmark that would remain diagnostic for years. The authors wanted tasks that required genuine understanding rather than shallow heuristics: harder entailment cases, adversarially constructed examples, and phenomena where humans themselves sometimes disagree.

How it works

SuperGLUE contains eight tasks whose sizes vary widely, from a few hundred training examples (CB, COPA, WSC) to roughly 100K (ReCoRD). Scoring is task-specific: some tasks use accuracy, others use F1 or exact match, and the accompanying diagnostic set uses the Matthews correlation coefficient (a stricter metric for imbalanced datasets). The final SuperGLUE score is an unweighted average across all eight tasks, with the metrics of a multi-metric task averaged first, reported on a 0-100 scale where 100 is theoretically perfect. Tasks include BoolQ (yes/no questions requiring inference, ~9.4K training examples), MultiRC (reading comprehension over multi-sentence passages, where a question can have several correct answers; ~27K question-answer pairs in training), and RTE (textual entailment, ~2.5K training examples). Human performance on the benchmark, measured through crowdsourcing, is approximately 89.8%.
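The scoring scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation code: the function names and the per-task numbers are hypothetical, and the sketch assumes the overall score weights every task equally after first averaging the metrics within a task.

```python
import math
from statistics import mean


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def matthews_corrcoef(gold, pred):
    """Binary Matthews correlation coefficient (labels are 0 or 1), in [-1, 1]."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when the coefficient is undefined (zero denominator).
    return (tp * tn - fp * fn) / denom if denom else 0.0


def overall_score(task_metrics):
    """Average metrics within each task, then across tasks, on a 0-100 scale."""
    per_task = [mean(metrics.values()) for metrics in task_metrics.values()]
    return 100 * mean(per_task)


# Hypothetical per-task results, for illustration only.
example = {
    "task_a": {"accuracy": 0.80},
    "task_b": {"f1": 0.60, "exact_match": 0.40},
}
print(overall_score(example))
```

Averaging within a task first keeps a task that reports two metrics from counting twice in the final number.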

What scores mean in practice

Human baseline: 89.8%
Current leading models (early 2024): GPT-4 reportedly scores ~96% and Claude 3 Opus ~95%, with other strong models (Llama 3, Mistral Large) reaching ~88-91%
SOTA two years ago (early 2022): ~90-91% (ST-MoE-32B, PaLM)
Interpretation: A score of 90+ indicates strong general language understanding. Below 85% suggests significant gaps in reasoning or in handling particular linguistic phenomena. Scores at the top have compressed significantly: models now consistently beat the human baseline, so the benchmark is less discriminative than intended.
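One reason small gaps near the top are hard to interpret is sampling noise. The sketch below is a rough illustration under simplifying assumptions: it treats a single task's accuracy as a binomial proportion, uses a normal approximation, and picks hypothetical test-set sizes. The function name `accuracy_margin` is made up for this example.

```python
import math


def accuracy_margin(acc, n, z=1.96):
    """Half-width, in points, of a normal-approximation 95% interval for an accuracy."""
    return 100 * z * math.sqrt(acc * (1 - acc) / n)


# Hypothetical test-set sizes, for illustration only.
for n in (300, 3000, 30000):
    print(f"n={n:>6}: 90% accuracy +/- {accuracy_margin(0.90, n):.1f} points")
# n=   300: 90% accuracy +/- 3.4 points
# n=  3000: 90% accuracy +/- 1.1 points
# n= 30000: 90% accuracy +/- 0.3 points
```

On a test set of a few hundred examples, a gap of two or three points between two models can sit inside the noise. The overall score averages several tasks, so its uncertainty is smaller than any single small task's, but the same caution applies.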

Known limitations

Saturation and ceiling effects: Most frontier models now exceed the human baseline (89.8%), which makes it difficult to differentiate between strong models. A 2-3 point spread at the top says little about practical capability differences.
Small dataset sizes and potential contamination: Some tasks are small (RTE has about 2.5K training examples), so a handful of examples can move the score noticeably, and any test-set leakage into training data has an outsized effect. By 2023, researchers had documented evidence of SuperGLUE contamination in web-trained models.
Narrow linguistic phenomena: While diverse, these eight tasks don’t cover emerging challenges like long-context reasoning, structured knowledge, multi-hop reasoning at scale, or robustness to adversarial examples. Performance here doesn’t guarantee real-world robustness.
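For the contamination concern above, a common first-pass heuristic is to check how many of a benchmark example's word n-grams also appear in the training corpus. The sketch below is a simplified illustration, not a method taken from any particular study: the function names are hypothetical, tokenization is plain whitespace splitting, and the default n-gram length of 8 is an arbitrary assumption.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in the text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(example, corpus_docs, n=8):
    """Share of the example's n-grams that also occur in any corpus document."""
    example_grams = ngrams(example, n)
    if not example_grams:
        return 0.0  # example is shorter than n tokens
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(example_grams & corpus_grams) / len(example_grams)


# Toy data, for illustration only.
test_example = "the cat sat on the mat today"
training_docs = ["yesterday the cat sat on the mat"]
print(overlap_fraction(test_example, training_docs, n=3))  # 0.8
```

A high overlap fraction flags an example for manual review; it does not prove that a model memorized it. At real corpus scale the in-memory set would typically be replaced by a hashed or on-disk index.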

When to trust it (and when not to)

Trust it for: Comparing models in the 80-89% range, where there’s still meaningful spread. Also useful as one signal in a portfolio of benchmarks to confirm general NLU competence. It remains valuable for detecting regression.
Don’t rely on it alone for: Differentiating between state-of-the-art models (too saturated), assessing reasoning on novel domains, or making claims about human-level understanding. A high SuperGLUE score in 2024 is almost table stakes, not a distinguishing claim. Pair it with GPQA, MATH, or domain-specific evaluations.
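The regression-detection use mentioned above can be sketched as a per-task comparison between two model versions. The helper name, the tolerance, and the scores are all hypothetical; in practice the tolerance should reflect how noisy each task is.

```python
def find_regressions(baseline, candidate, tolerance=1.0):
    """Return tasks whose score dropped by more than `tolerance` points."""
    return {
        task: round(candidate[task] - baseline[task], 2)
        for task in baseline
        if task in candidate and baseline[task] - candidate[task] > tolerance
    }


# Hypothetical per-task scores (0-100), for illustration only.
baseline_scores = {"BoolQ": 90.0, "RTE": 88.0, "WSC": 93.0}
candidate_scores = {"BoolQ": 90.3, "RTE": 84.5, "WSC": 92.5}
print(find_regressions(baseline_scores, candidate_scores))  # {'RTE': -3.5}
```

Comparing per task rather than on the overall average avoids a gain on one task hiding a drop on another.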