The Benchmark — TruthfulQA — Stochastic Sandbox

TruthfulQA

Measures whether language models answer questions truthfully rather than defaulting to plausible-sounding falsehoods or popular misconceptions.

What it measures

TruthfulQA tests a model’s ability to distinguish truth from falsity on factual questions where humans commonly believe wrong answers. The benchmark includes 817 questions across 38 categories: misconceptions in health, history, finance, conspiracy theories, urban legends, and other domains where confident-sounding lies are easy to generate. Unlike knowledge benchmarks that reward “knowing the right answer,” TruthfulQA penalizes models that output widely-believed falsehoods—even if those falsehoods appear frequently in training data.

Why it was created

Created by Stephanie Lin, Jacob Hilton, and Owain Evans (University of Oxford and OpenAI) and first released in September 2021, TruthfulQA was designed to address a specific failure mode: language models that sound authoritative while being wrong. The authors found that, among the model families they tested, the largest models were generally the least truthful, which they attributed to larger models being better at imitating the human falsehoods in their training data. They built this benchmark to measure something that does not automatically improve with capability: whether models tell the truth when they do generate answers, especially on topics saturated with misinformation in their training data.

How it works

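Evaluators ask a model all 817 questions, each of which comes with reference true and false answers. In the main task the model generates a short free-form response rather than picking from options; the benchmark also defines multiple-choice variants (MC1 and MC2) that score the probabilities a model assigns to the reference answers. Each generated answer is judged on two dimensions: (1) truthfulness (does the answer avoid asserting anything false?) and (2) informativeness (does it actually address the question?). In the original paper, human evaluators made these judgments and each one was reduced to a binary label; the authors also fine-tuned a GPT-3 model (“GPT-judge”) to predict the human labels so that scoring could be automated. Non-committal answers such as “I have no comment” count as truthful but not informative. Raters do disagree on some answers, and those disagreements tend to mark ambiguous cases. The benchmark reports two headline numbers: the percentage of answers that are true, and the percentage that are both true and informative, since a model could be truthful but uninformative.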
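The two headline percentages can be sketched in a few lines. This is a minimal illustration, assuming every answer has already been reduced to a binary truthful label and a binary informative label; the `Judgment` record and the `score` function are hypothetical names, not part of any official TruthfulQA tooling.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Binary labels for one model answer (hypothetical record format)."""
    truthful: bool
    informative: bool


def score(judgments):
    """Return (% true, % true and informative) over a list of judgments."""
    n = len(judgments)
    if n == 0:
        raise ValueError("no judgments to score")
    pct_true = 100 * sum(j.truthful for j in judgments) / n
    # The combined score counts answers that are BOTH true and informative;
    # it is a per-answer conjunction, not a product of two averages.
    pct_both = 100 * sum(j.truthful and j.informative for j in judgments) / n
    return pct_true, pct_both


# Toy data: four answers, one of them a non-committal "I have no comment".
answers = [
    Judgment(truthful=True, informative=True),
    Judgment(truthful=True, informative=False),   # non-committal answer
    Judgment(truthful=False, informative=True),   # confident falsehood
    Judgment(truthful=True, informative=True),
]
pct_true, pct_both = score(answers)

# A model that declines every one of the 817 questions.
abstainer = [Judgment(truthful=True, informative=False)] * 817
abstainer_true, abstainer_both = score(abstainer)
```

On the toy data, three of four answers are true and two are both true and informative. The all-abstaining model is the reason the second number exists: it is never wrong, but it is never useful either.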

What scores mean in practice

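In the original paper, the human baseline was truthful on about 94% of questions, and about 87% of human answers were both true and informative. GPT-3 (the original 175B model from 2020) reached only 58% truthfulness, and that was the best result in the paper, obtained with a prompt that encouraged helpful, careful answers; its combined score was 47%. In other words, it generated confident-sounding falsehoods regularly. As of early 2024, frontier models perform better: Claude 3 Opus reaches ~92% truthfulness, GPT-4 around 86%, and Llama 2 around 62%. Published figures for newer models come from different evaluation setups (human-rated generation, automated judges, or the multiple-choice variants), so they are not strictly comparable with each other or with the original paper’s numbers. They still sit below the human baseline, meaning current models generate falsehoods on a meaningful fraction of questions. For context, a naive baseline that always answers “I have no comment” would score close to 100% on truthfulness, because the benchmark counts non-committal answers as true, but 0% on informativeness. That is why the combined score matters.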

Known limitations

Subjectivity in rating: Truthfulness judgments aren’t always binary. Questions like “Is homeopathy effective?” involve philosophical disagreement about what “effective” means. Inter-rater agreement exists but isn’t perfect (typically ~70-80% on edge cases), making exact scores somewhat noisy.
Limited domain coverage: 817 questions is small by modern benchmark standards, and they cluster heavily in health/conspiracy/history. Performance on these categories doesn’t necessarily predict truthfulness on scientific papers, legal documents, or technical specifications where models also operate.
Gaming and data contamination concerns: TruthfulQA has become widely known in model development. Some labs may have specifically fine-tuned models to improve scores, and because the questions and reference answers are public, models with later training data cutoffs may have seen these exact questions, or discussions of them, during training.
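To make the rating-subjectivity point concrete, here is a minimal sketch of two standard agreement measures: raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. It assumes two raters have each given a binary truthful label to the same answers; the function name and the toy labels are illustrative, not taken from the benchmark's data.

```python
def agreement(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa) for two binary label lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: both say "true" or both say "false" independently.
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both raters gave one identical constant label; kappa is undefined.
        return observed, float("nan")
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Toy labels for ten answers (True = rated truthful).
rater_a = [True, True, True, False, False, True, False, True, True, False]
rater_b = [True, True, False, False, False, True, True, True, True, False]
observed, kappa = agreement(rater_a, rater_b)
```

Here the raters agree on 8 of 10 answers, yet kappa comes out near 0.58, because two raters who each label 60% of answers as truthful would agree about half the time by chance alone. Raw agreement percentages therefore tend to look more reassuring than they are.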

When to trust it (and when not to)

Trust it for: Comparing two models’ tendency to hallucinate confident falsehoods on common-knowledge questions. It’s a useful sanity check—a model scoring 50% is clearly worse at avoiding misinformation than one scoring 85%, even if the absolute numbers are hard to interpret.
Don’t trust it alone for: Claims about truthfulness on specialized domains (medicine, law, science) where the benchmark is sparse. Also don’t use it to predict whether a model will hallucinate in long-context reasoning tasks or when generating novel content outside the benchmark’s scope.
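One way to judge whether a gap between two models is meaningful is to put a confidence interval around each score. The sketch below uses the Wilson score interval for a proportion, assuming each of the 817 answers is an independent true/false outcome. That is a simplification: both models answer the same questions, so a paired test would be more precise. The function name and the counts are made up for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


N = 817  # number of TruthfulQA questions

# Hypothetical counts of truthful answers for three models.
low_a, high_a = wilson_interval(408, N)   # about 50% truthful
low_b, high_b = wilson_interval(694, N)   # about 85% truthful
low_c, high_c = wilson_interval(710, N)   # about 87% truthful

clearly_different = high_a < low_b        # intervals do not overlap
too_close_to_call = low_c < high_b        # intervals overlap
```

With 817 questions each interval is a few percentage points wide on either side, so a 50% versus 85% gap is unambiguous, while a gap of a couple of points between two strong models falls within sampling noise under these assumptions.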