17 editions

The Benchmark

Series

July 2026

JUL 21 The Benchmark — ARC-Challenge (AI2 Reasoning Challenge)
A plain-English explainer of one AI evaluation benchmark: what it measures, how it works, and when to trust it.
JUL 14 The Benchmark — HellaSwag
A plain-English explainer of one AI evaluation benchmark: what it measures, how it works, and when to trust it.
JUL 7 The Benchmark — BIG-Bench Hard
A plain-English explainer of one AI evaluation benchmark: what it measures, how it works, and when to trust it.

June 2026

May 2026

April 2026

March 2026

MAR 31 The Benchmark — MMLU (Massive Multitask Language Understanding)
A plain-English explainer of one AI evaluation benchmark: what it measures, how it works, and when to trust it.