The Benchmark — HumanEval
A plain-English explainer of one AI evaluation benchmark: what it measures, how it works, and when to trust it.
HumanEval
Evaluates whether code-generating AI models can write functionally correct Python functions from natural language specifications.
What it measures
HumanEval tests practical programming ability: given a function signature, docstring, and a few test cases as examples, can a model generate code that passes all hidden test cases for that function? The benchmark covers basic algorithmic tasks like string manipulation, list operations, math, and simple data structure problems. It specifically measures whether generated code actually works, not whether it looks reasonable or is stylistically good.
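To make the format concrete, here is a sketch of what a problem looks like. The function, docstring, and tests below are illustrative stand-ins, not an actual item from the dataset; real problems follow the same shape, with the hidden tests packaged as a check() function of asserts.

```python
# Illustrative HumanEval-style problem (made up for this explainer, not
# taken from the real dataset). The model sees only the signature and
# docstring and must produce the function body.

def count_vowels(s: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in s.

    >>> count_vowels("hello")
    2
    >>> count_vowels("rhythm")
    0
    """
    # <model-generated body goes here>


# Hidden unit tests, kept out of the prompt, exercise additional cases:
def check(candidate):
    assert candidate("hello") == 2
    assert candidate("rhythm") == 0
    assert candidate("") == 0
    assert candidate("aeiou") == 5
```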
Why it was created
OpenAI released HumanEval in July 2021, alongside the Codex paper, to fill a gap: existing code benchmarks were either too easy (solvable by matching snippets from training data) or focused on code understanding rather than generation. The benchmark was designed to measure genuine coding ability on tasks that weren't obviously memorized, at a moment when code generation from language models was suddenly becoming viable.
How it works
The benchmark contains 164 hand-written Python programming problems. Each problem provides a function signature, a docstring describing what the function should do, one or two worked examples in the docstring, and a hidden suite of unit tests. Models generate candidate solutions, which are executed against the hidden tests. A solution either passes every test or fails; the headline metric, pass@k, is the probability that at least one of k sampled solutions passes, and most reported numbers are pass@1. Evaluation requires actually running the generated code, which introduces execution-safety considerations but means correctness is judged by behavior rather than by how plausible the code looks.
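To make the mechanics concrete, here is a minimal sketch of the scoring loop, assuming the dataset's published fields (prompt, test, entry_point). It uses exec() directly for brevity; real harnesses run candidates in sandboxed subprocesses with timeouts. The pass_at_k estimator is the unbiased formula from the original paper.

```python
import numpy as np


def run_candidate(problem: dict, completion: str) -> bool:
    """Execute one candidate solution against a problem's hidden tests.

    `problem` is assumed to carry the dataset's fields: "prompt" (signature +
    docstring), "test" (a check() function full of asserts), and
    "entry_point" (the function name). Returns True only if every assert passes.

    WARNING: exec() on untrusted model output is unsafe. Real harnesses use
    sandboxed subprocesses with time and memory limits; this is only a sketch.
    """
    program = (
        problem["prompt"]
        + completion
        + "\n"
        + problem["test"]
        + f"\ncheck({problem['entry_point']})\n"
    )
    try:
        exec(program, {"__name__": "__humaneval_check__"})
        return True
    except Exception:
        return False


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n = samples drawn per problem, c = samples that passed, k = budget.
    Equivalent to 1 - C(n-c, k) / C(n, k), computed in a numerically
    stable way.
    """
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

On a full run, you draw n completions for each of the 164 problems, score them with run_candidate, average pass_at_k over problems, and report that average. Headline numbers are often pass@1 from a single greedy sample per problem, in which case the metric reduces to the plain fraction of problems solved.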
What scores mean in practice
Human programmers, evaluated as a rough sanity check, are commonly cited at around 80% on HumanEval. GPT-4 scored 67% at launch (March 2023), with later GPT-4 variants reported in the high 80s; Claude 3.5 Sonnet scores 92% (June 2024) and Claude 3 Opus about 85%. Two years ago (late 2022), reaching 50% was genuinely impressive; GPT-3.5 scored around 48%. The benchmark has shown clear capability growth, but the spread between top models is now narrow. Scoring 85% or better is now table stakes for any model claiming serious coding ability.
Known limitations
- Dataset contamination and memorization: HumanEval problems are relatively simple and may appear in training data in identical or similar form online. Models may be pattern-matching rather than genuinely reasoning about novel problems. Researchers have documented that even small variations in problem wording can break performance, suggesting brittleness.
- Shallow task distribution: All 164 problems are short, self-contained functions with clear specifications. Real programming involves refactoring, debugging existing code, understanding large codebases, API design, and handling ambiguous requirements, none of which HumanEval tests.
- Saturation among leading models: With top models clustered around 90%, the benchmark is running out of discriminative power. Successors such as HumanEval+ (the same problems with far stricter test suites) and LiveCodeBench (continuously updated, harder problems) exist precisely because HumanEval no longer differentiates well.
When to trust it (and when not to)
- Trust it for: a basic signal on whether a model can generate working code from a specification at all. Good for screening whether a new model is completely non-functional or has minimum coding competence. Useful as one component of a broader evaluation.
- Don't trust it alone for: claims about real-world coding ability, production readiness, or meaningful performance differences between models scoring 85% or better. If you're choosing between two strong models, HumanEval alone won't tell you which is better for your use case. It is also unreliable for models whose training data may include the benchmark itself; check contamination studies first.