The Benchmark — SWE-bench
A plain-English explainer of one AI evaluation benchmark: what it measures, how it works, and when to trust it.
SWE-bench
Evaluates whether language models can autonomously resolve GitHub issues by writing and modifying code.
What it measures
SWE-bench tests a model’s ability to locate bugs, understand codebases, and generate fixes for real software engineering problems. Rather than isolated coding snippets, it uses actual pull requests from open-source Python repositories, asking models to produce diffs that resolve reported issues. This measures practical software engineering capability—reading error messages, navigating unfamiliar code, and producing patches that pass existing tests.
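To make “produce diffs that resolve reported issues” concrete, here is a toy sketch of the expected output format. The repository, file, and issue below are hypothetical and much smaller than real instances, but the shape is the same: the model emits a unified diff against the checked-out repository, not a standalone snippet.

```python
# Hypothetical issue: "divide() crashes with ZeroDivisionError instead of
# raising a clear error when b == 0". The model's job is to emit a patch
# like the one below, which is then applied to the repo (e.g. `git apply`)
# and judged by running the project's tests.

candidate_patch = """\
diff --git a/mathutils/ops.py b/mathutils/ops.py
--- a/mathutils/ops.py
+++ b/mathutils/ops.py
@@ -10,2 +10,4 @@
 def divide(a, b):
-    return a / b
+    if b == 0:
+        raise ValueError("divisor must be non-zero")
+    return a / b
"""
```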
Why it was created
Existing coding benchmarks like HumanEval tested isolated problem-solving but not real-world software maintenance, where engineers spend most of their time debugging and modifying existing systems. SWE-bench was introduced in late 2023 by researchers at Princeton and the University of Chicago to bridge this gap. It arrived as models were being credited with ever-greater software engineering competence that had not been rigorously evaluated on realistic tasks.
How it works
SWE-bench contains 2,294 real GitHub issues from 12 popular Python projects (Django, scikit-learn, SymPy, etc.). For each issue, the benchmark provides the problem description, the repository code at a specific commit, and the ground-truth fix from the merged pull request; the PR’s accompanying test changes are held out and used for evaluation. Models attempt to generate a patch that resolves the issue. Scoring uses two methods: (1) resolved/unresolved, where the patch must make the issue’s previously failing tests pass without breaking the tests that already passed; and (2) exact match, where the generated diff must reproduce the real PR. The test-based metric is the primary one, since many different patches can legitimately fix an issue; exact matching is unambiguous but rejects valid alternative fixes.
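Here is a minimal sketch of what one task instance looks like and how “resolved” is decided, assuming the field names of the published dataset (FAIL_TO_PASS for tests the real PR added, PASS_TO_PASS for tests that must keep passing); the values are invented for illustration, and the Hugging Face dataset path should be double-checked before relying on it.

```python
from datasets import load_dataset

# Load the benchmark (dataset path as commonly published; verify before use).
swebench = load_dataset("princeton-nlp/SWE-bench", split="test")

# One instance, abbreviated, with invented values for illustration:
instance = {
    "instance_id": "django__django-12345",           # hypothetical ID
    "repo": "django/django",
    "base_commit": "abc123",                          # repo state the model starts from
    "problem_statement": "QuerySet.union() raises TypeError when ...",
    "patch": "diff --git a/django/db/models/query.py ...",        # ground-truth fix
    "test_patch": "diff --git a/tests/queries/tests.py ...",      # tests added by the PR
    "FAIL_TO_PASS": ["tests/queries/tests.py::test_union_none"],  # must go red -> green
    "PASS_TO_PASS": ["tests/queries/tests.py::test_union_basic"], # must stay green
}

def is_resolved(instance: dict, test_results: dict[str, bool]) -> bool:
    """Test-based metric: a model's patch resolves the issue when every
    FAIL_TO_PASS test now passes and every PASS_TO_PASS test still passes."""
    needed = instance["FAIL_TO_PASS"] + instance["PASS_TO_PASS"]
    return all(test_results.get(test, False) for test in needed)
```

The outcome is binary per instance: a patch that applies cleanly but leaves any required test failing counts as unresolved, and the headline score is simply the fraction of instances resolved.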
What scores mean in practice
Claude 3.5 Sonnet (Oct 2024) achieves ~25% on the test-based metric with agentic scaffolding—meaning it successfully resolves about 1 in 4 real GitHub issues. GPT-4o scores around 19% under similar conditions. For context, human software engineers would likely resolve 70-80%+ of issues in codebases they’re moderately familiar with. When the benchmark was released in late 2023, the best baselines resolved under 5% of issues, and a 5-10% score was still respectable in early 2024; 25% represents substantial progress but also clarifies how far models remain from autonomous software maintenance.
Known limitations
- Survivorship bias: The benchmark uses resolved issues (merged PRs), excluding rejected fixes or issues that community members abandoned. Models might struggle more on edge cases and disputed issues that never made it to the dataset.
- Testing incompleteness: A fix passing existing tests doesn’t guarantee correctness—many real codebases have incomplete test coverage. Models could generate patches that work on the test suite but break functionality in scenarios the tests don’t cover.
- Contamination risk: Popular Python repositories may be well-represented in training data. It’s unclear how much of models’ performance reflects genuine reasoning vs. memorized patterns from similar code they’ve seen. The exact-match scoring is particularly vulnerable to this.
When to trust it (and when not to)
- Trust it for: Comparing relative progress on realistic software tasks and identifying which models are approaching practical usefulness for code review or minor refactoring. It’s concrete and reproducible in ways most coding benchmarks aren’t.
- Don’t trust it alone for: Concluding that a model with a 25% score is “25% as capable” as a human engineer—the distribution of difficulty is unknown, and real-world software work requires context-switching, communication, and debugging strategies beyond patch generation. Use it as one signal alongside code quality evaluations and practical testing in your specific codebase.