Paper of the Week — Training Language Models to Self-Correct via Reinforcement Learning
SCoRe trains a single LLM to catch and fix its own mistakes via RL — 15.6% better on math, 9.1% on code, no multi-model pipeline needed.
Training Language Models to Self-Correct via Reinforcement Learning
Kumar, Zhuang, Agarwal, et al. (Google DeepMind). Published September 2024. arXiv:2409.12917
One sentence summary
SCoRe shows you can train a single language model to reliably identify and fix its own mistakes, eliminating the need for complex multi-model correction pipelines while improving accuracy across reasoning tasks.
Why this paper
Self-correction is now baked into most frontier model APIs (Claude’s extended thinking, OpenAI’s o-series), but the underlying technique is poorly understood by practitioners. SCoRe is the clearest demonstration of how RL-based self-correction actually works, and its findings directly inform when to trust — and when not to trust — iterative refinement in production.
What they did
The researchers developed SCoRe (Self-Correction via Reinforcement Learning), a method that trains a language model to identify and fix its own mistakes without external feedback, separate verifier models, or multi-model pipelines at inference time. Instead of pairing a generator with a critic, they use multi-turn RL on entirely self-generated data to teach a single model to produce an answer and then improve it on a second attempt.
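The core pattern is a two-turn rollout: the model answers, then sees its own answer alongside a correction cue and answers again. A minimal sketch of that loop, where `generate` is a toy stand-in for a real model call and the correction prompt wording is illustrative, not the paper's exact prompt:

```python
# Two-turn self-correction rollout (sketch). `generate` is a toy
# stand-in so the example runs end to end; a real system would call
# the trained model here.

CORRECTION_PROMPT = (
    "There might be an error in the solution above. "
    "Please correct it if needed and give a final answer."
)

def generate(prompt: str) -> str:
    """Toy model: pretends to fix '2+2=5' when asked to double-check."""
    return "2+2=4" if CORRECTION_PROMPT in prompt else "2+2=5"

def self_correct_rollout(question: str) -> tuple[str, str]:
    """Return (first_attempt, revised_attempt) from a single model."""
    first = generate(question)
    # Second turn: the same model sees its own answer plus a correction cue.
    revised = generate(f"{question}\n{first}\n{CORRECTION_PROMPT}")
    return first, revised

first, revised = self_correct_rollout("What is 2+2?")
```

The same two-turn structure is used both during RL training (where correctness of each attempt drives the reward) and at inference.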
Key findings
- Models trained with SCoRe improved their self-correction by 15.6% on math problems (MATH) and 9.1% on code generation (HumanEval) compared to the base models
- Standard supervised fine-tuning on correction traces fails here: models either collapse into making only superficial edits to their first answer or overfit to the training distribution of mistakes
- SCoRe trains in two stages: the first produces an initialization that improves second attempts while keeping first attempts close to the base model, and the second runs multi-turn RL with reward shaping that favors flipping wrong answers to right ones
- Trained models show a positive correction delta — second attempts fix more answers than they break — whereas base models often make answers worse on a second pass
- Unlike previous methods, SCoRe requires neither a separately trained critic model nor ground-truth feedback at inference time
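The reward-shaping idea behind the RL stage fits in a few lines: the second attempt earns a base correctness reward plus a bonus for flipping a wrong first attempt to a right answer, and a penalty for the reverse. A sketch of that shaping, where `alpha` is an illustrative coefficient and not a value reported in the paper:

```python
def shaped_reward(first_correct: bool, second_correct: bool,
                  alpha: float = 0.5) -> float:
    """Reward for the second attempt with a self-correction bonus.

    alpha is an illustrative shaping coefficient (an assumption of this
    sketch, not a number from the paper).
    """
    reward = float(second_correct)          # base correctness reward
    if not first_correct and second_correct:
        reward += alpha                     # bonus: wrong -> right flip
    elif first_correct and not second_correct:
        reward -= alpha                     # penalty: right -> wrong flip
    return reward
```

The asymmetry matters: without the bonus term, a policy can score well by simply repeating a decent first answer, which is exactly the collapse mode the paper observes with naive training.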
Why it matters for practitioners
This addresses one of the biggest pain points in production LLM deployments: getting models to catch and fix their own errors reliably. Instead of building multi-model pipelines with separate verification steps, you can train a single model to revise its own outputs, cutting inference cost and latency while improving output quality.
What you can use today
- The training methodology applies to any base model — the paper spells out the two-stage recipe in enough detail to adapt it to your domain-specific tasks
- Apply the multi-round inference pattern: prompt your current models to review and revise their outputs, even without SCoRe training, for immediate quality improvements
- Use their evaluation framework to benchmark self-correction capabilities in your own models before and after fine-tuning
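Benchmarking self-correction along these lines reduces to three numbers per model: first-attempt accuracy, post-correction accuracy, and their difference. A minimal sketch, assuming per-example records with hypothetical boolean fields `first_correct` and `revised_correct`:

```python
def self_correction_metrics(results: list[dict]) -> dict:
    """Summarize two-turn accuracy from per-example records.

    Each record is assumed to carry boolean fields 'first_correct' and
    'revised_correct' (hypothetical names for this sketch).
    """
    n = len(results)
    acc_t1 = sum(r["first_correct"] for r in results) / n
    acc_t2 = sum(r["revised_correct"] for r in results) / n
    return {
        "accuracy@t1": acc_t1,
        "accuracy@t2": acc_t2,
        # Positive delta: correction helps. Negative delta: the second
        # pass degrades answers, and iterative refinement should be off.
        "delta": acc_t2 - acc_t1,
    }

metrics = self_correction_metrics([
    {"first_correct": True,  "revised_correct": True},
    {"first_correct": False, "revised_correct": True},
    {"first_correct": False, "revised_correct": False},
    {"first_correct": True,  "revised_correct": False},
])
```

Tracking the delta, not just final accuracy, is the actionable part: it tells you whether a second pass in production is earning its extra latency or quietly breaking correct answers.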