Paper of the Week — Training Language Models to Self-Correct via Reinforcement Learning
SCoRe trains a single LLM to catch and fix its own mistakes via RL — 15.6% better on math, 9.1% on code, no multi-model pipeline needed.
Training Language Models to Self-Correct via Reinforcement Learning
Kumar, Zhuang, Agarwal, et al. (Google DeepMind). Published September 2024. arXiv:2409.12917
One sentence summary
SCoRe shows you can train a single language model to reliably identify and fix its own mistakes, eliminating the need for complex multi-model correction pipelines while improving accuracy across reasoning tasks.
Why this paper
Self-correction is now baked into most frontier model APIs (Claude’s extended thinking, OpenAI’s o-series), but the underlying technique is poorly understood by practitioners. SCoRe is the clearest demonstration of how RL-based self-correction actually works, and its findings directly inform when to trust — and when not to trust — iterative refinement in production.
What they did
The researchers developed SCoRe (Self-Correction via Reinforcement Learning), a method that trains a language model to identify and fix its own mistakes without external feedback, separate verifier models, or multi-model pipelines at inference time. Instead of pairing a generator with a critic, they use multi-turn RL on entirely self-generated data to teach a single model to produce an answer and then improve it on a second attempt.
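The core pattern is a two-turn rollout: the model answers, then sees its own answer alongside a correction cue and answers again. A minimal sketch of that loop, where `generate` is a toy stand-in for a real model call and the correction prompt wording is illustrative, not the paper's exact prompt:

```python
# Two-turn self-correction rollout (sketch). `generate` is a toy
# stand-in so the example runs end to end; a real system would call
# the trained model here.

CORRECTION_PROMPT = (
    "There might be an error in the solution above. "
    "Please correct it if needed and give a final answer."
)

def generate(prompt: str) -> str:
    """Toy model: pretends to fix '2+2=5' when asked to double-check."""
    return "2+2=4" if CORRECTION_PROMPT in prompt else "2+2=5"

def self_correct_rollout(question: str) -> tuple[str, str]:
    """Return (first_attempt, revised_attempt) from a single model."""
    first = generate(question)
    # Second turn: the same model sees its own answer plus a correction cue.
    revised = generate(f"{question}\n{first}\n{CORRECTION_PROMPT}")
    return first, revised

first, revised = self_correct_rollout("What is 2+2?")
```

The same two-turn structure is used both during RL training (where correctness of each attempt drives the reward) and at inference.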
Key findings
- Models trained with SCoRe improved their self-correction by 15.6% on math problems (MATH) and 9.1% on code generation (HumanEval) compared to the base models
- Standard supervised fine-tuning on correction traces fails here: models either collapse into making only superficial edits to their first answer or overfit to the training distribution of mistakes
- SCoRe trains in two stages: the first produces an initialization that improves second attempts while keeping first attempts close to the base model, and the second runs multi-turn RL with reward shaping that favors flipping wrong answers to right ones
- Trained models show a positive correction delta — second attempts fix more answers than they break — whereas base models often make answers worse on a second pass
- Unlike previous methods, SCoRe requires neither a separately trained critic model nor ground-truth feedback at inference time
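The reward-shaping idea behind the RL stage fits in a few lines: the second attempt earns a base correctness reward plus a bonus for flipping a wrong first attempt to a right answer, and a penalty for the reverse. A sketch of that shaping, where `alpha` is an illustrative coefficient and not a value reported in the paper:

```python
def shaped_reward(first_correct: bool, second_correct: bool,
                  alpha: float = 0.5) -> float:
    """Reward for the second attempt with a self-correction bonus.

    alpha is an illustrative shaping coefficient (an assumption of this
    sketch, not a number from the paper).
    """
    reward = float(second_correct)          # base correctness reward
    if not first_correct and second_correct:
        reward += alpha                     # bonus: wrong -> right flip
    elif first_correct and not second_correct:
        reward -= alpha                     # penalty: right -> wrong flip
    return reward
```

The asymmetry matters: without the bonus term, a policy can score well by simply repeating a decent first answer, which is exactly the collapse mode the paper observes with naive training.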
Why it matters for practitioners
This addresses one of the biggest pain points in production LLM deployments: getting models to catch and fix their own errors reliably. Instead of building multi-model pipelines with separate verification steps, you can train a single model to revise its own outputs, cutting inference cost and latency while improving output quality.
What you can use today
- The training methodology applies to any base model — the paper spells out the two-stage recipe in enough detail to adapt it to your domain-specific tasks
- Apply the multi-round inference pattern: prompt your current models to review and revise their outputs, even without SCoRe training, for immediate quality improvements
- Use their evaluation framework to benchmark self-correction capabilities in your own models before and after fine-tuning
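Benchmarking self-correction along these lines reduces to three numbers per model: first-attempt accuracy, post-correction accuracy, and their difference. A minimal sketch, assuming per-example records with hypothetical boolean fields `first_correct` and `revised_correct`:

```python
def self_correction_metrics(results: list[dict]) -> dict:
    """Summarize two-turn accuracy from per-example records.

    Each record is assumed to carry boolean fields 'first_correct' and
    'revised_correct' (hypothetical names for this sketch).
    """
    n = len(results)
    acc_t1 = sum(r["first_correct"] for r in results) / n
    acc_t2 = sum(r["revised_correct"] for r in results) / n
    return {
        "accuracy@t1": acc_t1,
        "accuracy@t2": acc_t2,
        # Positive delta: correction helps. Negative delta: the second
        # pass degrades answers, and iterative refinement should be off.
        "delta": acc_t2 - acc_t1,
    }

metrics = self_correction_metrics([
    {"first_correct": True,  "revised_correct": True},
    {"first_correct": False, "revised_correct": True},
    {"first_correct": False, "revised_correct": False},
    {"first_correct": True,  "revised_correct": False},
])
```

Tracking the delta, not just final accuracy, is the actionable part: it tells you whether a second pass in production is earning its extra latency or quietly breaking correct answers.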