Key Points
1. The authors developed a multi-turn online reinforcement learning (RL) approach called SCoRe that substantially improves the self-correction ability of a large language model (LLM) using entirely self-generated data.
2. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision, which SCoRe avoids.
3. The authors found that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior: they either suffer from a distribution mismatch between the training data and the model's own responses, or implicitly prefer only a certain mode of correction behavior that is often ineffective at test time.
4. SCoRe addresses the challenges of distribution mismatch and mode collapse by training under the model's own distribution of self-generated correction traces and using appropriate regularization.
5. When applied to Gemini 1.0 Pro and 1.5 Flash models, SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
6. The authors' analysis shows that reinforcement learning plays an essential role in learning self-correction entirely from self-generated data, as opposed to using supervised fine-tuning approaches alone.
7. SCoRe trains the model in two stages: the first stage trains a model initialization that optimizes correction performance while constraining the first attempt to stay close to the base model, and the second stage runs multi-turn RL to optimize reward at both attempts (the two-attempt rollout format is sketched after this list).
8. The reward shaping in the second stage of SCoRe encourages improving responses from the first attempt to the second, preventing the model from simply learning to produce its best first-attempt response and then making only minor edits to it.
9. SCoRe is the first approach to attain significantly positive intrinsic self-correction, demonstrating the importance of the two-stage training process and the reward shaping technique.
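To make the two-attempt setup in points 7-9 concrete, here is a minimal, hypothetical sketch of how an on-policy self-correction trace might be collected. The `model.generate` interface, the `is_correct` checker, and the self-correction prompt text are illustrative assumptions, not names or wording from the paper.

```python
# Hypothetical sketch of a two-attempt self-correction rollout used to collect
# on-policy traces. `model.generate` and `is_correct` are placeholders, not the
# paper's actual interfaces.

SELF_CORRECT_PROMPT = (
    "There may be an error in the solution above. "
    "Please correct it and state your final answer."
)

def is_correct(answer: str, reference: str) -> bool:
    # Placeholder checker: exact string match stands in for the task-specific
    # correctness check (e.g., final-answer matching on MATH, tests on HumanEval).
    return answer.strip() == reference.strip()

def rollout(model, problem: str, reference: str):
    # Attempt 1: the model answers the problem directly.
    attempt1 = model.generate(problem)
    r1 = float(is_correct(attempt1, reference))

    # Attempt 2: the model sees its own first attempt plus a generic
    # self-correction instruction and produces a revised answer.
    context = f"{problem}\n\n{attempt1}\n\n{SELF_CORRECT_PROMPT}"
    attempt2 = model.generate(context)
    r2 = float(is_correct(attempt2, reference))

    # Both attempts and their per-attempt correctness rewards form one trace.
    return attempt1, attempt2, r1, r2
```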
Summary
The paper explores the challenge of developing self-correction capabilities in large language models (LLMs), which has proven to be a difficult problem. Existing approaches either require multiple models or rely on more capable models or other forms of supervision. To address this, the researchers developed a multi-turn online reinforcement learning (RL) approach called SCoRe that significantly improves an LLM's self-correction ability using only self-generated data.
The paper first shows that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling effective self-correction behavior. SFT either suffers from a distribution mismatch between the training data and the model's own responses, or it implicitly prefers a limited mode of correction behavior that is often not effective at test time.
SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process towards an effective self-correction strategy. SCoRe operates in two stages. In the first stage, it trains a model initialization that produces high-reward revisions at the second attempt, while constraining the first-attempt response distribution to remain close to that of the base model. This initialization makes the subsequent RL stage less prone to collapsing into simply repeating the first attempt at the second.
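As a rough illustration of the kind of objective described here, the sketch below combines a policy-gradient term for the second-attempt reward with a KL penalty that keeps the first attempt close to the base model. The function name, the single-sample KL estimate, and the coefficient value are assumptions made for illustration, not the paper's implementation.

```python
# Loose sketch of a stage-I surrogate loss for one sampled trace: reward the
# second attempt while a strong KL penalty discourages the first-attempt
# distribution from drifting away from the frozen base model.

def stage1_loss(logp_turn1, logp_turn1_base, logp_turn2, reward_turn2, beta=10.0):
    """
    logp_turn1      -- log-prob of the sampled first attempt under the current policy
    logp_turn1_base -- log-prob of the same first attempt under the frozen base model
    logp_turn2      -- log-prob of the sampled second attempt under the current policy
    reward_turn2    -- scalar correctness reward for the second attempt
    beta            -- strength of the first-turn KL constraint (kept large in stage I)
    """
    # Policy-gradient term: increase the probability of second attempts
    # that earn high reward.
    pg_term = -reward_turn2 * logp_turn2

    # Single-sample KL estimate penalizing first-attempt drift from the base model.
    kl_term = beta * (logp_turn1 - logp_turn1_base)

    return pg_term + kl_term
```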
In the second stage, SCoRe runs multi-turn RL, using a reward bonus that emphasizes traces where the correctness of the response flips from the first to the second attempt. This reward shaping regularizes the training process away from the "direct" solution of simply producing the best first-attempt response without meaningful self-correction.
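The sketch below illustrates the flavor of this reward shaping, assuming binary per-attempt correctness rewards: the second-attempt reward receives a bonus proportional to the change in correctness between attempts, so incorrect-to-correct flips are amplified and regressions are penalized. The function and the value of `alpha` are illustrative, not taken from the paper.

```python
# Hedged sketch of flip-emphasizing reward shaping on top of binary
# per-attempt rewards r1 and r2 (1.0 = correct, 0.0 = incorrect).

def shaped_rewards(r1: float, r2: float, alpha: float = 2.0):
    bonus = alpha * (r2 - r1)       # positive for a fix, negative for a regression
    return r1, r2 + bonus           # per-attempt rewards fed to the multi-turn RL step

# Example: a trace that flips from wrong to right gets an amplified
# second-attempt reward, while one that merely repeats a correct answer does not.
print(shaped_rewards(0.0, 1.0))  # (0.0, 3.0)
print(shaped_rewards(1.0, 1.0))  # (1.0, 1.0)
```

Under this shaping, a trace that simply repeats an already-correct first answer earns no more than the direct strategy would, which is what pushes learning toward genuine correction rather than collapse.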
When applied to Gemini 1.0 Pro and 1.5 Flash models, SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks. The paper's key contribution is demonstrating that multi-turn RL, with careful initialization and reward shaping, can significantly enhance the self-correction ability of LLMs using only self-generated data, without relying on external supervision or multiple models.
Reference: https://arxiv.org/abs/2409.12917