Key Points

1. The paper introduces Reinforced Fine-Tuning (ReFT), a method for enhancing the reasoning capability of Large Language Models (LLMs) in math problem-solving. It argues that standard fine-tuning learns from only a single annotated reasoning path per question, even though a question typically admits multiple valid reasoning paths.

2. ReFT begins with a warm-up stage of Supervised Fine-Tuning (SFT) and then continues training with an online Reinforcement Learning (RL) algorithm, specifically Proximal Policy Optimization (PPO): the model samples many reasoning paths for each question and is rewarded according to whether the final answer matches the ground truth, allowing it to learn from multiple correct paths (see the training-loop sketch after this list).

3. ReFT significantly outperforms SFT in both accuracy and generalization without relying on extra or augmented training questions, suggesting that richer supervision can be extracted from the same training data.

4. Extensive experiments on the GSM8K, MathQA, and SVAMP datasets demonstrate the improved performance and generalization ability of ReFT compared to SFT.

5. At inference time, ReFT benefits from both majority voting and reward-model reranking, which further improve its accuracy and show its compatibility with existing decoding techniques.

6. The paper reviews related work on CoT prompt design, data engineering, and reinforcement learning, and reports experiments with smaller model sizes as well as an ablation study of ReFT.

7. The authors conclude that ReFT generalizes robustly, extracts more learning signal from the same training data through reinforcement learning, and composes well with existing techniques, yielding stronger math problem-solving performance.

8. Future work outlined in the paper includes applying offline reinforcement learning techniques, developing a warm-up-free variant, incorporating process-based rewards into RL training, and extending ReFT to more general reasoning tasks formalized with CoT.

9. The paper also acknowledges limitations such as the larger number of training epochs ReFT needs to converge compared to SFT and its susceptibility to reward hacking, along with possible mitigations.
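
For concreteness, the two-stage recipe in point 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sft_step`, `sample_cot`, `extract_answer`, and `ppo_update` are hypothetical callables the caller would supply, and the epoch and sample counts are placeholders.

```python
def reft_train(policy, train_set, sft_step, sample_cot, extract_answer,
               ppo_update, warmup_epochs=2, rl_epochs=10, k_samples=4):
    """Sketch of ReFT: SFT warm-up followed by online RL with answer-based rewards."""
    # Stage 1: warm-up with supervised fine-tuning on the annotated CoT data,
    # so the policy already emits well-formed reasoning chains.
    for _ in range(warmup_epochs):
        for question, annotated_cot, _gold in train_set:
            sft_step(policy, question, annotated_cot)

    # Stage 2: online RL (PPO in the paper). The policy samples its own
    # reasoning paths; each path is rewarded by checking its final answer
    # against the ground truth, so many distinct correct paths can be learned.
    for _ in range(rl_epochs):
        for question, _annotated_cot, gold_answer in train_set:
            paths = [sample_cot(policy, question) for _ in range(k_samples)]
            rewards = [1.0 if extract_answer(p) == gold_answer else 0.0
                       for p in paths]
            ppo_update(policy, question, paths, rewards)
    return policy
```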


Summary

The paper introduces Reinforced Fine-Tuning (ReFT), a fine-tuning approach that uses reinforcement learning to improve the reasoning ability of Large Language Models (LLMs), with math problem-solving as the running example. ReFT first applies Supervised Fine-Tuning (SFT) as a warm-up and then continues fine-tuning with the Proximal Policy Optimization (PPO) algorithm, rewarding sampled reasoning paths whose final answers match the ground truth. Extensive experiments with the CodeLLAMA and Galactica foundation models on the GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT and benefits further from majority voting and reward-model reranking at inference time.
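
The reward that drives the PPO stage is derived from the ground-truth answer of each question. A minimal sketch of such a terminal reward is shown below; the "The answer is ..." extraction pattern and the plain 0/1 scoring are simplifying assumptions made for illustration, not the paper's exact rule.

```python
import re

def answer_reward(generated_cot: str, gold_answer: str) -> float:
    # Extract the final numeric answer from a generated chain of thought.
    # The "The answer is <number>" pattern is an assumed output format.
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", generated_cot)
    if match is None:
        return 0.0  # no parseable answer, no reward
    # Reward 1 when the extracted answer matches the ground truth, else 0.
    return 1.0 if match.group(1) == gold_answer.strip() else 0.0

# Hypothetical example: a sampled solution ending with the expected pattern.
print(answer_reward("3 boxes of 4 pens make 3 * 4 = 12. The answer is 12", "12"))  # 1.0
```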

Enhanced Generalization of ReFT
The findings indicate that ReFT generalizes better than SFT when trained on the same dataset, improving accuracy across both natural-language and program-based Chain-of-Thought (CoT) annotations. ReFT also benefits from majority voting and reward-model reranking at inference time, which further improve its results. However, the study highlights reward hacking: when the space of final answers is limited (for example, multiple-choice options), the policy can earn rewards for reasoning paths that land on the correct option without being correct.
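
As a rough illustration of the two inference-time strategies mentioned above, the sketch below implements majority voting over the final answers of several sampled solutions and reward-model reranking via a user-supplied scoring function; the helper names and example data are hypothetical.

```python
from collections import Counter

def majority_vote(sampled_answers):
    # Return the most frequent final answer among the sampled reasoning paths.
    return Counter(sampled_answers).most_common(1)[0][0]

def rerank_by_reward(sampled_solutions, reward_model_score):
    # Return the sampled solution that the (assumed) reward model scores highest.
    return max(sampled_solutions, key=reward_model_score)

# Hypothetical example: five sampled answers, three of which agree.
print(majority_vote(["12", "12", "8", "12", "9"]))  # -> 12
```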

Experimental Performance of ReFT
The experiments show that ReFT consistently outperforms SFT and self-training baselines, underscoring its robust generalization and its ability to extract more from the same training data. ReFT-tuned models also compare favorably with several publicly available open-source models of similar size on math problem-solving. The paper closes by outlining future work to improve training efficiency, mitigate reward hacking, and apply the approach to more general reasoning tasks.

Reference: https://arxiv.org/abs/2401.08967