Key Points

1. Reinforcement Learning from Human Feedback (RLHF) has become a key technique for integrating human preference signals into machine learning, particularly for aligning Large Language Models (LLMs) with human values and preferences, and has attracted significant interest across diverse research communities.


2. The paper outlines the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) and aims to fill a gap in open-source RLHF projects, which are still largely confined to the offline learning setting.


3. The paper introduces the concept of Reward Modeling as Human Feedback Approximation and presents the Bradley-Terry Reward Model and the Preference Model used in the study.


4. It reviews previous RLHF algorithms, categorizing them into the deep-RL-based approach built on Proximal Policy Optimization (PPO) and (offline) direct preference learning approaches such as DPO (see the objective sketched after this list).


5. The paper discusses the challenges of previous RLHF methods, such as the extensive effort required for hyper-parameter selection and code-level optimization, and the limitations of direct preference learning algorithms, including over-optimization and the unavailability of a preference oracle during training.


6. It outlines the theoretical insights and algorithmic principles behind Online Iterative RLHF, emphasizing hybrid batch learning and a non-symmetric structure that balances exploitation and exploration.


7. The paper presents the main algorithm for Online Iterative RLHF, covering the MLE policy, the exploration policy, the prompt set, and data generation, and discusses the related theoretical insights and implementation details.


8. It evaluates the resulting models on standard benchmarks, including AlpacaEval-2, MT-Bench, and Chat-Arena-Hard, as well as academic benchmarks, to measure conversational ability along with reasoning and calibration.


9. The paper discusses the impact of the reward model, the length penalty, and iterative RLHF on the model's response length, academic task performance, and verbosity bias.
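
For reference, the (offline) direct preference learning objective mentioned in point 4 can be written in its standard DPO form. This is a textbook statement of the loss, not a formula reproduced from the report; the report's iterative recipe reapplies it to freshly collected preference pairs at each round:

    \mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
      \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
           - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

Here y_w and y_l are the preferred and rejected responses to prompt x, \pi_ref is the frozen reference (SFT) policy, \beta sets the strength of the implicit KL regularization, and \sigma is the logistic function.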

Summary

Technical Workflow and Benchmarks

The technical report presents the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF). The paper addresses the current limitation of open-source RLHF projects being largely confined to the offline learning setting and provides a detailed recipe for easy replication of online iterative RLHF. The workflow involves constructing preference models from open-source datasets and using a proxy preference model to approximate human feedback. The theoretical insights and algorithmic principles behind online iterative RLHF are discussed, followed by a detailed practical implementation. The trained model, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks and academic benchmarks. The paper also makes the models, curated datasets, and comprehensive code guidebooks publicly available.
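
To make the loop concrete, the following is a minimal sketch of how one round-trip of online iterative RLHF (iterative DPO) can be organized. All callables passed in (generate, preference_model, dpo_update) are illustrative placeholders rather than the authors' released code, and the preference signal is shown as a scalar scorer for simplicity even though the report's preference model compares response pairs directly.

    # Minimal sketch of online iterative RLHF (iterative DPO).
    # The callables below are hypothetical placeholders, not the authors' API.
    def online_iterative_rlhf(policy, ref_policy, preference_model, prompt_set,
                              generate, dpo_update, num_iterations=3, n_samples=8):
        """Alternate between on-policy data collection and DPO updates."""
        for _ in range(num_iterations):
            batch = []
            for prompt in prompt_set:
                # Exploration: sample several candidate responses from the
                # current policy (e.g., at different temperatures).
                candidates = [generate(policy, prompt) for _ in range(n_samples)]
                # The proxy preference model stands in for the human annotator:
                # the best and worst candidates form a (chosen, rejected) pair.
                ranked = sorted(candidates, key=lambda y: preference_model(prompt, y))
                batch.append((prompt, ranked[-1], ranked[0]))
            # Exploitation: update the policy on the fresh pairs with the DPO
            # objective against the frozen reference policy.
            policy = dpo_update(policy, ref_policy, batch)
        return policy

The essential design choice is that preference data is collected from the current policy at every iteration, so the training distribution tracks the model being trained rather than a fixed offline dataset.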

Preference Model Construction and Implementation

To construct the preference signal, both a Bradley-Terry (BT) reward model and a preference model are trained. Evaluation results show that the preference model outperforms the BT model on reasoning tasks related to coding and math. Practical insights and implementation details for online iterative RLHF are also provided. The algorithmic framework for online iterative RLHF is presented, along with hybrid batch learning, a non-symmetric structure to balance exploitation and exploration, and a strategic exploration method. The study demonstrates the advantages of online iterative RLHF and provides a detailed guideline for replicating the results.
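
For context, the Bradley-Terry model assumes that pairwise preferences arise from an underlying scalar reward; a standard statement of the model (not quoted verbatim from the report) is:

    P(y_1 \succ y_2 \mid x) = \sigma\big(r(x, y_1) - r(x, y_2)\big)
                            = \frac{\exp r(x, y_1)}{\exp r(x, y_1) + \exp r(x, y_2)}

where r is the learned reward, \sigma is the logistic function, and r is fit by maximizing the likelihood of observed preference pairs. A preference model, by contrast, predicts P(y_1 \succ y_2 \mid x) directly from the triple (prompt, first response, second response) without assuming this scalar-reward structure, which helps explain why the two can behave differently on reasoning-heavy comparisons.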

The performance of the resulting model, SFR-Iterative-DPO-LLaMA-3-8B-R, is evaluated on benchmarks such as AlpacaEval-2, MT-Bench, and Chat-Arena-Hard, as well as academic benchmarks. The model outperforms previous open-source models on conversation and instruction-following benchmarks. Moreover, its performance on academic benchmarks is comparable to that of the supervised fine-tuning (SFT) checkpoint, and it even surpasses the SFT model in some cases. The paper also includes an ablation study on the impact of reward models and the length penalty in online iterative RLHF, which shows that iterative RLHF relies heavily on the quality of the preference signal and requires careful handling of verbosity bias in reward modeling.
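
The length penalty examined in the ablation is typically implemented by subtracting a term proportional to response length from the proxy reward when labeling preference pairs; a common form, with the exact coefficient and length measure in the report possibly differing, is:

    \tilde{r}(x, y) = r(x, y) - \lambda |y|

where |y| is the response length and \lambda > 0 controls how strongly verbosity is discouraged.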

Future Directions and Conclusion

The authors conclude by highlighting future directions, including designing more effective strategies for modeling different types of preference signals and developing more efficient exploration methods. The study aims to advance the direction of online iterative RLHF and contribute to the training of stronger and larger open-source language models.


Reference: https://arxiv.org/abs/2405.07863v1