Key Points
1. Current approaches typically train a reward model from human preference data, which can be bottlenecked both by the level of human performance and by the fact that the frozen reward model cannot improve once LLM training begins.
2. The study introduces Self-Rewarding Language Models, in which the language model provides its own rewards during training via LLM-as-a-Judge prompting. This aims to avoid the bottleneck of fixed human preference data, since the reward signal improves as the model itself improves rather than remaining frozen.
3. The approach relies on two skills in a single model: instruction following and self-instruction creation (generating and evaluating new instruction-following examples), which together enable self-alignment through AI feedback (AIF).
4. Training is iterative: during self-instruction creation, the model generates new prompts, produces candidate responses, and scores them itself; the resulting preference pairs form AI Feedback Training (AIFT) data that augments the seed data for the next round of training (see the sketch after this list).
5. The article compares successive training iterations, showing that both instruction following and reward modeling ability improve when each iteration trains on AIFT data generated by the previous iteration's model.
6. The models are evaluated in head-to-head performance comparisons and on the AlpacaEval 2.0 leaderboard, where they achieve higher win rates than existing models.
7. The study also discusses the importance of the LLM-as-a-Judge prompt format and the role of AI Feedback Training in improving the model's ability to assign itself rewards in subsequent iterations.
8. The experimental results suggest promising potential for continual improvement of language models beyond human preferences using the self-rewarding training procedure.
9. The article acknowledges the preliminary nature of the results and suggests avenues for further research, including safety evaluations, understanding the limits of iterative training, and the scalability of the approach across different language models and settings.
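To make the iterative loop concrete, here is a minimal Python sketch of how one iteration's preference data could be assembled: several candidate responses are sampled for each self-generated prompt, the model scores them itself, and the highest- and lowest-scoring responses form a preference pair. The `generate` and `judge` callables, the candidate count, and the tie-handling rule are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, List, Tuple

def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],   # assumed interface: prompt, n -> n candidate responses from the current model
    judge: Callable[[str, str], float],          # assumed interface: prompt, response -> self-assigned score (e.g. 0-5)
    n_candidates: int = 4,
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples to be used as AIFT preference data."""
    pairs = []
    for prompt in prompts:
        candidates = generate(prompt, n_candidates)            # sample N candidate responses
        scored = [(judge(prompt, r), r) for r in candidates]   # self-reward each via LLM-as-a-Judge
        scored.sort(key=lambda s: s[0])                        # ascending by score
        worst_score, worst = scored[0]
        best_score, best = scored[-1]
        if best_score > worst_score:                           # skip ties: no usable preference signal
            pairs.append((prompt, best, worst))
    return pairs
```

The pairs produced this way would then be fed to a preference-optimization step (e.g. DPO) to obtain the next iteration's model.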
Summary
The research paper introduces Self-Rewarding Language Models (SRLMs), a novel approach to training large language models (LLMs). The key idea is to build a single model that possesses both instruction following and reward modeling abilities, rather than separating them into distinct models. The model is trained with an Iterative Direct Preference Optimization (DPO) framework, and the experiments show that both instruction following performance and reward modeling ability improve compared to baseline models, suggesting the potential for obtaining superior LLMs and reward models.
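For reference, each iteration optimizes the standard DPO objective (Rafailov et al., 2023) on the self-generated preference pairs; this is the usual DPO loss rather than a formula unique to this work, with π_θ the model being trained, π_ref the previous iteration's model used as the reference, and (x, y_w, y_l) a prompt with its chosen and rejected responses:

```latex
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
  \right)
\right]
```

In the iterative setting, both the reference model and the preference pairs come from the previous iteration, so the reward signal and the policy improve together.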
Iterative Training of SRLMs
The paper describes how SRLMs are trained iteratively: the model creates its own preference-based instruction training data by assigning rewards to its own generations via LLM-as-a-Judge prompting, and then trains on those preferences with Iterative Direct Preference Optimization. The resulting models outperform many existing systems on the AlpacaEval 2.0 leaderboard, indicating the effectiveness of the proposed approach.
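As a hedged illustration of the self-reward step, the sketch below paraphrases the paper's additive 5-point LLM-as-a-Judge prompt and parses a numeric score from the judgement; the exact prompt wording, the `complete` callable, and the score-parsing rule are assumptions for illustration rather than the paper's verbatim setup.

```python
import re
from typing import Callable, Optional

# Condensed paraphrase of an additive 5-point judging prompt (not the paper's exact text).
JUDGE_TEMPLATE = """Review the user's question and the corresponding response using an
additive 5-point scoring system, awarding one point for each criterion met
(relevance, substantial coverage, usefulness, clarity and organization, expert quality).

User: {instruction}
Response: {response}

Briefly justify your total score, then conclude with the format: "Score: <total points>"."""

SCORE_PATTERN = re.compile(r"Score:\s*([0-5])")

def self_reward(instruction: str, response: str, complete: Callable[[str], str]) -> Optional[float]:
    """Ask the current model to judge its own response; return a 0-5 reward, or None if unparseable."""
    judgement = complete(JUDGE_TEMPLATE.format(instruction=instruction, response=response))
    match = SCORE_PATTERN.search(judgement)
    return float(match.group(1)) if match else None   # discard judgements with no parseable score
```

In the full loop, `self_reward` would play the role of the `judge` callable in the earlier preference-pair sketch.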
Related Works and Techniques
The paper situates the work among related approaches such as Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning from AI Feedback (RLAIF), and methods that improve LLMs via data augmentation. It also discusses the importance of the LLM-as-a-Judge prompt and of the prompt used to evaluate the model on Evaluation Fine-Tuning (EFT) data.
Future Research and Implications
The authors acknowledge that their results are preliminary and highlight the need for further evaluation, including safety assessment and a better understanding of the scaling behavior of the iterative training approach. They also note that the work opens a new avenue of study for continually improving language models beyond human preferences, potentially yielding models with superior reward modeling and instruction following capabilities.
Reference: https://arxiv.org/abs/2401.10020