Key Points

1. Large-scale unsupervised language models (LMs) acquire broad knowledge and surprising capabilities, but achieving precise control over their behavior is difficult.

2. Existing methods typically steer LMs toward human preferences with reinforcement learning from human feedback (RLHF), a multi-stage pipeline that can be complex and unstable.

3. DPO introduces a new parameterization of the reward model used in RLHF that makes the corresponding optimal policy available in closed form, so the standard RLHF objective can be optimized with a simple binary cross-entropy loss (written out after this list).

4. DPO builds on a standard theoretical preference model (Bradley-Terry) to optimize a policy directly on human preferences over model responses, without explicit reward modeling or reinforcement learning.

5. DPO can fine-tune LMs to align with human preferences as well as or better than existing methods, including PPO-based RLHF, on tasks such as sentiment modulation, summarization, and dialogue.

6. DPO outperforms PPO-based RLHF at controlling the sentiment of generations and matches or improves response quality in summarization and single-turn dialogue.
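
For reference, the binary cross-entropy objective mentioned in points 3 and 4 takes the following form in the paper, where π_θ is the policy being trained, π_ref is the frozen reference policy, β is a scaling coefficient, and (x, y_w, y_l) is a prompt paired with its preferred and dispreferred responses:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Minimizing this loss increases the relative likelihood of preferred responses over dispreferred ones, while the log-ratios against the reference policy keep the model from drifting too far from its starting point.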

Summary

The paper introduces Direct Preference Optimization (DPO), a method for fine-tuning large unsupervised language models (LMs) to align with human preferences without reinforcement learning. Because these models are trained on diverse human-generated data in a completely unsupervised way, exerting precise control over their behavior is difficult, and existing methods for doing so can be complex and unstable. DPO optimizes a language model directly to adhere to human preferences, offering a stable, performant, and computationally lightweight alternative. The paper demonstrates that DPO can fine-tune LMs as effectively as or better than existing methods on tasks such as sentiment modulation, summarization, and single-turn dialogue.

The paper shows that DPO sidesteps these issues by directly optimizing the policy to satisfy preferences with a simple binary cross-entropy objective, eliminating the need to sample from the LM during fine-tuning or to perform significant hyperparameter tuning.
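
To make that objective concrete, here is a minimal PyTorch-style sketch of the DPO loss. It assumes the sequence-level log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under both the trainable policy and the frozen reference model have already been computed; the tensor names and the `beta` default are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Binary cross-entropy form of the DPO objective.

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability log pi(y | x) of the chosen or rejected response
    under the trainable policy or the frozen reference model.
    """
    # Log-ratios between the policy and the reference model.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Logistic loss on the scaled margin between chosen and rejected.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

In practice the log-probabilities would be obtained by summing per-token log-softmax scores over the response tokens of each (prompt, chosen, rejected) triple; only the policy receives gradients, while the reference model stays frozen.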

The paper provides insights into the effectiveness of DPO compared to existing methods for training language models from preferences. Additionally, it discusses the challenges of applying reinforcement learning algorithms on a large scale and outlines potential areas for future research and application of the DPO approach.

The experiments demonstrate DPO's effectiveness on sentiment modulation, summarization, and dialogue using language models with up to 6 billion parameters, all without explicit reward modeling or reinforcement learning.

The paper provides detailed experimental results comparing DPO with other methods. It also includes derivations of the DPO objective under different preference models, implementation details of DPO, and further details of the experimental setup. The authors additionally present example comparisons between DPO and the baselines, and discuss the performance of the Best-of-N baseline for various values of N.
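
The Best-of-N baseline samples N completions for each prompt and returns the one scored highest by a reward model learned from the preference data. A minimal sketch, assuming hypothetical `sample` and `reward` callables that stand in for model-specific code (they are not from the paper):

```python
from typing import Callable, List

def best_of_n(prompt: str,
              sample: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 16) -> str:
    """Return the highest-reward completion out of n samples for a prompt.

    `sample` draws one completion from the base (e.g. supervised fine-tuned)
    model and `reward` scores a (prompt, completion) pair with a learned
    reward model; both are placeholders for illustration.
    """
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))
```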

Moreover, the paper discusses the use of GPT-4 to compute win rates and reports a human study that collected human preference data for several matchups in the TL;DR summarization setting to validate the GPT-4 judgments.

Reference: https://arxiv.org/abs/2305.18290