Key Points
1. The research paper addresses length bias in the evaluation of instruction-following models and proposes using length instructions to control models' response length. Both human and model judges tend to prefer longer responses over shorter ones, and training methods often produce longer responses because they exploit this bias.
2. The paper introduces the idea of training models that can be controlled at inference time with instructions containing desired length constraints. Such models outperform standard instruction-following models such as GPT-4, Llama 3, and Mixtral in length-instructed evaluations.
3. It argues that the expected length of a response is ill-defined for many queries, which complicates evaluation and affects training algorithms. To address this, the authors propose adding disambiguating instructions that prescribe the desired response length.
4. The study shows that existing state-of-the-art instruction-following models fail to adequately follow maximum word length instructions. It evaluates models on length-instructed versions of AlpacaEval 2 and MT-Bench, built by augmenting existing prompts with length instructions (see the sketch after this list), and finds that models frequently violate the length constraints, exposing a significant flaw.
5. The paper introduces Length-Instruction Fine-Tuning (LIFT), which constructs augmented training data by inserting length instructions into the original prompts and trains models to follow them using Direct Preference Optimization (DPO).
6. The research demonstrates that LIFT-DPO models show improved ability to control output length while maintaining high response quality, as evidenced by significantly lower violation rates than models trained with standard methods.
7. The paper proposes two metrics for evaluating responses under length instructions: the violation rate (Vlt%), the percentage of responses that exceed the length constraint, and win rates from pairwise comparisons between model and baseline generations on length-instructed prompts.
8. The study also investigates the robustness of length-controlled (LC) AlpacaEval and finds that the LC win rate can be manipulated by adjusting the length instructions, indicating that existing evaluation methods are potentially gameable.
9. The paper concludes by discussing potential future research directions, such as generalizing length instructions in terms of different measures, addressing other kinds of length instructions, and exploring human desired response lengths across different instructions to enhance the alignment of models with human expectations.
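To make the evaluation setup concrete, the sketch below shows one way a prompt could be augmented with a maximum-word-length instruction and how a violation rate (Vlt%) could be computed. This is a minimal illustration only: the instruction template, helper names, and word-counting rule are assumptions, not the paper's exact prompts or code.

```python
# Minimal sketch (illustrative, not the paper's exact templates or tooling):
# augment a prompt with a maximum-word-length instruction and measure the
# violation rate (Vlt%) over a set of model responses.

def add_length_instruction(prompt: str, max_words: int) -> str:
    """Prepend an illustrative length instruction to an existing prompt."""
    return f"Answer the following question in at most {max_words} words.\n\n{prompt}"

def word_count(text: str) -> int:
    """Approximate word count by whitespace splitting."""
    return len(text.split())

def violation_rate(responses: list[str], limits: list[int]) -> float:
    """Percentage of responses that exceed their per-prompt word limit."""
    violations = sum(word_count(r) > limit for r, limit in zip(responses, limits))
    return 100.0 * violations / len(responses)

if __name__ == "__main__":
    prompts = ["What causes seasons on Earth?"]
    limits = [50]
    augmented = [add_length_instruction(p, n) for p, n in zip(prompts, limits)]
    # In practice, `responses` would come from the model under evaluation.
    responses = ["Seasons arise because Earth's axis is tilted relative to its orbit."]
    print(augmented[0])
    print(f"Vlt% = {violation_rate(responses, limits):.1f}")
```

The win-rate metric would then be computed separately, by having a judge compare each model response against a baseline response under the same length-instructed prompt.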
Summary
The research paper investigates training models to follow length constraints given at inference time. It highlights how training algorithms exploit length bias and compares length-instructed models with standard instruction-following models in their ability to fulfill user requests. The paper argues that the expected length of a response is often hard to define and examines how this ambiguity affects both evaluation and training.
The authors propose training models that can be controlled at inference time with instructions containing desired length constraints. They show that existing state-of-the-art instruction-following models fail to adequately follow maximum word length instructions and introduce Length-Instruction Fine-Tuning (LIFT) to improve models' ability to follow length instructions. LIFT takes a conventional instruction-following dataset and constructs augmented training data by inserting length instructions into the original prompts. The paper evaluates models' length-instruction-following ability and finds that the proposed LIFT-DPO training strategy significantly reduces violation rates and improves win rates compared to standard instruction-following models. Moreover, the paper reports on the robustness of the LIFT-DPO models and their performance on the AlpacaEval-LI and MT-Bench-LI benchmarks.
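The sketch below illustrates how length-augmented preference pairs could be assembled for DPO training from an existing pair of responses: the response that satisfies the word limit is ranked above the one that violates it. This is a hypothetical construction consistent with the summary above; the paper's exact augmentation procedure, prompt templates, and data fields may differ, and the (prompt, chosen, rejected) layout simply follows common DPO dataset conventions.

```python
# Hypothetical sketch of LIFT-style data augmentation for DPO training:
# given two candidate responses to a prompt, insert a length instruction and
# prefer the response that satisfies the limit over the one that violates it.
# The exact construction used in the paper may differ.
from typing import Optional

def word_count(text: str) -> int:
    return len(text.split())

def make_length_instructed_pair(prompt: str, response_a: str, response_b: str,
                                max_words: int) -> Optional[dict]:
    """Build a (prompt, chosen, rejected) triple for DPO if exactly one
    response satisfies the word limit; otherwise return None."""
    instructed_prompt = (
        f"Answer the following question in at most {max_words} words.\n\n{prompt}"
    )
    a_ok = word_count(response_a) <= max_words
    b_ok = word_count(response_b) <= max_words
    if a_ok == b_ok:  # both satisfy or both violate the limit: skip this pair
        return None
    chosen, rejected = (response_a, response_b) if a_ok else (response_b, response_a)
    return {"prompt": instructed_prompt, "chosen": chosen, "rejected": rejected}
```

Pairs built this way could then be passed to a standard DPO trainer, optionally mixed with the original unaugmented preference data so the model learns to follow length instructions without losing general instruction-following ability.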
The paper addresses the challenge of length bias in general instruction following and provides a way to compare models without length bias, since the evaluation cannot be gamed by simply increasing response length. The proposed method gives users more control in real-world use cases and could potentially extend to other kinds of length instructions, including settings where longer responses are desirable because more computation is allowed. The authors suggest several directions for future research, including studying the response lengths humans actually desire across different instructions, which could further align models with human expectations.
Reference: https://arxiv.org/abs/2406.177...