Key Points
1. Large Language Models (LLMs) are typically trained to answer user questions or follow instructions the way human experts would respond, but they lack the ability to think explicitly before answering.
2. Thinking matters most for complex questions that require reasoning and planning, but it can benefit any task.
3. The paper proposes a training method called Thought Preference Optimization (TPO) for equipping existing LLMs with thinking abilities for general instruction following without using additional human data.
4. TPO uses an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision.
5. For each instruction, the model samples multiple thought-and-response candidates; a judge model scores only the responses, and the best and worst candidates form preference pairs used for preference optimization (see the sketch after this list).
6. TPO outperforms the direct-response baseline and shows gains from thinking on non-reasoning categories such as marketing, health, and general knowledge, in addition to more traditional reasoning and problem-solving tasks.
7. TPO does not require labeled thought data or a specialized judge model capable of evaluating thoughts, but instead leverages a standard judge model that only evaluates responses.
8. Experiments on the AlpacaEval and Arena-Hard benchmarks show that TPO achieves win rates of 52.5% and 37.3%, respectively, outperforming its direct-response LLM counterpart.
9. The paper argues that "thinking" should have broad utility beyond just math and logic tasks, and the proposed TPO method opens up new opportunities to develop Thinking LLMs for general instruction following.
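To make the procedure in points 4-6 concrete, the sketch below outlines one TPO iteration in Python. The helpers sample_thought_response, judge_score, and dpo_update are hypothetical placeholders for the generation, judging, and preference-optimization steps; this is an illustration of the described procedure, not the authors' code.

```python
# Minimal sketch of one TPO iteration, as described in points 4-6 above.
# sample_thought_response, judge_score, and dpo_update are hypothetical helpers
# standing in for generation, judging, and preference optimization.

def tpo_iteration(model, judge, instructions, k=8):
    """Sample thought+response candidates, judge only the responses,
    and build chosen/rejected pairs for preference optimization."""
    preference_pairs = []
    for instruction in instructions:
        # Sample k candidates; each contains a thought part and a response part.
        candidates = [sample_thought_response(model, instruction) for _ in range(k)]

        # The judge scores the response part only; it never sees the thought.
        scored = sorted(
            candidates,
            key=lambda c: judge_score(judge, instruction, c["response"]),
            reverse=True,
        )

        # Highest- and lowest-scoring full outputs (thought + response) become
        # the chosen/rejected pair, so thoughts are optimized only indirectly
        # through the quality of the responses they lead to.
        preference_pairs.append(
            (instruction, scored[0]["full_text"], scored[-1]["full_text"])
        )

    # One round of preference optimization (e.g., DPO); TPO repeats this iteratively.
    return dpo_update(model, preference_pairs)
```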
Summary
The paper proposes a training method that teaches Large Language Models (LLMs) to think explicitly before responding to user instructions. The method, called Thought Preference Optimization (TPO), has the model generate its thoughts as ordinary text and improves them through iterative preference-based training. Because no direct supervision over thoughts is available, the approach explores the space of possible thought generations and optimizes the thoughts so that the resulting responses improve. The study focuses on general instruction following rather than only math or logic tasks, arguing that thinking has broad utility, for example, understanding the user instruction better or planning the overall structure and characters of a creative-writing response. The paper highlights that training LLMs to think is difficult because no supervised thought data exists and collecting human thought data is costly. To address this, the method uses an iterative training process that combines reinforcement learning from AI feedback with preference optimization, enabling LLMs to learn to think on their own.
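The preference-optimization step itself can be instantiated with a DPO-style objective over the chosen/rejected (thought + response) pairs. The PyTorch sketch below shows that standard loss, assuming summed per-sequence log-probabilities under the policy and a frozen reference model have already been computed; it illustrates the general technique rather than the paper's exact training setup.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss on one batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities for the full
    chosen/rejected outputs (thought + response) under either the trained
    policy or the frozen reference model.
    """
    # Implicit rewards: log-ratio of policy to reference, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)

    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```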
Evaluation
The study evaluates TPO on benchmarks including AlpacaEval and Arena-Hard, demonstrating that it outperforms the direct-response baseline. The evaluation also shows that TPO improves performance not only on reasoning and problem-solving tasks but also on non-reasoning categories such as general knowledge, marketing, and health.
Thought Generation Approach
The paper also discusses how thoughts are generated from Thinking LLMs, emphasizing that thought generation should be simple and compatible with existing LLM infrastructure. It further describes how thoughts are improved through preference optimization and how practical issues such as response-length control and parsing errors are handled.
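As an illustration of what a simple, infrastructure-compatible setup can look like, the sketch below shows a hypothetical thought prompt and a parser that splits a generation into a thought part and a user-visible response part, with a fallback for parsing errors. The marker strings and exact wording are assumptions for illustration, not the prompt used in the paper.

```python
# Hypothetical thought prompt; the wording used in the paper differs.
THOUGHT_PROMPT = (
    "Respond to the user instruction below. First write out your internal "
    "thoughts (drafting, planning, self-checks) after the line 'Thought:'. "
    "Then write the final answer after the line 'Response:'. Only the text "
    "after 'Response:' will be shown to the user.\n\nInstruction: {instruction}"
)

def split_thought_and_response(generation: str) -> tuple[str, str]:
    """Split a generation into (thought, response).

    If the 'Response:' marker is missing (a parsing error), treat the whole
    generation as the response so the example is not silently dropped.
    """
    marker = "Response:"
    if marker in generation:
        thought, _, response = generation.partition(marker)
        return thought.replace("Thought:", "", 1).strip(), response.strip()
    return "", generation.strip()
```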
Experimental Analysis
In addition, the study provides a detailed analysis of the experimental results, including comparisons with baseline models, fine-grained evaluations across categories, and breakdowns of the training instruction data. It also presents examples of reasoning and non-reasoning instructions and assesses the effectiveness of TPO on each.
Overall Effectiveness
Overall, the paper introduces TPO, a training method for teaching LLMs to think, and provides experimental evidence of its effectiveness across a wide variety of tasks and categories. The findings suggest that the explicit thinking learned through TPO improves performance on both reasoning and non-reasoning tasks.
Reference: https://arxiv.org/abs/2410.10630