Key Points

1. Large language models (LLMs) can spend extra compute during inference to generate intermediate thoughts, which helps them produce better final responses. This is referred to as "System 2" reasoning.

2. Many System 2 techniques have been proposed, such as Chain-of-Thought, Rephrase and Respond, System 2 Attention, and Branch-Solve-Merge. These methods can improve accuracy, but do so at a higher inference cost.

3. The authors investigate self-supervised methods to "distill" the higher-quality outputs of System 2 techniques back into the base LLM (the "System 1" model), without the need to generate intermediate reasoning steps at inference time.

4. The distillation process involves collecting filtered training examples by running System 2 approaches on unlabeled data, then fine-tuning the base LLM to reproduce the higher-quality outputs directly.

5. Experiments are conducted across four different System 2 LLM approaches and five different tasks, showing that System 2 reasoning can often be successfully distilled into System 1.

6. The distilled System 1 models can sometimes even outperform the original System 2 models, at a much lower inference cost.

7. However, the authors also show that not all tasks can be effectively distilled into System 1, particularly complex math reasoning tasks requiring chain-of-thought.

8. The concept of distilling System 2 into System 1 is analogous to how humans learn to transfer skills from deliberate (System 2) to automatic (System 1) processing.

9. The authors posit that System 2 distillation will be an important feature of future continually learning AI systems, enabling them to focus System 2 capabilities on the reasoning tasks they cannot yet do well.

Summary

This paper investigates methods to "distill" the higher-quality reasoning capabilities of more computationally intensive "System 2" approaches into the base "System 1" language model.

System 2 methods like Chain-of-Thought, Rephrase and Respond, and Branch-Solve-Merge generate intermediate reasoning steps during inference to produce better final outputs, but at a much higher computational cost. The goal of this work is to develop self-supervised techniques to compress this System 2 reasoning back into the base System 1 language model, without needing to generate the intermediate steps.
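
To make the cost contrast concrete, the following is a minimal Python sketch (not the authors' code) of one-pass System 1 inference versus a two-stage Rephrase-and-Respond pipeline; the `generate` function stands in for whatever LLM completion API is available, so its name and signature are assumptions.

```python
# Minimal sketch (not the paper's code) contrasting one-pass System 1 inference with a
# two-stage Rephrase-and-Respond pipeline. `generate` stands in for whatever LLM
# completion function is available; its name and signature are assumptions.
from typing import Callable


def system1_answer(generate: Callable[[str], str], question: str) -> str:
    # Single forward pass: ask for the final answer directly.
    return generate(f"Question: {question}\nAnswer concisely:")


def system2_rephrase_and_respond(generate: Callable[[str], str], question: str) -> str:
    # Stage 1: spend extra inference compute to clarify and expand the question.
    rephrased = generate(
        f"Rephrase and expand the following question to make it clearer:\n{question}"
    )
    # Stage 2: answer conditioned on both the original and the rephrased question,
    # roughly doubling the number of LLM calls per query.
    return generate(
        f"Original question: {question}\n"
        f"Rephrased question: {rephrased}\n"
        "Answer concisely:"
    )
```

System 2 Attention and Branch-Solve-Merge follow the same general pattern of issuing extra intermediate LLM calls before producing the final answer, which is where their additional inference cost comes from.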

This allows the model to benefit from the improved outputs of System 2 while maintaining the efficiency of System 1. The authors experiment with distilling four different System 2 approaches - Rephrase and Respond, System 2 Attention, Branch-Solve-Merge, and Chain-of-Thought - across a variety of tasks.

They first run the System 2 approaches on unlabeled data to generate higher-quality outputs, and then use an unsupervised curation step to select the most consistent and reliable of these generated targets. They then fine-tune the base System 1 model to match these distilled targets. The results show that this approach is successful in many cases, with the distilled System 1 model matching or even outperforming the original System 2 approaches while requiring far less compute at inference time.
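
The curation and fine-tuning recipe can be sketched as follows. This is a schematic reconstruction under assumptions (majority voting over k sampled outputs, a hypothetical `system2` callable such as the Rephrase-and-Respond pipeline above), not the authors' released code; the resulting prompt/completion pairs would then be used for standard supervised fine-tuning of the base model.

```python
# Schematic reconstruction of the unsupervised curation step under assumptions:
# majority voting over k sampled System 2 outputs. Parameter names and the agreement
# threshold are illustrative, not the paper's exact recipe.
from collections import Counter
from typing import Callable, Optional


def distill_target(
    system2: Callable[[str], str],
    question: str,
    k: int = 8,               # number of sampled System 2 outputs per unlabeled input
    threshold: float = 0.75,  # required agreement for the example to be kept
) -> Optional[str]:
    """Keep a System 2 answer as a training target only if it is self-consistent."""
    answers = [system2(question) for _ in range(k)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / k >= threshold else None  # discard inconsistent examples


def build_finetuning_set(system2, unlabeled_questions):
    # Targets contain only the final answer: the intermediate System 2 reasoning is
    # discarded, so the fine-tuned System 1 model learns to answer directly.
    dataset = []
    for question in unlabeled_questions:
        target = distill_target(system2, question)
        if target is not None:
            dataset.append({"prompt": question, "completion": target})
    return dataset
```

Because the consistency filtering requires no labels, the same loop can in principle be run over any pool of unlabeled prompts before standard supervised fine-tuning.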

For tasks like last letter concatenation (e.g., mapping "Barack Obama" to "ka") and coin flip reasoning, the distilled model achieves near-perfect performance.

However, the authors also find that not all tasks can be effectively distilled, particularly more complex math reasoning problems that require the full chain-of-thought process.

The authors posit that this type of System 2 distillation will be an important technique for future AI systems, allowing them to focus their more intensive reasoning capabilities on the specific tasks they struggle with, while compiling that reasoning into their base model for efficient inference. This mirrors the way humans develop automaticity through practice and consolidation of skills from System 2 to System 1 reasoning.

Overall, this work presents a promising approach for enhancing the reasoning capabilities of language models without the computational overhead of full System 2 processing, a key challenge for making these models practical for real-world deployment.

Reference: https://arxiv.org/abs/2407.06023v1