Key Points

1. As large language models (LLMs) exceed human-level capabilities, it becomes increasingly challenging to provide full-scale and accurate supervision for these models. Weak-to-strong learning, which leverages a less capable model to unlock the latent abilities of a stronger model, proves valuable in this context.

2. The efficacy of weak-to-strong learning for complex reasoning tasks remains untested, and existing methods offer no way to keep the strong model from blindly imitating the weak supervisor, errors included.

3. The paper introduces a progressive learning framework that enables the strong model to autonomously refine its training data, without requiring input from a more advanced model or human-annotated data.

4. The framework begins with supervised fine-tuning on a selectively curated, small but high-quality dataset, followed by preference optimization on contrastive samples identified by the strong model itself.

5. Extensive experiments on the GSM8K and MATH datasets demonstrate that the proposed method significantly enhances the reasoning capabilities of Llama2-70b using three separate weak models.

6. The method is further validated in a forward-looking experimental setup, where Llama3-8b-instruct effectively supervises Llama3-70b on the highly challenging OlympicArena dataset.

7. Fine-tuning the strong model on the full set of weak supervision, while effective for classification tasks, falls short on complex reasoning tasks.

8. The preference optimization phase enables the strong model to learn from errors made by the weak supervisor, ultimately surpassing the strong model fine-tuned on gold-standard solutions in challenging scenarios.

9. This work paves the way for a more scalable and sophisticated strategy to enhance AI reasoning capabilities.

Summary

This paper introduces a progressive learning framework that enables a strong language model to autonomously refine its training data without relying on more advanced models or human-annotated data. The framework operates in a weak-to-strong learning setting, where a less capable model provides supervisory signals to a stronger model. The key contributions of this work are:

1. The authors demonstrate that naively fine-tuning the strong model on the full dataset generated by the weak model is inadequate for complex reasoning tasks, despite its effectiveness in classification tasks.

2. The proposed method trains the strong model in two stages. In the first stage, the strong model is fine-tuned on a selective subset of the data generated by the weak model and by the strong model itself, where consistency of the final answers is used to filter the training data (a sketch of this filtering step appears after this list). This stage significantly improves the reasoning capabilities of the strong model compared to naive fine-tuning.

3. In the second stage, the strong model leverages its own confidence to identify contrastive samples, which are then used for preference optimization.
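
The first-stage filtering step can be illustrated with a minimal sketch. This is not the authors' code: it assumes each sample is a dict with `question` and `solution` fields, that weak- and strong-generated samples are aligned by question, and that final answers follow a GSM8K-style `####` marker (the `extract_final_answer` helper is hypothetical).

```python
# Minimal sketch of the stage-1 answer-consistency filter described above.
# Assumptions (illustrative, not the paper's code): samples are dicts with
# "question" and "solution" keys, and solutions end with "#### <answer>".

def extract_final_answer(solution: str) -> str:
    """Extract the final answer, assuming a GSM8K-style '#### <answer>' marker."""
    return solution.split("####")[-1].strip()

def filter_by_answer_consistency(weak_samples, strong_samples):
    """Keep weak- and strong-generated solutions for supervised fine-tuning only
    when their final answers agree, treating agreement as a proxy for correctness."""
    curated = []
    for weak, strong in zip(weak_samples, strong_samples):
        if extract_final_answer(weak["solution"]) == extract_final_answer(strong["solution"]):
            # Both solutions are kept, since their agreement suggests reliability.
            curated.append({"question": weak["question"], "solution": weak["solution"]})
            curated.append({"question": strong["question"], "solution": strong["solution"]})
    return curated
```

The design choice here is that agreement between the two models' final answers serves as a cheap correctness proxy, so only mutually consistent solutions enter the supervised fine-tuning set.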

The preference optimization stage enables the strong model to learn from the mistakes made by the weaker model, ultimately surpassing the performance of the strong model fine-tuned on ground-truth solutions in challenging scenarios. Experiments on the GSM8K and MATH datasets show that the proposed method outperforms naive fine-tuning by a large margin. A sketch of how such contrastive samples might be assembled appears below.
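
This is a minimal sketch of second-stage contrastive-sample construction, assuming the strong model's confidence is approximated by majority voting over several of its own sampled solutions (an illustrative assumption, not necessarily the paper's exact procedure); the resulting pairs follow the prompt/chosen/rejected format that DPO-style preference trainers typically consume.

```python
# Minimal sketch of stage-2 contrastive pair construction. Confidence is
# approximated here by majority voting over sampled solutions (an assumption
# for illustration); pairs are formatted for a DPO-style trainer.
from collections import Counter

def extract_final_answer(solution: str) -> str:
    """Same hypothetical helper as above: answers follow a '#### <answer>' marker."""
    return solution.split("####")[-1].strip()

def build_preference_pairs(question, sampled_solutions):
    """Treat solutions carrying the strong model's most frequent final answer as
    'chosen' and solutions with conflicting answers as 'rejected'."""
    answers = [extract_final_answer(s) for s in sampled_solutions]
    majority_answer, _ = Counter(answers).most_common(1)[0]

    chosen = [s for s, a in zip(sampled_solutions, answers) if a == majority_answer]
    rejected = [s for s, a in zip(sampled_solutions, answers) if a != majority_answer]

    # Each (chosen, rejected) combination becomes one preference-optimization example.
    return [
        {"prompt": question, "chosen": c, "rejected": r}
        for c in chosen for r in rejected
    ]
```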

Additionally, the authors conduct experiments on the extremely challenging OlympicArena dataset, demonstrating the scalability and generalizability of their approach in a setup that approximates future weak-to-strong reasoning scenarios. This work introduces a more scalable and sophisticated strategy for enhancing the reasoning abilities of large language models, paving the way for AI systems that can tackle currently unsolvable mathematical and physical challenges.

Reference: https://arxiv.org/abs/2407.13647