Key Points

1. Training on high-quality synthetic data from strong language models (LMs) is a common strategy to improve the reasoning performance of LMs.

2. The paper investigates the trade-offs between generating synthetic data using a stronger but more expensive (SE) model versus a weaker but cheaper (WC) model.

3. The paper evaluates the generated data across three key metrics: coverage, diversity, and false positive rate. At a fixed compute budget, data from WC models tends to have higher coverage and diversity, but also exhibits a higher false positive rate.

4. The paper explores three finetuning setups: knowledge distillation, self-improvement, and a novel weak-to-strong improvement setup where a weaker LM teaches reasoning to a stronger LM.

5. The paper finds that models finetuned on WC-generated data consistently outperform those trained on SE-generated data across multiple benchmarks and multiple choices of WC and SE models.

6. The results challenge the prevailing practice of relying on SE models for synthetic data generation, suggesting that WC may be the compute-optimal approach for training advanced LM reasoners.

7. The paper shows that at a fixed sampling compute budget, repeated sampling from a smaller model can achieve higher coverage and diversity than sampling from a stronger but more expensive model.

8. Under the same compute budget, finetuning LMs on data generated by the small LM consistently outperforms finetuning on data generated by the large LM.

9. The results establish a solid foundation for training the next generation of LM reasoners, especially as the performance gap between small and large LMs continues to narrow over time.

Summary

Results and Comparisons
This research paper examines the trade-offs between using a stronger but more expensive language model (SE) versus a weaker but cheaper language model (WC) to generate synthetic data for training large language model (LLM) reasoners. The paper evaluates the generated data on three key metrics: coverage (number of unique problems solved), diversity (average number of unique solutions per problem), and false positive rate (percentage of solutions with correct final answers but incorrect reasoning).
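These three metric definitions can be made concrete with a short sketch. The snippet below is illustrative, not the paper's code: the `Problem` container and the boolean labels per sampled solution are assumptions, and the paper additionally deduplicates solutions when measuring diversity, which this sketch skips.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    # Sampled solutions for one problem: each entry is
    # (final_answer_correct, reasoning_correct). Hypothetical labeling scheme.
    samples: list


def coverage(problems):
    """Fraction of problems with at least one sample whose final answer is correct."""
    return sum(any(ok for ok, _ in p.samples) for p in problems) / len(problems)


def diversity(problems):
    """Average number of correct samples per problem.

    The paper counts *unique* correct solutions; here every correct sample
    is treated as unique for simplicity.
    """
    return sum(sum(ok for ok, _ in p.samples) for p in problems) / len(problems)


def false_positive_rate(problems):
    """Among samples with a correct final answer, the fraction whose reasoning is flawed."""
    correct = [good for p in problems for ok, good in p.samples if ok]
    return sum(not good for good in correct) / len(correct)
```

In this toy form, a WC model that solves more distinct problems raises `coverage`, more correct samples per problem raises `diversity`, and correct-answer/wrong-reasoning samples raise `false_positive_rate`.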

The results show that although the WC model has a higher false positive rate, it can generate more samples within a fixed compute budget and thus achieves higher coverage and diversity compared to the SE model. Specifically, the Gemma2-9B (WC) model achieved 11% higher coverage and 86% higher diversity than the Gemma2-27B (SE) model on the MATH dataset.
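The compute-matched comparison behind this result can be sketched as follows. This is a simplified model, assuming per-sample inference cost scales linearly with parameter count and that both models produce solutions of comparable length; the function name is illustrative.

```python
def compute_matched_samples(samples_se: int, params_se_b: float, params_wc_b: float) -> int:
    """Number of WC samples affordable at the same FLOP budget as `samples_se`
    SE samples, assuming cost per sample scales linearly with parameter count."""
    return int(samples_se * params_se_b / params_wc_b)


# Under this assumption, a budget that buys 10 solutions per problem from a
# 27B model buys 30 solutions per problem from a 9B model.
print(compute_matched_samples(10, 27.0, 9.0))
```

It is this 3x sampling advantage (for the Gemma2-27B vs. Gemma2-9B pairing) that lets the WC model reach higher coverage and diversity at equal cost, despite being individually weaker.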

The paper then compares the performance of LLMs finetuned on the synthetic data from the WC and SE models across three finetuning paradigms: knowledge distillation, self-improvement, and a new "weak-to-strong improvement" setup where a weaker model is used to improve a stronger model. Across multiple benchmarks, the models finetuned on the WC-generated data consistently outperformed those trained on the SE-generated data, with relative gains of up to 31.6%.

Weak Models for Enhanced Reasoning
These results challenge the prevailing practice of relying on stronger models for synthetic data generation, and suggest that the weaker but cheaper model may be the compute-optimal choice for training advanced LLM reasoners, especially as the performance gap between small and large LLMs continues to narrow over time.

Reference: https://arxiv.org/abs/2408.167...