Key Points

1. Fine-tuning language models on human-generated data is limited by the quantity and diversity of high-quality human data, prompting the exploration of methods to go beyond human data on tasks with scalar feedback.

2. The ReST^EM method, based on expectation-maximization, iterates between generating samples from the model, filtering them with binary feedback, and fine-tuning the model on the filtered samples (a minimal sketch follows this list); it scales favorably with model size and yields significant improvements over fine-tuning only on human data.

3. ReST^EM has demonstrated success in enhancing language models across diverse domains, including machine translation, semantic parsing, preference alignment, and elementary reasoning, with the potential to substantially reduce dependence on human-generated data.

4. The method yields significant improvements in mathematical reasoning (MATH) and code generation (APPS) when applied to PaLM 2 models of varying scales, with larger models showing larger performance gains.

5. Models trained with ReST^EM substantially improve test performance on challenging benchmarks, though the gains diminish after multiple iterations, which may indicate overfitting to the training problems.

6. Fine-tuning with ReST^EM also substantially improves the model's pass@k and majority-voting performance for a fixed number of samples and sampling temperature.

7. ReST^EM reduces dependency on human data and shows positive transfer, delivering improved performance on related but held-out benchmarks.

8. Experiments show that ReST^EM improves performance efficiently as the dataset size grows, making it a sample-efficient method for language model training.

9. The method also shows promise for enhancing the general capabilities of language models and performs well on challenging real-world evaluation tasks, demonstrating its effectiveness in problem-solving scenarios.
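
The following minimal sketch illustrates the generate/filter/fine-tune loop from point 2. The callables `sample_completions`, `binary_reward`, and `finetune` are hypothetical placeholders standing in for model- and task-specific implementations (e.g., final-answer matching for MATH or unit-test execution for APPS); they are not APIs from the paper.

```python
from typing import Callable, Iterable, List, Tuple

def rest_em(
    base_model,
    problems: Iterable[str],
    sample_completions: Callable,  # (model, problem, n) -> list of candidate solutions
    binary_reward: Callable,       # (problem, solution) -> 0 or 1
    finetune: Callable,            # (base_model, data) -> fine-tuned model
    num_iterations: int = 3,
    samples_per_problem: int = 32,
):
    """Sketch of the ReST^EM loop; the three callables are hypothetical placeholders."""
    model = base_model
    for _ in range(num_iterations):
        dataset: List[Tuple[str, str]] = []
        # Generate (E-step): sample candidate solutions from the current model.
        for problem in problems:
            for solution in sample_completions(model, problem, samples_per_problem):
                # Filter with binary feedback; keep only solutions scored as correct.
                if binary_reward(problem, solution) == 1:
                    dataset.append((problem, solution))
        # Improve (M-step): fine-tune from the base model on the filtered data,
        # rather than continuing from the previous iteration's checkpoint.
        model = finetune(base_model, dataset)
    return model
```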

Summary

Research Focus
The paper explores ReST^EM, a Reinforced Self-Training method based on expectation-maximization, to improve large language models (LLMs) by generating high-quality synthetic training data. The study investigates how effective and scalable ReST^EM is compared to human-generated data on complex problem-solving tasks, namely mathematical problem-solving (MATH) and code generation (APPS), using PaLM 2 models.

Study Findings
The findings show significant gains in mathematical reasoning and code generation when ReST^EM is applied to PaLM 2 models of varying scales, with larger performance gains than fine-tuning on human-written data. The study also observes diminishing improvements after a few iterations and emphasizes the potential of self-training with feedback to reduce dependence on human data.
The paper describes the Reinforced Self-Training (ReST) process: generate samples from the model, filter them with binary feedback, fine-tune the model on the filtered samples, and repeat this cycle several times. The study casts this procedure as expectation-maximization (EM), with sample generation and filtering serving as the E-step and fine-tuning as the M-step, a framework that has been applied to enhance language models across diverse domains.
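
Viewed through the EM lens, the Improve step amounts to maximizing a reward-weighted log-likelihood over samples drawn from the current model. The following is an illustrative sketch of that objective; the notation here is assumed for exposition, not taken verbatim from the paper:

```latex
\theta_{t+1} \;=\; \arg\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}\,
\mathbb{E}_{y \sim p_{\theta_t}(\cdot \mid x)}
\big[\, r(x, y)\, \log p_{\theta}(y \mid x) \,\big]
```

With a binary reward r(x, y) in {0, 1}, maximizing this objective reduces to ordinary supervised fine-tuning on the model-generated samples that pass the filter.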

Evaluation Results
The evaluation demonstrates that ReST^EM is both efficient and effective, improving pass@k and majority-voting performance. The results also show positive transfer to related tasks, with minimal degradation relative to the base model when the fine-tuned models are evaluated on a broad suite of tasks.
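
For context, pass@k is typically computed with the standard unbiased estimator used in code-generation evaluations. The sketch below is illustrative and not taken from the paper's implementation; it also includes a simple majority-voting helper over sampled final answers.

```python
from collections import Counter

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    is correct; equals 1 - C(n-c, k) / C(n, k), computed as a stable product."""
    if n - c < k:
        return 1.0
    prob_all_wrong = 1.0
    for i in range(k):
        prob_all_wrong *= (n - c - i) / (n - i)
    return 1.0 - prob_all_wrong

def majority_vote(final_answers):
    """Majority voting: return the most frequent final answer among samples."""
    return Counter(final_answers).most_common(1)[0][0]
```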

Overall, the study provides evidence that ReST^EM is a promising approach for reducing dependence on human-generated data and improving the performance of language models on problem-solving tasks. The findings suggest that self-training with feedback can substantially enhance the capabilities of language models across a range of domains.

Reference: https://arxiv.org/abs/2312.06585