Key Points

1. The paper introduces rStar, a self-play mutual reasoning approach that significantly improves the reasoning capabilities of small language models (SLMs) without fine-tuning or superior models.

2. rStar decouples reasoning into a self-play mutual generation-discrimination process. The target SLM augments Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher-quality reasoning trajectories, while a second SLM of similar capability acts as a discriminator to verify each trajectory the target SLM generates (a sketch of this generation loop follows the list).

3. Extensive experiments across five SLMs demonstrate that rStar effectively solves diverse reasoning problems drawn from the GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA benchmarks. For example, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, and from 74.53% to 91.13% for LLaMA3-8B-Instruct.

4. The paper highlights two fundamental challenges that hinder the self-improvement of SLMs: (1) SLMs struggle to effectively explore the solution space during reasoning, often getting trapped in low-quality reasoning steps; and (2) even if high-quality reasoning steps are found, SLMs have difficulty evaluating and selecting the correct final answers.

5. To address these challenges, rStar introduces a richer set of reasoning actions into MCTS self-exploration and employs a second SLM as a discriminator to provide unsupervised feedback on the generated reasoning trajectories, improving the accuracy of final solution selection.

6. rStar significantly outperforms state-of-the-art baselines, including single-round inference techniques like few-shot Chain-of-Thought, multi-round prompting approaches such as self-consistency, and self-improvement techniques such as RAP, ToT, self-evaluation, and self-verification.
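
To make the generation side concrete, the Python sketch below shows how a small action space of human-like reasoning steps can drive MCTS expansion. This is a minimal sketch under stated assumptions: the action names and the slm_generate helper are illustrative placeholders, not rStar's actual action set or prompting interface.

```python
# A minimal sketch of MCTS expansion over discrete reasoning actions.
# Action names and `slm_generate` are illustrative placeholders, not
# rStar's actual interface.
import random
from dataclasses import dataclass, field

ACTIONS = [
    "propose_one_step_thought",    # extend the trajectory by one step
    "propose_remaining_steps",     # draft the rest of the solution
    "ask_and_answer_subquestion",  # decompose into a sub-question
    "reanswer_subquestion",        # retry an earlier sub-answer
    "rephrase_question",           # restate the problem
]

@dataclass
class Node:
    state: str                          # reasoning trajectory so far
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                  # accumulated rollout reward

def slm_generate(trajectory: str, action: str) -> str:
    """Placeholder for the target SLM: returns the extended trajectory."""
    return f"{trajectory}\n[{action}] <generated step>"

def expand(node: Node) -> Node:
    """Grow the search tree by applying one reasoning action."""
    action = random.choice(ACTIONS)
    child = Node(state=slm_generate(node.state, action), parent=node)
    node.children.append(child)
    return child

root = Node(state="Question: <problem statement>")
leaf = expand(root)
print(leaf.state)
```

In the full algorithm, rollouts from such expansions are scored and back-propagated as in standard MCTS; selection and backpropagation are omitted here for brevity.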

Summary

The research paper introduces rStar, a self-play mutual reasoning approach that significantly enhances the reasoning capabilities of small language models (SLMs) without fine-tuning or reliance on superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process: a target SLM, augmented with a rich set of human-like reasoning actions, generates candidate reasoning trajectories, while a second SLM acts as a discriminator to verify them. Trajectories on which both models agree are considered mutually consistent and have a higher likelihood of being correct. The paper conducts extensive experiments across five SLMs and diverse reasoning tasks, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. The results show significant gains in reasoning performance, with rStar boosting GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, and from 74.53% to 91.13% for LLaMA3-8B-Instruct.

The extensive experiments and ablation studies show that rStar significantly enhances the reasoning accuracy of SLMs across diverse tasks, outperforming state-of-the-art baselines including single-round inference techniques, multi-round prompting approaches, and self-improvement techniques. The paper also analyzes the effectiveness of self-rewarding in SLMs and the inference cost of rStar, and discusses the respective roles of the generator and the discriminator in the reasoning process. Together, these results demonstrate rStar's success in substantially improving the reasoning capabilities of SLMs and contribute to the development of advanced, self-improved reasoning in small models.

Introduction of the rStar Approach

The research paper presents a novel approach, called rStar, aimed at enhancing the reasoning capabilities of small language models (SLMs) without the need for fine-tuning or access to superior models. rStar introduces a self-play mutual reasoning strategy in which the reasoning process is decomposed into a mutual generation-discrimination procedure: one SLM generates reasoning trajectories using Monte Carlo Tree Search (MCTS) with human-like reasoning actions, and another SLM serves as a discriminator that assesses and verifies the generated trajectories.
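
The summary does not spell out how nodes are chosen during the tree search. A standard selection rule for MCTS, shown here purely for reference rather than as the paper's confirmed formulation, is the UCT criterion, where Q(s, a) is the accumulated reward of action a in state s, N(.) counts visits, and c is an exploration constant:

$$
\mathrm{UCT}(s, a) = \frac{Q(s, a)}{N(s, a)} + c \sqrt{\frac{\ln N(s)}{N(s, a)}}
$$

Larger values of c push the search toward rarely tried reasoning actions, while smaller values exploit steps that have already earned high rewards.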

Evaluation of rStar Approach

To evaluate the effectiveness of rStar, the paper conducts extensive experiments measuring the performance of multiple SLMs on a variety of reasoning tasks. The results demonstrate that rStar significantly improves the reasoning capabilities of these SLMs across the full range of tasks, without fine-tuning or reliance on more advanced models, thereby addressing key limitations of small language models in reasoning.

Performance Gains and Demonstration of rStar Approach

Furthermore, the paper details the specific tasks used for evaluation, highlighting the substantial performance gains rStar achieves on each. The experiments show that rStar enables SLMs to generate more accurate, human-like reasoning trajectories, which translates into improved performance on diverse reasoning tasks. Overall, the research positions rStar as a valuable methodology for strengthening the reasoning capabilities of small language models without fine-tuning or access to superior models, contributing to advances in natural language processing and artificial intelligence.

The paper discusses the challenges that large language models (LLMs) face in complex reasoning and the limitations of existing remedies such as fine-tuning or relying on superior models. It introduces rStar as a novel approach that addresses these challenges by leveraging the knowledge within SLMs themselves, demonstrating improved reasoning without fine-tuning or superior models. rStar decomposes reasoning into a self-play mutual generation-discrimination process: one SLM, acting as the generator, produces candidate reasoning trajectories using MCTS, while another SLM, acting as the discriminator, provides unsupervised feedback on each trajectory based on sampled partial hints. This mutual-consistency process effectively verifies the correctness of reasoning trajectories, leading to improved SLM performance, as sketched below.
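
The Python sketch below illustrates the mutual-consistency check. It is a minimal sketch under stated assumptions: discriminator_complete stands in for the second SLM, and the split-point policy and answer extraction are simplified placeholders rather than the paper's exact procedure.

```python
# A minimal sketch of the mutual-consistency check. The helpers below
# are placeholders standing in for the second SLM and for answer
# parsing; they are not rStar's actual implementation.
import random

def discriminator_complete(question: str, hint_steps: list[str]) -> list[str]:
    """Placeholder: the second SLM would finish the reasoning for
    `question` starting from the revealed hint steps."""
    return hint_steps + [f"<completion for: {question}>"]

def final_answer(steps: list[str]) -> str:
    """Placeholder: extract the final answer from the last reasoning step."""
    return steps[-1].strip()

def mutually_consistent(question: str, candidate_steps: list[str]) -> bool:
    """Reveal a partial prefix of the candidate trajectory as a hint and
    check whether the discriminator independently reaches the same answer."""
    split = random.randint(1, max(1, len(candidate_steps) - 1))
    hint = candidate_steps[:split]
    completion = discriminator_complete(question, hint)
    return final_answer(candidate_steps) == final_answer(completion)
```

Trajectories that pass this check, meaning the generator and the discriminator converge on the same final answer from a shared partial hint, are preferred when selecting the final solution.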

Reference: https://arxiv.org/abs/2408.06195