Key Points

1. The paper introduces the hierarchical reasoning aggregation framework AoR (Aggregation of Reasoning) to address the limitations of Large Language Models (LLMs) in complex reasoning tasks. The framework selects answers based on the evaluation of reasoning chains and incorporates dynamic sampling to adjust the number of reasoning chains based on the complexity of the task.

2. It is observed that the current approach of ensembling reasoning chains based on the frequency of answers is insufficient, as it fails to accurately select the correct answer when incorrect answers outnumber the correct ones. This prompts the need for the hierarchical reasoning aggregation framework AoR.

3. The AoR approach involves two key stages: local-scoring and global-evaluation. In the local-scoring phase, reasoning chains are scored within each answer group based on the rigor of the reasoning process and the appropriateness of the steps. In the global-evaluation phase, the highest-scoring chains from different answer groups are compared, the most logically coherent chain is chosen, and its corresponding answer is selected as the final output.

4. The dynamic sampling process in AoR adjusts the number of reasoning chains based on the LLM's confidence in the optimal reasoning chain. This process effectively balances precision in outcomes with optimal use of computational resources.

5. Experimental results across mathematical, commonsense, and symbolic reasoning tasks demonstrate that AoR outperforms several strong baselines, including Chain-of-Thought prompting, Self-Consistency, and Progressive-Hint Prompting, achieving consistent and significant improvements across multiple datasets.

6. The paper details the experimental configuration, including the prompting techniques used, the backbone LLMs employed, and the implementation details for AoR, such as how reasoning chains are sampled and scored and how the termination criterion for dynamic sampling is set.

7. An illustrative example of the dynamic sampling process on the AQuA and GSM8K datasets demonstrates how the strategy reduces unnecessary computation on simple queries while devoting more rigorous analysis to complex ones.

Summary

The paper discusses recent advancements in Chain-of-Thought (CoT) prompting and introduces the Aggregation of Reasoning (AoR) framework to enhance answer selection in Large Language Models (LLMs) for complex reasoning tasks. The authors highlight the limitations of current reasoning approaches, especially in scenarios where correct answers are in the minority. To address these shortcomings, they propose the hierarchical reasoning aggregation framework AoR, which selects answers based on the evaluation of reasoning chains. Additionally, AoR incorporates dynamic sampling, adjusting the number of reasoning chains according to the task's complexity. Experimental results show that AoR outperforms prominent ensemble methods while using computational resources more efficiently.

The paper begins by discussing the limitations of LLMs in complex reasoning tasks and reviews the CoT prompting technique designed to address them. CoT generates a series of intermediate steps that lead to the final answer, thereby reducing the complexity of each individual step and offering a new perspective on complex reasoning tasks. The authors highlight the randomness inherent in CoT's single-chain sampling process and describe how Self-Consistency mitigates it by modulating the sampling temperature to collect a diverse set of reasoning chains and then selecting the most consistent answer as the final prediction.

Challenges Faced by LLMs in Answer Selection
The paper provides an illustrative example from the AQuA dataset to demonstrate the challenges faced by LLMs in selecting the correct answers due to the overwhelming presence of erroneous candidates. The authors conduct a pilot analysis on samples across various reasoning tasks and observe that the majority voting mechanism fails in scenarios where the incorrect answers outnumber the correct ones.
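
To make the failure mode concrete, here is a minimal Python sketch of the frequency-based (majority voting) mechanism; the vote counts are hypothetical, not figures from the paper:

```python
from collections import Counter

# Hypothetical tally (not from the paper): answers read off 10 reasoning
# chains sampled for one AQuA-style question whose correct answer is "B".
sampled_answers = ["C", "C", "C", "C", "C", "B", "B", "B", "A", "D"]

# Self-Consistency's majority vote picks the most frequent answer.
majority_answer, votes = Counter(sampled_answers).most_common(1)[0]
print(majority_answer, votes)  # -> C 5: the incorrect answer wins

# Because incorrect answers outnumber the correct one, frequency-based
# ensembling fails no matter how sound the minority chains' reasoning is.
```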

Hierarchical Reasoning Aggregation Framework (AoR) Design
Motivated by these limitations, the authors introduce the hierarchical reasoning aggregation framework AoR, which focuses on the reasoning chains rather than solely on the predicted answers. AoR first groups chains by their respective answers and then applies a two-phase evaluation process. The local-scoring phase evaluates reasoning chains with identical answers, emphasizing the soundness of the reasoning process and the appropriateness of the reasoning steps. The global-evaluation phase then compares the top-scoring chains from different answer groups and selects the most logically coherent and methodically valid one, whose answer becomes the final output.
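
The following Python sketch captures this two-phase structure under simplifying assumptions: the function name, the `top_k` cutoff, and the stand-in scoring callables are illustrative inventions, whereas the paper scores chains with LLM evaluation prompts.

```python
from collections import defaultdict

def aggregation_of_reasoning(chains, local_score, global_score, top_k=2):
    """Two-phase aggregation over (reasoning_text, answer) pairs.
    `local_score` and `global_score` stand in for LLM-based evaluation
    prompts; here they are plain callables returning a number."""
    # Group reasoning chains by the answer they arrive at.
    groups = defaultdict(list)
    for reasoning, answer in chains:
        groups[answer].append(reasoning)

    # Local-scoring: within each answer group, rank chains by the
    # soundness of their reasoning and keep the top-k representatives.
    representatives = []
    for answer, group in groups.items():
        ranked = sorted(group, key=local_score, reverse=True)
        representatives.extend((r, answer) for r in ranked[:top_k])

    # Global-evaluation: compare representative chains across answer
    # groups; the answer of the most coherent chain is the final output.
    scored = [(global_score(r), answer) for r, answer in representatives]
    best_score, best_answer = max(scored, key=lambda t: t[0])
    return best_answer, best_score
```

Keeping only a few representatives per group means the expensive global comparison scales with the number of distinct answers rather than with the total number of sampled chains.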

Additionally, AoR includes dynamic sampling, which adjusts the number of reasoning chains to the complexity of the task. The paper describes this process in detail: additional reasoning chains are sampled and evaluated until the LLM's confidence in its optimal reasoning chain and answer satisfies the termination criterion.
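
A minimal sketch of this loop, assuming hypothetical `sample_chains` and `evaluate` helpers and illustrative hyperparameters (the paper sets its own threshold and sampling settings):

```python
def dynamic_sampling(question, sample_chains, evaluate,
                     threshold=0.9, batch_size=5, max_chains=40):
    """Sample reasoning chains in batches until the aggregated answer
    is trusted. `sample_chains(question, n)` draws n chains from the
    LLM at non-zero temperature; `evaluate(chains)` runs the two-phase
    aggregation and returns an (answer, confidence) pair."""
    chains, answer = [], None
    while len(chains) < max_chains:
        chains += sample_chains(question, n=batch_size)
        answer, confidence = evaluate(chains)
        # Terminate once confidence in the optimal reasoning chain
        # clears the threshold: simple queries stop after the first
        # batch, while harder queries trigger further sampling.
        if confidence >= threshold:
            break
    return answer
```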

Experimental Results and AoR's Performance across Reasoning Tasks
The experimental results across various reasoning tasks show that AoR effectively enhances the reasoning performance of LLMs. The authors compare AoR to several strong baselines, demonstrating superior performance and cost efficiency. The paper presents a detailed comparison of AoR with representative reasoning chain ensemble methods, with comprehensive results for mathematical, commonsense, and symbolic reasoning tasks. AoR consistently outperforms the baselines across these tasks, underscoring its effectiveness in overcoming the shortcomings of frequency-based answer selection.

In conclusion, the paper presents the hierarchical reasoning aggregation framework AoR as a promising approach to enhancing answer selection in LLMs for complex reasoning tasks. The detailed experimental results support AoR's effectiveness in addressing the limitations of current reasoning ensembles and demonstrate its superiority over prominent ensemble methods. The paper offers valuable insights into Chain-of-Thought prompting and into aggregating reasoning chains, rather than answer frequencies, to improve LLMs' reasoning capabilities.

Reference: https://arxiv.org/abs/2405.129...