Key Points

1. The paper constructed a dataset called GSMIR that contains mathematical problems with irrelevant information. The irrelevant information is designed to be more thematically related and logically connected to the original problem than in previous datasets.

2. The paper investigated why large language models (LLMs) are affected by irrelevant information, finding that while LLMs can identify irrelevant information, they cannot effectively exclude it on their own during the reasoning process.

3. The paper proposed a novel method called ATF (Analysis to Filtration Prompting) to enhance the robustness of LLMs against irrelevant information. ATF operates in two steps: analysis to identify irrelevant information, followed by filtration to remove it.

4. Experimental results showed that ATF significantly improves the reasoning accuracy of LLMs when dealing with problems containing irrelevant information, across various prompting methods.

5. The paper found that the demonstration data used in the identification step of ATF does not need to match the format of the test data, indicating that LLMs recognize irrelevant information based on the content rather than just learning templates.

6. Further analysis revealed that the irrelevant information that LLMs fail to recognize using ATF is often "weak irrelevant information" that does not significantly interfere with the reasoning process.

7. The paper demonstrated that ATF is highly effective in filtering out "strong irrelevant information" that genuinely interferes with the reasoning of LLMs.

8. The paper discussed the limitation of only considering scenarios with a single piece of irrelevant information, and suggested future research explore methods for handling multiple pieces of irrelevant information.

9. The paper highlighted the need to study the performance of ATF with different LLM architectures as future work.

Summary

Research on Large Language Models and Mathematics Problem Solving
This research paper investigates the reasoning capabilities of large language models (LLMs) when presented with mathematics problems containing irrelevant information. The authors constructed a dataset called GSMIR, which contains such problems, and tested prominent LLMs and prompting techniques on this dataset. The paper reveals that while LLMs can identify irrelevant information in problem descriptions, they do not effectively mitigate the interference it causes once identified. To address this issue, the authors propose a novel method called ATF (Analysis to Filtration Prompting) that aims to enhance the ability of LLMs to identify and self-mitigate the influence of irrelevant information.

Analysis and Filtration Method of ATF
ATF operates in two steps: it first analyzes the problem for irrelevant information, then filters that information out. In the analysis phase, ATF uses prompts to guide LLMs in breaking down the input problem description, analyzing each clause to determine whether it contains irrelevant information, and providing reasons for its conclusions. In the filtration phase, ATF uses prompts to guide LLMs in removing the sentences deemed to contain irrelevant information from the problem description. This produces a new problem description, which the LLMs then reason over using advanced prompting techniques.
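The two phases described above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the prompt wording, the function name `atf_answer`, and the `llm` callable (any function that maps a prompt string to a completion string) are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
ANALYSIS_PROMPT = (
    "Split the problem into clauses. For each clause, state whether it is "
    "irrelevant to the question and explain why.\n\nProblem: {problem}"
)
FILTRATION_PROMPT = (
    "Problem: {problem}\n\nAnalysis: {analysis}\n\n"
    "Rewrite the problem with the irrelevant sentences removed."
)
# Zero-shot Chain-of-Thought is used here as one possible downstream prompting method.
REASONING_PROMPT = "{problem}\nLet's think step by step."


def atf_answer(problem: str, llm: Callable[[str], str]) -> str:
    """Run analysis, then filtration, then reasoning on the filtered problem."""
    # Step 1 (analysis): ask the model to judge each clause and give reasons.
    analysis = llm(ANALYSIS_PROMPT.format(problem=problem))
    # Step 2 (filtration): ask the model to rewrite the problem without
    # the sentences flagged as irrelevant.
    filtered = llm(FILTRATION_PROMPT.format(problem=problem, analysis=analysis))
    # Final step: reason over the filtered problem with a standard prompt.
    return llm(REASONING_PROMPT.format(problem=filtered))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client, and the final prompt can be swapped for any of the prompting methods the paper evaluates.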

Experimental Results and Comparison with Other Prompting Techniques
Experimental results on the GSMIR dataset demonstrate that ATF significantly improves the reasoning accuracy of LLMs when dealing with problems containing irrelevant information, regardless of the prompting method used (Standard, Chain-of-Thought, Zero-shot Chain-of-Thought, Least-to-Most, or Instructed Prompting). The authors also find that the demonstration data used in the identification step of ATF does not need to match the format of the test data, indicating that LLMs recognize irrelevant information from the content itself rather than by learning templates and formats. Overall, this research highlights the importance of enhancing the robustness of LLMs against irrelevant information in reasoning tasks, and the proposed ATF method is a step towards addressing this challenge.
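A comparison of this kind, accuracy with and without a filtering step for each prompting method, could be organized as follows. This is only a sketch under stated assumptions: the names `accuracy` and `compare_with_filter` are hypothetical, each solver is assumed to be a function from problem text to an answer string, and exact string match stands in for whatever answer-scoring rule the paper uses.

```python
from typing import Callable, Dict, List, Tuple

Solver = Callable[[str], str]      # problem text -> answer string
Item = Tuple[str, str]             # (problem text, gold answer)


def accuracy(solve: Solver, items: List[Item]) -> float:
    """Fraction of items whose predicted answer exactly matches the gold answer."""
    if not items:
        return 0.0
    correct = sum(solve(question).strip() == gold for question, gold in items)
    return correct / len(items)


def compare_with_filter(
    solvers: Dict[str, Solver],
    filter_fn: Callable[[str], str],
    items: List[Item],
) -> Dict[str, Tuple[float, float]]:
    """Return {method name: (accuracy on raw problems, accuracy after filtering)}."""
    results = {}
    for name, solve in solvers.items():
        baseline = accuracy(solve, items)
        # Bind `solve` as a default argument so each lambda keeps its own solver.
        filtered = accuracy(lambda q, s=solve: s(filter_fn(q)), items)
        results[name] = (baseline, filtered)
    return results
```

Here `filter_fn` plays the role of the analysis and filtration steps, so the same harness can be reused for every prompting method under test.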

Reference: https://arxiv.org/abs/2408.10615