Key Points

1. Recent advances in large language models (LLMs) such as GPT-4 Code Interpreter have shown significant progress on math reasoning problems, particularly on challenging math datasets.

2. OpenAI's GPT-4 Code Interpreter generates and executes code incrementally, feeds each execution's output back into its reasoning, and achieves an impressive zero-shot accuracy (69.7%, as reported in the paper) on the challenging MATH dataset.

3. Previous attempts to enhance LLMs' mathematical reasoning include the Chain-of-Thought (CoT) prompting framework and program-aided approaches that offload computation to a Python interpreter for improved computational accuracy.

4. GPT-4 Code Interpreter's strong performance on math problems is attributed to its ability to generate and execute code, to adjust its problem-solving strategy based on feedback from code execution (self-debugging), and to verify its solutions with code.

5. The introduction of an explicit code-based self-verification (CSV) prompt further enhances GPT-4 Code Interpreter's mathematical problem-solving: the prompt guides the model to verify its own answers using code, leading to improved accuracy.

6. The proposed verification-guided weighted majority voting strategy incorporates each solution's verification state into the majority voting process, weighting verified answers more heavily and thereby improving the reliability of the final answer.

7. Experimental results demonstrate the effectiveness of CSV and verification-guided weighted majority voting, leading to state-of-the-art performance on the MATH, GSM8K, and MMLU-Math datasets.

8. GPT-4 Code Interpreter's proficiency varies across subjects, with noticeably weaker performance in some domains than in others.

9. Examples of GPT-4 Code Interpreter's self-debugging and code-based verification illustrate its ability to correct errors and refine solutions based on the results of code execution; a minimal sketch of such a self-debugging loop follows this list.
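
Viewed from outside the model, the self-debugging behavior in points 4 and 9 amounts to a generate-execute-retry loop: run the generated code, and if it fails, hand the traceback back to the model for another attempt. The sketch below illustrates that loop only; `generate_solution_code` is a hypothetical placeholder for an LLM call, and the retry budget is an assumption, not a value from the paper.

```python
import subprocess
import sys

MAX_ATTEMPTS = 3  # hypothetical retry budget, not a value from the paper


def generate_solution_code(problem, error=None):
    """Placeholder for an LLM call: prompt the model with the problem and,
    on retries, with the traceback from the previous failed attempt."""
    raise NotImplementedError("wire this to an actual LLM API")


def solve_with_self_debugging(problem):
    """Generate code, execute it, and feed any error back for another try,
    mimicking the self-debugging loop described in the paper."""
    error = None
    for _ in range(MAX_ATTEMPTS):
        code = generate_solution_code(problem, error)
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=30,
        )
        if result.returncode == 0:
            return result.stdout.strip()  # clean execution: accept the output
        error = result.stderr  # hand the traceback back to the model
    return None  # every attempt failed
```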

Summary

The paper surveys recent progress in large language models (LLMs) and focuses on OpenAI's GPT-4 Code Interpreter, a strong model for mathematical reasoning. The study examines how code generation and execution enhance LLMs' reasoning capability and proposes a novel self-verification method to further boost mathematical performance. The authors introduce explicit code-based self-verification (CSV), a prompting approach that guides GPT-4 Code Interpreter to verify and, where necessary, adjust its answers using code. The study demonstrates that CSV significantly improves the model's accuracy on challenging math datasets.
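
To make the CSV idea concrete, the snippet below shows the kind of verification code the prompt elicits: after producing an answer, the model writes code that independently checks it and reports a verification state. This is a minimal sketch; the equation is an invented example, the True/False labels mirror the verification states the paper describes, and the actual prompt wording is abstracted away.

```python
from sympy import Eq, simplify, solve, symbols

# Invented example problem: solve x**2 - 5*x + 6 = 0.
x = symbols("x")
equation = Eq(x**2 - 5*x + 6, 0)

candidates = solve(equation, x)  # solution step: yields [2, 3]

# CSV step: verify each candidate by substituting it back into the
# original equation rather than trusting the solver's output.
for root in candidates:
    residual = simplify(equation.lhs.subs(x, root) - equation.rhs)
    state = "True" if residual == 0 else "False"
    print(f"x = {root}: verification state = {state}")
```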

Additionally, the paper provides a systematic analysis of code-usage frequency in GPT-4 Code Interpreter and offers insights into the model's code-generation and self-debugging mechanisms. The authors further introduce a verification-guided weighted majority voting strategy, which leverages verification states to improve on plain majority voting. The study presents detailed experimental results on the MATH, GSM8K, and MMLU-Math datasets, showcasing the effectiveness of the proposed methods, and compares GPT-4 Code Interpreter against other models, demonstrating state-of-the-art results across subjects. The paper also highlights potential future applications of these methods to other LLMs and domains, with worked examples of GPT-4 Code Interpreter's behavior on various mathematical problems.
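
The following is a minimal sketch of verification-guided weighted majority voting, assuming the three verification states (True, Uncertain, False) described in the paper; the weight values are illustrative placeholders, not the authors' tuned settings.

```python
from collections import defaultdict

# Illustrative weights per verification state (not the paper's values).
STATE_WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.2}


def verification_guided_vote(samples):
    """Return the answer with the highest verification-weighted vote total.

    `samples` is a list of (answer, verification_state) pairs, one per
    sampled solution path.
    """
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += STATE_WEIGHTS.get(state, 0.0)
    return max(scores, key=scores.get)


# Plain majority voting would pick "7" (3 votes to 2), but the two verified
# "5" solutions (weight 2.0) outweigh the three unverified "7" ones (0.9).
samples = [("7", "False"), ("7", "Uncertain"), ("7", "False"),
           ("5", "True"), ("5", "True")]
print(verification_guided_vote(samples))  # -> 5
```

The design point the example makes explicit: a verified answer can win even when it is not the most frequent one, which is how the verification states improve on frequency-only voting.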

Additionally, the authors provide an in-depth analysis of the confusion matrix for model verification and list the Python packages used in their experiments. Throughout the paper, they emphasize the significance of code-based self-verification and its potential for building more accurate, detailed datasets to strengthen the mathematical reasoning of open-source LLMs.

Reference: https://arxiv.org/abs/2308.07921