Key Points

1. Large Language Models (LLMs) are trained on the surface textual form of programs and therefore lack a semantic understanding of how programs behave at run time. NExT proposes a method that teaches LLMs to inspect the execution traces of programs and reason about their run-time behavior through chain-of-thought (CoT) rationales (see the trace-annotation sketch after this list).

2. NExT uses self-training to bootstrap a synthetic training set of execution-aware rationales that lead to correct task solutions without laborious manual annotation, aiming to improve the ability of LLMs to reason about program execution when solving coding tasks.

3. NExT applies iterative self-training from weak supervision: only sampled rationales whose accompanying fixes yield correct task solutions are kept for training, so the model's rationales and repair success rate improve after each iteration.

4. NExT significantly improves the PaLM 2 model's ability to reason about program execution in natural language, raising the program fix rate on repair tasks. The improvement is consistent across metrics, and the resulting model outperforms several strong LLMs at program repair.

5. Execution traces are critical to the model's reasoning about run-time behavior, and NExT delivers consistent improvements in rationale quality across evaluation metrics.

6. The model trained with NExT generalizes to scenarios where program traces are unavailable at test time, retaining an improved fix rate even in their absence.

7. Human evaluation of rationale quality confirms that rationales generated with NExT are of significantly higher quality than those from the base PaLM 2-L model, and on par with those from GPT-3.5.

8. NExT's approach differs from existing methods by leveraging execution traces to aid the reasoning process, leading to more succinct and targeted rationales.

9. With continued advances, NExT could be applied to a broader range of program understanding tasks and extended to support more programming languages.
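
To make point 1 concrete: NExT presents run-time behavior to the model by interleaving execution state with the source code it describes. Below is a minimal, illustrative Python sketch of that idea built on the standard `sys.settrace` hook; the helper name `trace_as_comments` and the exact comment format are assumptions for illustration, not the paper's implementation.

```python
import inspect
import sys
from collections import defaultdict


def trace_as_comments(func, *args):
    """Run func(*args) and return its source annotated, per line, with the
    local variables visible the last time control reached that line.
    Illustrative stand-in for an inline-comment trace representation."""
    states = defaultdict(list)   # absolute line number -> snapshots of locals
    target = func.__code__

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is target:
            # 'line' events fire just before a line executes, so each
            # snapshot reflects the effect of the previously run line.
            states[frame.f_lineno].append(dict(frame.f_locals))
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    except Exception:
        pass          # buggy code may raise; keep whatever trace we got
    finally:
        sys.settrace(None)

    source, start = inspect.getsourcelines(func)
    annotated = []
    for offset, line in enumerate(source):
        line = line.rstrip("\n")
        if states[start + offset]:
            last = states[start + offset][-1]
            line += "  # " + ", ".join(f"{k}={v!r}" for k, v in last.items())
        annotated.append(line)
    return "\n".join(annotated)
```

Tracing a buggy function on a failing input then yields an annotated listing in which the faulty state is visible right next to the code:

```python
def buggy_max(xs):
    best = 0          # bug: wrong answer for all-negative inputs
    for x in xs:
        if x > best:
            best = x
    return best


print(trace_as_comments(buggy_max, [-3, -1]))
# The 'return best' line is annotated with best=0, exposing the bug.
```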

Summary

The paper presents NExT, a method designed to teach large language models (LLMs) to understand and reason about the execution traces of programs. The approach uses self-training to bootstrap a synthetic training set of execution-aware rationales, with no need for laborious manual annotation, so that LLMs get better at reasoning about program execution while solving coding tasks.
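
As an illustration of this bootstrapping step, the sketch below shows one round of such a self-training loop. The `Problem` fields, the prompt format, and the `model.sample`/`model.finetune` interfaces are hypothetical stand-ins; only the keep-what-passes-the-tests filter reflects the method summarized here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Problem:
    buggy_code: str                       # program to repair
    trace: str                            # execution trace, e.g. inline comments
    passes_tests: Callable[[str], bool]   # runs unit tests on a candidate fix


def self_training_round(model, problems: List[Problem],
                        samples_per_problem: int = 8,
                        temperature: float = 0.8):
    """One iteration of weakly supervised self-training: sample
    rationale-and-fix candidates, keep only those whose fix passes the
    unit tests, then fine-tune the model on the surviving examples."""
    kept: List[Tuple[str, str]] = []
    for p in problems:
        prompt = (f"{p.buggy_code}\n\n# Execution trace:\n{p.trace}\n\n"
                  "Explain why the program is wrong, then fix it.")
        for _ in range(samples_per_problem):
            rationale, fix = model.sample(prompt, temperature=temperature)
            # Weak supervision: unit-test outcomes select the training
            # data, so no hand-written rationales are required.
            if p.passes_tests(fix):
                kept.append((prompt, rationale + "\n\n" + fix))
    model.finetune(kept)
    return model
```

Repeating this round with the fine-tuned model regenerates the rationale set, which is what drives the per-iteration improvements noted in the key points.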

The method is evaluated on program repair tasks from two benchmarks, Mbpp-R and HumanEvalFix-Plus. The experiments show that NExT improves the fix rate of a PaLM 2 model by a considerable margin: 26.1% absolute on Mbpp-R and 14.3% absolute on HumanEvalFix-Plus. The quality of the rationales is verified by both automated metrics and human raters, which show significant improvements.

The paper also explores NExT's generalization, demonstrating that the model handles scenarios where program traces are absent at test time: even without execution traces, it still achieves a high end-to-end fix rate. Compared against strong language models and a self-training-based program repair baseline, NExT comes out ahead in both end-to-end repair accuracy and rationale quality.

Human Evaluation of NExT
The paper includes an extensive human evaluation of rationale quality, in which raters judge NExT's rationales to be significantly better than those of the base PaLM 2-L model and on par with GPT-3.5's. Moreover, the method outperforms strong language models on the program repair tasks themselves.

Overall, NExT stands out for significantly improving LLMs' reasoning about program execution, with gains in both program repair accuracy and rationale quality. The results demonstrate the effectiveness of self-training for teaching LLMs to reason about code execution while grounding their rationales in the underlying program traces.

Reference: https://arxiv.org/abs/2404.146...