Key Points

1. Most models exhibit a clear gap between their performance on the original GSM8K test set and the compositional GSM test, calling into question how reliably benchmark scores reflect their reasoning ability.

2. The reasoning gap is particularly pronounced in smaller, more cost-efficient models and in math-specialized models, which reduces their utility in practice.

3. Instruction-following tuning affects LLMs of different sizes in significantly different ways, calling for a re-examination of standard training recipes.

4. Finetuning on GSM8K problems, whether with human-written or synthetic data, leads to task-specific overfitting as training continues.

5. Smaller models benefit more from generating code solutions than from natural-language reasoning when solving compositional problems, highlighting systematic differences in reasoning abilities (see the sketch after this list).

6. The large reasoning gaps are not due to test-set leakage; they result from distraction by the additional context and from poor second-hop reasoning.

7. Cost-efficient and smaller LLMs exhibit a much larger reasoning gap than closed-source frontier LLMs, suggesting their reasoning flaws may be obscured by high scores on prevalent math-reasoning benchmarks.

8. Math-specialized LLMs, particularly smaller models, exhibit similar reasoning gaps and signs of overfitting to standard benchmarks, despite extensive specialized training in mathematics.

9. The researchers' aim is to provide a case study that yields deeper insight into LLM reasoning and prompts a reassessment of how these abilities are evaluated, rather than simply to introduce another reasoning benchmark.
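
To make point 5 concrete, below is a minimal sketch of what a code-form (Program-of-Thought style) solution to a chained problem pair looks like, as opposed to free-form natural-language reasoning. The questions and variable names are illustrative, not taken from the benchmark; the point is that each step becomes an explicit, checkable computation.

```python
# Illustrative code-form solution to a hypothetical chained pair of
# grade-school problems, where the answer to Q1 feeds into Q2.

# Q1: A book costs $8. How much do 3 books cost?
price_per_book = 8
books = 3
answer_q1 = price_per_book * books       # 24

# Q2: Sara has $50 and spends X dollars, where X is the answer to Q1.
#     How much does she have left?
budget = 50
answer_q2 = budget - answer_q1           # 26

print(answer_q1, answer_q2)
```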

Summary

Problem Overview
This research paper examines the problem-solving capabilities of large language models (LLMs) on grade-school math (GSM) word problems. The key findings are as follows:

Reasoning Gap in LLMs
Most LLMs exhibit a significant "reasoning gap" between their performance on compositional GSM problems and on the original GSM8K problems. Compositional GSM problems chain two math word problems together so that the answer to the first problem is used as a variable in the second, meaning the second problem can only be answered correctly if the first one is. Despite strong performance on the standard GSM8K benchmark, most LLMs struggle with this more compositional and contextual reasoning task.
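
As an illustration, here is a minimal sketch of such a compositional pair and one way a reasoning gap could be quantified. The question texts are hypothetical, and the gap formula is an assumption rather than the paper's exact definition: it measures the shortfall of observed compositional accuracy relative to the accuracy expected if the two sub-problems were solved independently.

```python
from dataclasses import dataclass

@dataclass
class CompositionalPair:
    q1: str            # a standard GSM8K-style question
    q2: str            # second question, with X bound to Q1's answer
    answer_q1: int
    answer_q2: int     # correct only if Q1 was solved correctly

# Hypothetical example pair (not taken from the benchmark).
pair = CompositionalPair(
    q1="A tray holds 12 eggs. How many eggs are in 4 trays?",
    q2=("Let X be the answer to Q1. A bakery uses X eggs per day. "
        "How many eggs does it use in 5 days?"),
    answer_q1=48,
    answer_q2=240,     # 48 * 5
)

def reasoning_gap(acc_original: float, acc_compositional: float) -> float:
    """Assumed definition: shortfall of observed compositional accuracy
    relative to the accuracy expected if the two hops were independent
    (the original accuracy squared)."""
    return acc_original ** 2 - acc_compositional

# E.g., 80% on original GSM8K but only 45% on compositional GSM:
print(round(reasoning_gap(0.80, 0.45), 2))  # 0.19
```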

Reasoning Gap in Smaller LLMs
This reasoning gap is particularly pronounced in smaller, more cost-efficient LLMs, as well as in models specialized for math problem-solving. The authors found that instruction-tuning and finetuning had varying effects across LLM sizes, with smaller models benefiting more from instruction-tuning on the original GSM8K problems while showing less improvement on the compositional GSM tasks.

Reasoning Gap Analysis
The analysis suggests that the reasoning gaps are not due to test-set leakage, but rather stem from LLMs being easily distracted by the additional context in the compositional problems and struggling with the second-hop reasoning they require. Even when LLMs answered the first problem correctly, they often made subtle mistakes when using that answer to solve the second problem.
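
A hedged sketch of the kind of error analysis described above: given a model's predictions for the standalone first question and for the chained pair, each compositional failure can be bucketed as a first-hop error (the first answer was already wrong) or a second-hop error (the first answer was right but was misused in the second problem). The record fields and example values below are hypothetical.

```python
from collections import Counter

def classify(record: dict) -> str:
    """Bucket one compositional-pair prediction."""
    if record["pred_q2"] == record["gold_q2"]:
        return "correct"
    if record["pred_q1"] != record["gold_q1"]:
        return "first_hop_error"    # never had the right input to Q2
    return "second_hop_error"       # right input, wrong use of it in Q2

records = [
    {"pred_q1": 48, "gold_q1": 48, "pred_q2": 240, "gold_q2": 240},
    {"pred_q1": 48, "gold_q1": 48, "pred_q2": 192, "gold_q2": 240},  # second hop
    {"pred_q1": 40, "gold_q1": 48, "pred_q2": 200, "gold_q2": 240},  # first hop
]

print(Counter(classify(r) for r in records))
# -> counts of correct / first-hop / second-hop outcomes
```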

Overall, the findings demonstrate that strong performance on standard math reasoning benchmarks does not necessarily translate to robust compositional reasoning abilities. The authors argue that more nuanced evaluation approaches are needed to accurately assess the reasoning capabilities of LLMs, beyond just measuring performance on individual problems. This case study highlights systematic differences in the reasoning abilities of LLMs, even among models of similar size or specialization, which has important implications for the development and deployment of these systems.

Reference: https://arxiv.org/abs/2410.01748