Summary
In this paper, the authors conduct a comprehensive analysis of the effectiveness of chain-of-thought (CoT) prompting for eliciting reasoning from large language models (LLMs). They begin with a quantitative meta-analysis of over 100 papers that report CoT results, finding that CoT yields strong performance gains primarily on tasks involving math or logic, with much smaller gains on other task types.
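To make the meta-analysis concrete, here is a minimal sketch of the kind of aggregation it involves: averaging reported CoT-versus-direct-answer deltas per task category. The record fields and the toy numbers below are illustrative placeholders, not the paper's data or code.

```python
# Sketch: average reported CoT improvement over direct answering, grouped by task category.
from collections import defaultdict

def mean_delta_by_category(records):
    """records: dicts with 'category' and 'delta' (CoT accuracy minus direct-answer accuracy)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in records:
        sums[r["category"]] += r["delta"]
        counts[r["category"]] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

if __name__ == "__main__":
    toy = [  # illustrative numbers only, not results from the paper
        {"category": "math", "delta": 0.12},
        {"category": "symbolic", "delta": 0.10},
        {"category": "commonsense", "delta": 0.01},
    ]
    print(mean_delta_by_category(toy))
```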
The authors then run their own evaluations on 20 datasets across 14 LLMs, corroborating the findings of the meta-analysis. They observe that on datasets like MMLU, directly generating the answer without CoT yields almost the same accuracy as CoT, unless the question or the model's response contains an equals sign, which signals the presence of symbolic operations and reasoning.
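A minimal sketch of this equals-sign split is shown below, assuming per-question records with fields `question`, `response`, `direct_correct`, and `cot_correct`; these field names are hypothetical and not taken from the authors' code.

```python
# Sketch: partition evaluation records by the presence of '=' and compare CoT vs. direct accuracy.

def split_by_equals_sign(records):
    """Split records by whether '=' appears in the question or the model's response."""
    with_eq, without_eq = [], []
    for r in records:
        (with_eq if "=" in r["question"] or "=" in r["response"] else without_eq).append(r)
    return with_eq, without_eq

def accuracy(records, key):
    """Fraction of records marked correct under a condition ('direct_correct' or 'cot_correct')."""
    return sum(r[key] for r in records) / len(records) if records else float("nan")

def report(records):
    with_eq, without_eq = split_by_equals_sign(records)
    for name, subset in (("with '='", with_eq), ("without '='", without_eq)):
        gap = accuracy(subset, "cot_correct") - accuracy(subset, "direct_correct")
        print(f"{name}: CoT minus direct accuracy = {gap:+.3f} over {len(subset)} questions")
```

Under the paper's finding, one would expect the accuracy gap to concentrate in the slice whose questions or responses contain an equals sign.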
Following this observation, the authors further analyze the behavior of CoT on these symbolic reasoning problems. They find that much of CoT's gain comes from improving the model's ability to execute symbolic computations, yet CoT still underperforms a pipeline that delegates execution to a separate symbolic solver. This suggests that while CoT helps on certain types of reasoning, it is often unnecessary or suboptimal: on non-symbolic tasks, direct answering achieves similar accuracy at lower inference cost, and on symbolic tasks, tool-augmented approaches do better.
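The contrast drawn here can be illustrated with a short sketch: letting the LLM both plan and execute (standard CoT) versus having it produce only a formal expression that a symbolic solver evaluates. `ask_llm` is a placeholder callable for any text-generation API; the prompts, answer parsing, and single-expression output format are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: CoT (LLM plans and executes) vs. plan-plus-solver (LLM plans, sympy executes).
import re
import sympy

def cot_answer(question: str, ask_llm) -> str:
    """LLM plans and executes: trust the final number stated in its chain of thought."""
    response = ask_llm(f"{question}\nLet's think step by step, then state the final answer.")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return numbers[-1] if numbers else response.strip()

def plan_plus_solver_answer(question: str, ask_llm) -> str:
    """LLM only plans: it writes an arithmetic expression, and a symbolic solver evaluates it."""
    plan = ask_llm(f"{question}\nRespond with a single arithmetic expression that computes the answer.")
    expression = plan.strip().splitlines()[-1]   # assume the expression is on the last line
    return str(sympy.sympify(expression))        # execution delegated to the symbolic solver
```

Comparing these two paths isolates where CoT's benefit lies: errors in the second path stem almost entirely from planning, so any remaining gap reflects the model's ability to execute the computation itself.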
The authors conclude that CoT's utility is often circumscribed by the availability of more powerful tools for specific reasoning tasks. They suggest that moving beyond prompt-based CoT to new paradigms that better leverage intermediate computation could be a fruitful direction for future research, especially for extending these benefits beyond math and symbolic reasoning to a wider range of applications.
Reference: https://arxiv.org/abs/2409.12183