Key Points

1. The advancement of large language models (LLMs) for real-world applications hinges critically on enhancing their reasoning capabilities.

2. The work examines the reasoning abilities of LLMs through a geometric lens, establishing a connection between the expressive power of LLMs and the density of their self-attention graphs.

3. The analysis demonstrates that the density of the self-attention graphs defines the intrinsic dimension of the inputs to the MLP blocks, and a higher intrinsic dimension implies greater expressive capacity of the LLM.

4. Increasing the model size and context length facilitates higher attention density and consequently better reasoning.

5. The work provides empirical evidence linking the geometric framework to recent advancements in methods aimed at enhancing the reasoning capabilities of LLMs.

6. The geometry of the transformer layer, a key component of LLMs, is characterized, showing that the density of token interactions in the self-attention (multi-head attention, MHA) module governs the complexity of the functions the subsequent MLP layer can represent (a minimal sketch of one way to measure this density appears after this list).

7. Experiments reveal that as the number of examples provided in the prompt increases, the intrinsic dimension of the LLM's representations also rises, and a significant rise in the intrinsic dimension at the final layer strongly correlates with enhanced reasoning performance.

8. The geometry of the LLM's internal representations plays a crucial role in its ability to reason effectively, and increasing the intrinsic dimension can improve the LLM's reasoning capabilities.

9. The work charts a path toward improving reasoning and advancing LLMs while deepening our understanding of these models and their behavior.
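
To make the notion of self-attention graph density concrete, here is a minimal Python sketch (assuming NumPy). It treats the post-softmax attention matrix of one layer as a weighted graph over tokens and reports the fraction of edges whose weight exceeds a small threshold, averaged over heads. The function name attention_graph_density, the threshold value, and this thresholded-edge definition of density are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def attention_graph_density(attn_weights, threshold=1e-2):
    """Fraction of token-to-token attention edges whose weight exceeds
    `threshold`, averaged over heads.

    attn_weights: array of shape (num_heads, seq_len, seq_len) holding the
    post-softmax attention probabilities of one layer for one input.
    Note: this thresholded-edge definition is an illustrative proxy only.
    """
    num_heads, seq_len, _ = attn_weights.shape
    edges = attn_weights > threshold                 # boolean adjacency matrix per head
    return edges.sum(axis=(1, 2)).mean() / (seq_len * seq_len)

# Toy usage: random post-softmax attention for 8 heads over a 16-token sequence.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 16, 16))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(attention_graph_density(attn))
```

Under this proxy, a denser graph simply means that more token pairs exchange non-negligible information inside the MHA module.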

Summary

This research paper explores the relationship between the reasoning capabilities of large language models (LLMs) and the geometry of their self-attention graphs. The key findings are:
1. The density of the self-attention graphs in LLMs is connected to the intrinsic dimension of the inputs to the MLP blocks; a higher intrinsic dimension implies greater expressive capacity of the LLM (one simple way to estimate such a dimension is sketched after this list).
2. The paper provides a theoretical analysis and toy examples demonstrating that the intrinsic dimension of the inputs to the MLP blocks is influenced by two key factors: the number of attention heads and the context length (input sequence length).
3. Empirical evidence is presented linking this geometric framework to recent advancements in methods aimed at enhancing the reasoning capabilities of LLMs. Experiments on the GSM8K-Zero dataset show that increasing the intrinsic dimension, particularly at the final layer of the LLM, strongly correlates with improved reasoning performance.
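
The following sketch shows one simple way the intrinsic dimension of the MLP-block inputs could be estimated: count the principal components needed to explain most of the variance of the token representations entering the block. The function name intrinsic_dimension_pca, the var_threshold value, and the PCA-based estimator itself are assumptions made for illustration; the paper may use a different estimator.

```python
import numpy as np

def intrinsic_dimension_pca(hidden_states, var_threshold=0.99):
    """PCA-based proxy for the intrinsic dimension of a set of hidden vectors
    (e.g., the token representations entering an MLP block).

    hidden_states: array of shape (num_tokens, hidden_size).
    Returns the number of principal components needed to explain
    `var_threshold` of the total variance.
    """
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    # Singular values of the centered data give the per-component variance.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    explained = np.cumsum(variance) / variance.sum()
    return int(np.searchsorted(explained, var_threshold) + 1)

# Toy usage: 64 token vectors of width 128 lying close to a 10-dimensional subspace.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(64, 10)) @ rng.normal(size=(10, 128))
print(intrinsic_dimension_pca(low_rank + 0.01 * rng.normal(size=(64, 128))))
```

The toy usage should report a value near 10 despite the ambient width of 128, which is the kind of gap between ambient and intrinsic dimension that the paper's analysis is concerned with.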

The paper argues that the geometry of the LLM's internal representations, as captured by the density of the self-attention graphs and the resulting intrinsic dimension of the MLP inputs, plays a crucial role in its ability to reason effectively. Increasing the intrinsic dimension, whether by adding attention heads or by providing longer input contexts, enhances the expressive power of the MLP blocks and leads to better reasoning, as the toy experiment below illustrates.
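
The toy experiment below isolates the context-length effect: random token embeddings are passed through a randomly initialized multi-head attention layer, and the effective dimension of the concatenated head outputs (the input to the MLP block) is measured for growing sequence lengths. Everything here, including the random weights, the PCA-style effective_dim proxy, and the chosen sizes, is an assumption made for illustration; it is not the paper's experiment and no trained LLM is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, heads = 64, 4
head_dim = hidden // heads

def effective_dim(matrix, var_threshold=0.99):
    # Number of principal components explaining `var_threshold` of the variance.
    s = np.linalg.svd(matrix - matrix.mean(axis=0), compute_uv=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(explained, var_threshold) + 1)

for seq_len in (4, 16, 64, 256):
    x = rng.normal(size=(seq_len, hidden))               # toy token embeddings
    outputs = []
    for _ in range(heads):                               # one set of random projections per head
        Wq, Wk, Wv = (rng.normal(size=(hidden, head_dim)) for _ in range(3))
        scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(head_dim)
        attn = np.exp(scores - scores.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)          # softmax over keys
        outputs.append(attn @ (x @ Wv))                  # per-head output: (seq_len, head_dim)
    mha_out = np.concatenate(outputs, axis=1)            # concatenated heads feed the MLP block
    print(seq_len, effective_dim(mha_out))
```

With random weights, the measured dimension climbs as the sequence length grows and then saturates near the hidden width, mirroring, at a cartoon level, the claim that longer contexts and more heads give the MLP blocks a higher-dimensional input to work with.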

This geometric perspective provides a principled approach to understanding and advancing the reasoning abilities of LLMs, beyond simply scaling up model size or input length. The findings suggest that tailoring the internal representations of LLMs to have higher intrinsic dimension could be a promising avenue for improving their reasoning performance, while potentially mitigating the computational costs associated with larger models and longer inputs.

Reference: https://arxiv.org/abs/2407.02678