Key Points
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
1. The paper shows that transformers and state-space models (SSMs) are closely related, and develops a framework called structured state space duality (SSD) to connect the two families of models.
2. The SSD framework allows the authors to design a new architecture called Mamba-2, whose core layer is a refinement of the selective SSM used in the original Mamba model and is 2-8x faster, while remaining competitive with transformers on language modeling.
3. The paper provides an equivalence between state space models and a well-studied family of structured matrices called semiseparable matrices. This connection reveals new properties and algorithms for SSMs.
4. The paper significantly improves the theory of linear attention, providing a simple tensor-contraction-based proof and generalizing it to structured masked attention (SMA).
5. The paper connects SSMs and SMA, showing that the two families have a large intersection whose models are duals of each other, possessing both an SSM-like linear form and an attention-like quadratic form.
6. The paper introduces a new SSD algorithm based on block decompositions of semiseparable matrices, which is 2-8x faster than the original Mamba implementation while enabling much larger recurrent state sizes.
7. The paper uses the SSD framework to design the Mamba-2 architecture, which incorporates ideas from attention models such as grouped-value attention heads and normalization layers. Experiments show Mamba-2 matching or outperforming Mamba and Transformer baselines.
8. The paper leverages the connection between SSMs and attention to enable tensor parallelism, sequence parallelism, and efficient handling of variable-length sequences for the Mamba-2 architecture.
9. The paper provides extensive empirical validation of the Mamba-2 model on language modeling, training efficiency, and a challenging multi-query associative recall task.
Summary
Theoretical Framework
The paper presents a new theoretical framework that connects state-space models (SSMs) and attention-based models like Transformers. It shows that these two families of sequence models are closely related, and develops a rich set of connections between SSMs and variants of attention through the lens of structured semiseparable matrices.
SSM Matrix Transformations
The authors first demonstrate that SSMs can be equivalently represented as matrix transformations, where the SSM operator corresponds to multiplication by a sequentially semiseparable (SSS) matrix. This reveals new properties and efficient algorithms for SSMs. A key insight is that different methods of computing state space models can be reframed as various matrix multiplication algorithms on structured semiseparable matrices.
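To make this concrete, the sketch below (a minimal NumPy illustration, not code from the paper; all variable names are illustrative) materializes the semiseparable matrix induced by a diagonal-transition SSM and checks that multiplying by it reproduces the step-by-step recurrence.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 6, 4                           # sequence length, state size
A = rng.uniform(0.5, 1.0, (L, N))     # diagonal transition per step (as in selective SSMs)
B = rng.standard_normal((L, N))       # per-step input projection
C = rng.standard_normal((L, N))       # per-step output projection
x = rng.standard_normal(L)            # a single input channel

# 1) Recurrent form: h_t = A_t * h_{t-1} + B_t x_t,  y_t = <C_t, h_t>
h, y_rec = np.zeros(N), np.zeros(L)
for t in range(L):
    h = A[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# 2) Matrix form: y = M x with M[i, j] = <C_i, (A_{j+1} * ... * A_i) * B_j> for j <= i
M = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        decay = np.prod(A[j + 1:i + 1], axis=0)   # elementwise product of the diagonals
        M[i, j] = C[i] @ (decay * B[j])
y_mat = M @ x

assert np.allclose(y_rec, y_mat)      # same operator, two computation strategies
```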
Linear Attention and Structured Masked Attention
Building on this, the paper significantly advances the theory of linear attention. It provides a simple tensor contraction-based proof of linear attention, and then generalizes it to a new family of structured masked attention (SMA) models.
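The reordering at the heart of this argument can be sketched directly (again an illustrative NumPy example, not the paper's implementation): causal linear attention (L ∘ Q K^T) V with an all-ones lower-triangular mask can be computed either by materializing the quadratic attention matrix or as a running sum of k_t v_t^T outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 8, 4
Q = rng.standard_normal((L, d))
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))

# Quadratic form: materialize the L x L masked attention matrix
mask = np.tril(np.ones((L, L)))          # causal mask of ones
Y_quad = (mask * (Q @ K.T)) @ V

# Linear form: the same contraction, reordered as a running sum of k_t v_t^T
S = np.zeros((d, d))                     # recurrent "state" of size d x d
Y_lin = np.zeros((L, d))
for t in range(L):
    S += np.outer(K[t], V[t])
    Y_lin[t] = Q[t] @ S

assert np.allclose(Y_quad, Y_lin)
```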
The authors show that SSMs and SMA have a large intersection whose models are duals of each other, possessing both an SSM-like linear form and an attention-like quadratic form.
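For the scalar-decay case at the center of SSD, this duality can be checked in a few lines (an illustrative sketch under the assumption A_t = a_t I, not the paper's code): the semiseparable matrix factors into a decay mask applied elementwise to C B^T, so the same output is obtained from a linear-time recurrence and from a quadratic masked-attention product.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 8, 4
a = rng.uniform(0.8, 1.0, L)          # scalar decay per step (A_t = a_t * I)
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))
x = rng.standard_normal(L)

# SSM-like linear form: h_t = a_t h_{t-1} + B_t x_t,  y_t = <C_t, h_t>
h, y_lin = np.zeros(N), np.zeros(L)
for t in range(L):
    h = a[t] * h + B[t] * x[t]
    y_lin[t] = C[t] @ h

# Attention-like quadratic form: y = (L_mask ∘ C B^T) x,
# where L_mask[i, j] = a_{j+1} * ... * a_i is a 1-semiseparable decay mask
L_mask = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        L_mask[i, j] = np.prod(a[j + 1:i + 1])
y_quad = (L_mask * (C @ B.T)) @ x

assert np.allclose(y_lin, y_quad)
```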
These theoretical connections enable new efficient algorithms for SSMs. The paper introduces an SSD algorithm that leverages the structured semiseparable matrix representation, taking advantage of both the linear SSM recurrence and the quadratic dual form. This SSD algorithm is 2-8x faster than Mamba's optimized selective scan implementation, while allowing for much larger recurrent state sizes. Empirically, SSD is highly competitive with optimized softmax attention implementations such as FlashAttention-2.
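The block decomposition can be sketched for the same scalar-decay case (illustrative code only, not the paper's optimized implementation): diagonal blocks of the semiseparable matrix are handled with the quadratic attention-like form, while a single state vector carries information across chunk boundaries through the linear recurrence.

```python
import numpy as np

def ssd_recurrence(a, B, C, x):
    """Reference linear-time form: h_t = a_t h_{t-1} + B_t x_t, y_t = <C_t, h_t>."""
    Lseq, N = B.shape
    h, y = np.zeros(N), np.zeros(Lseq)
    for t in range(Lseq):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssd_chunked(a, B, C, x, chunk=4):
    """Block decomposition: quadratic form inside each chunk,
    one state vector carried across chunk boundaries."""
    Lseq, N = B.shape
    y = np.zeros(Lseq)
    H = np.zeros(N)                                   # state at the end of the previous chunk
    for s in range(0, Lseq, chunk):
        e = min(s + chunk, Lseq)
        aa, Bb, Cc, xx = a[s:e], B[s:e], C[s:e], x[s:e]
        q = e - s
        # intra-chunk: masked attention with the local decay mask (diagonal block)
        mask = np.zeros((q, q))
        for i in range(q):
            for j in range(i + 1):
                mask[i, j] = np.prod(aa[j + 1:i + 1])
        y[s:e] = (mask * (Cc @ Bb.T)) @ xx
        # inter-chunk: contribution of the carried state (low-rank off-diagonal block)
        cum = np.cumprod(aa)                          # decay from chunk start to position i
        y[s:e] += (Cc * (cum[:, None] * H)).sum(axis=1)
        # update the carried state for the next chunk
        to_end = np.array([np.prod(aa[j + 1:]) for j in range(q)])
        H = cum[-1] * H + (to_end[:, None] * Bb * xx[:, None]).sum(axis=0)
    return y

rng = np.random.default_rng(0)
Lseq, N = 32, 8
a = rng.uniform(0.8, 1.0, Lseq)
B, C = rng.standard_normal((Lseq, N)), rng.standard_normal((Lseq, N))
x = rng.standard_normal(Lseq)
assert np.allclose(ssd_recurrence(a, B, C, x), ssd_chunked(a, B, C, x))
```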
The framework also allows the authors to adapt architectural design choices from the attention community to build new improved SSM architectures. They introduce the analog of attention "heads" to SSMs, and use these ideas to design a new Mamba-2 architecture. Mamba-2 modifies the original Mamba block to enable tensor parallelism, and uses the SSD algorithm as the core sequence mixing layer.
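A shape-level sketch of this head structure follows (assumed shapes; illustrative code, not the released Mamba-2 implementation): the input X carries H heads of dimension P, the decay is a scalar per head and time step, and B, C are shared across the heads in each group, mirroring multi-value / grouped attention patterns.

```python
import numpy as np

def multihead_ssd(X, a, B, C):
    """X: (L, H, P) per-head inputs, a: (L, H) per-head scalar decay,
    B, C: (L, G, N) shared across the H // G heads of each group."""
    Lseq, H, P = X.shape
    _, G, N = B.shape
    Y = np.zeros_like(X)
    for h in range(H):
        g = h * G // H                     # which group's B/C this head uses
        S = np.zeros((N, P))               # per-head state: N x P "state expansion"
        for t in range(Lseq):
            S = a[t, h] * S + np.outer(B[t, g], X[t, h])
            Y[t, h] = C[t, g] @ S
    return Y

rng = np.random.default_rng(0)
Lseq, H, P, G, N = 16, 4, 8, 2, 6
X = rng.standard_normal((Lseq, H, P))
a = rng.uniform(0.8, 1.0, (Lseq, H))
B = rng.standard_normal((Lseq, G, N))
C = rng.standard_normal((Lseq, G, N))
Y = multihead_ssd(X, a, B, C)              # (L, H, P): each head mixed independently along time
```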
Mamba-2 Architecture Performance
Experiments show that the Mamba-2 architecture Pareto dominates both the original Mamba and Transformer++ models in terms of perplexity and wall-clock training time. Mamba-2 also matches or outperforms Mamba and open-source Transformer models on standard language modeling and downstream evaluations, when scaled to similar model sizes.
Finally, the paper describes how the SSD framework enables leveraging a rich set of systems optimizations originally developed for Transformers, such as tensor parallelism, sequence parallelism, and efficient handling of variable-length inputs. These techniques allow Mamba-2 to scale to large models and datasets as efficiently as Transformer-based models.
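As one concrete example of the variable-length handling (a sketch under the scalar-decay assumption; illustrative code), several sequences can be packed into one long sequence and kept independent by zeroing the decay at each boundary, which resets the recurrent state so no information leaks across documents.

```python
import numpy as np

def scalar_ssd(a, B, C, x):
    """h_t = a_t h_{t-1} + B_t x_t, y_t = <C_t, h_t>."""
    Lseq, N = B.shape
    h, y = np.zeros(N), np.zeros(Lseq)
    for t in range(Lseq):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

rng = np.random.default_rng(0)
N = 4
lengths = [5, 3, 7]                              # three packed sequences
Lseq = sum(lengths)
a = rng.uniform(0.8, 1.0, Lseq)
B, C = rng.standard_normal((Lseq, N)), rng.standard_normal((Lseq, N))
x = rng.standard_normal(Lseq)

# zero the decay at each sequence boundary so the state resets between documents
starts = np.cumsum([0] + lengths[:-1])
a_packed = a.copy()
a_packed[starts] = 0.0
y_packed = scalar_ssd(a_packed, B, C, x)

# equivalent to running each sequence separately
y_split = np.concatenate([scalar_ssd(a[s:s+l], B[s:s+l], C[s:s+l], x[s:s+l])
                          for s, l in zip(starts, lengths)])
assert np.allclose(y_packed, y_split)
```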
Reference: https://arxiv.org/abs/2405.21060