Key Points
1. We propose TTT layers, a new class of sequence modeling layers where the hidden state is a model and the update rule is a step of self-supervised learning. This perspective, in which the forward pass of a layer itself contains a training loop, opens up a new direction for future research (see the sketch after this list).
2. TTT-Linear, a simple instantiation of TTT layers, outperforms Transformers and Mamba in our evaluations at scales ranging from 125M to 1.3B parameters.
3. We improve the hardware efficiency of TTT layers through mini-batch TTT and the dual form, making TTT-Linear already a practical building block for LLMs.
4. The TTT layer with a linear model and batch gradient descent is equivalent to linear attention, a widely known RNN layer.
5. The TTT layer with the Nadaraya-Watson estimator is equivalent to self-attention.
6. TTT-Linear and TTT-MLP outperform Mamba in long context, with the advantage widening as context length grows.
7. When context length is treated as a hyperparameter, TTT-Linear and TTT-MLP achieve the best overall performance.
8. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time.
9. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
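To make point 1 concrete, the following is a minimal NumPy sketch of a TTT layer with a linear inner model. It is not the authors' implementation: the function name `ttt_linear_layer`, the step size `eta`, and the projection matrices `theta_K`, `theta_V`, `theta_Q` are illustrative stand-ins for the layer's training and test views. The hidden state is the inner model's weight matrix `W`; each incoming token supplies one self-supervised (reconstruction-style) training example, the update rule is a single gradient step, and the output token is the freshly updated model applied to a query view of the same input.

```python
import numpy as np

def ttt_linear_layer(X, theta_K, theta_V, theta_Q, eta=0.1):
    """Minimal sketch of a TTT layer whose hidden state is a linear model.

    X: (T, d) sequence of input tokens.
    theta_K, theta_V, theta_Q: (d, d) projections producing the training
    views (input and target of the inner task) and the test view.
    """
    T, d = X.shape
    W = np.zeros((d, d))                      # hidden state = inner model's weights
    outputs = np.zeros((T, d))
    for t in range(T):
        k = theta_K @ X[t]                    # training input for the inner task
        v = theta_V @ X[t]                    # self-supervised target
        q = theta_Q @ X[t]                    # test view used to produce the output
        grad = 2.0 * np.outer(W @ k - v, k)   # gradient of ||W k - v||^2 w.r.t. W
        W = W - eta * grad                    # update rule = one step of online GD
        outputs[t] = W @ q                    # output rule = updated model on the query
    return outputs

# Example usage with random data:
# Z = ttt_linear_layer(np.random.randn(16, 8),
#                      *(np.random.randn(8, 8) for _ in range(3)))
```

TTT-MLP keeps the same outer structure and replaces the linear inner model with a small MLP, so the gradient step updates all of the MLP's parameters; a sketch of the two inner models appears in the Summary below.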
Summary
This paper proposes a new class of sequence modeling layers called Test-Time Training (TTT) layers, which have linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. The authors introduce two instantiations - TTT-Linear and TTT-MLP - whose hidden states are a linear model and a two-layer MLP, respectively, both updated by test-time training.
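As a rough illustration of the difference between the two instantiations (not the paper's exact definitions; the full instantiations include additional stabilizing components that are omitted here, and the 4x hidden width below is illustrative), the inner model can be pictured as follows: TTT-Linear's hidden state is a single weight matrix, while TTT-MLP's hidden state is the parameter set of a two-layer MLP.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

# TTT-Linear: the hidden state is one weight matrix W.
def f_linear(hidden_state, x):
    (W,) = hidden_state
    return W @ x

# TTT-MLP: the hidden state is the parameters of a two-layer MLP.
def f_mlp(hidden_state, x):
    W1, W2 = hidden_state                     # W1: (4d, d), W2: (d, 4d)
    return W2 @ gelu(W1 @ x)
```

Either inner model can be dropped into the per-token loop sketched above; the inner-loop gradient is then taken with respect to all of the hidden state's parameters rather than a single matrix.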
The paper evaluates TTT-Linear and TTT-MLP at scales from 125M to 1.3B parameters and compares them to a strong Transformer baseline and Mamba, a modern RNN. The results show that both TTT-Linear and TTT-MLP match or exceed the baselines. TTT-Linear is already faster than the Transformer at 8k context and matches Mamba in wall-clock time, while TTT-MLP shows potential in long context.
The authors propose two practical innovations to make TTT layers efficient in wall-clock time - mini-batch TTT, which parallelizes the inner-loop updates within each mini-batch of tokens, and a dual form that expresses the operations inside each mini-batch as matrix-matrix multiplications. The paper also shows that the TTT layer with a linear model and batch gradient descent is equivalent to linear attention, a widely known RNN layer.
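The sketch below (again NumPy, with hypothetical names) illustrates the mini-batch idea for the linear inner model: within a mini-batch of b tokens, every inner-loop gradient is taken at the weights frozen at the start of the batch, so the b updates and b outputs can be written as a few matrix-matrix products instead of b sequential outer-product updates. This is a simplified, linear-model-only rendering of the idea behind the dual form, not the paper's general algorithm; it also makes the linear-attention equivalence easy to see.

```python
import numpy as np

def ttt_linear_minibatch(X, theta_K, theta_V, theta_Q, eta=0.1, b=4):
    """Mini-batch TTT sketch with a linear inner model.

    Within each mini-batch of b tokens, all gradients are taken at W0,
    the hidden state frozen at the start of the batch, so the per-token
    updates and outputs become matrix-matrix products that parallelize
    over the batch.
    """
    T, d = X.shape
    K, V, Q = X @ theta_K.T, X @ theta_V.T, X @ theta_Q.T   # (T, d) views
    W = np.zeros((d, d))
    outputs = np.zeros((T, d))
    for start in range(0, T, b):
        Kb, Vb, Qb = K[start:start+b], V[start:start+b], Q[start:start+b]
        # Residuals at W0; the per-token gradient is G_s = e_s k_s^T.
        E = 2.0 * (Kb @ W.T - Vb)                 # rows e_s, shape (b, d)
        # z_t = (W0 - eta * sum_{s<=t} G_s) q_t, computed for the whole
        # batch at once with a causal (lower-triangular) mask.
        A = np.tril(Qb @ Kb.T)                    # A[t, s] = q_t . k_s for s <= t
        outputs[start:start+b] = Qb @ W.T - eta * (A @ E)
        # Advance the hidden state by the batch's summed gradient.
        W = W - eta * (E.T @ Kb)
    return outputs

# With one batch spanning the whole sequence (b = T) and W initialized to
# zero, outputs[t] = 2 * eta * sum_{s<=t} (k_s . q_t) v_s, i.e. a scaled,
# unnormalized form of linear attention, matching the equivalence above.
```

Smaller b keeps the inner loop closer to token-by-token online gradient descent, while larger b exposes more parallelism, so the mini-batch size acts as a quality/efficiency trade-off.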
The paper suggests that the TTT framework opens up a new direction for future research, since it reformulates supervised learning as learning to learn with two nested loops: an outer loop that trains the network, including the parameters of the self-supervised task, and an inner loop that updates the hidden state at test time. The authors outline promising directions for future work, including more sophisticated instantiations of the self-supervised task, further systems optimization, and scaling to longer context and larger models.
Reference: https://arxiv.org/abs/2407.04620