Key Points

1. The article proposes that transformer decoders can be conceptualized as infinite multi-state RNNs (MSRNNs), an RNN variant whose hidden state grows without bound (a conceptual sketch of this view follows the list).

2. It introduces TOVA (Token Omission Via Attention), a compression policy that outperforms existing policies and can reduce the memory consumption of the cache during inference by up to 88%.

3. The study demonstrates that pretrained transformer LLMs often behave in practice as finite MSRNNs, despite being trained as infinite MSRNNs.

4. The authors conducted experiments with several long-range tasks, including language modeling, long-range understanding, and text generation, to evaluate the performance of the proposed TOVA policy.

5. Results show that TOVA outperforms other policies on long-range understanding tasks and language modeling while using only a fraction of the original cache size.

6. The study analyzes the importance of different tokens in memory and highlights the significance of the very first token in the sequence.

7. The article suggests that the TOVA policy enables larger batch sizes by compressing the cached key/value (K, V) matrices during transformer autoregressive decoding.

8. The work places TOVA in the context of related research on hybrid transformer-RNN approaches, new RNN variants, and simplifying transformers.

9. The study acknowledges potential limitations in evaluating models on long text generation and emphasizes the practical value of reducing the memory footprint of transformer LLMs.
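
The MSRNN framing in points 1 and 3 can be summarized in a few lines of Python. This is a conceptual sketch only, not the paper's code: the per-layer key/value cache plays the role of the multi-state, an unbounded cache corresponds to an infinite MSRNN, and capping it with an eviction policy yields a finite MSRNN. The names below are illustrative.

    def msrnn_update(state, token_kv, budget=None, policy=None):
        """One decoding step viewed as a multi-state RNN update (conceptual sketch).

        state    : list of cached (key, value) pairs -- the "multi-state"
        token_kv : (key, value) pair computed for the newly generated token
        budget   : None -> infinite MSRNN (the state grows without bound);
                   an int -> finite MSRNN (the state is capped at `budget` entries)
        policy   : rule deciding which cached entries to keep when over budget
        """
        state = state + [token_kv]
        if budget is not None and len(state) > budget:
            state = policy(state, budget)
        return state

    # Simplest finite-MSRNN baseline: a sliding window over the most recent tokens.
    window_policy = lambda state, budget: state[-budget:]

TOVA replaces the sliding-window rule with an attention-based selection, sketched in the summary below.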

Summary

Relationship between Transformers and Recurrent Neural Networks (RNNs)
The paper explores the relationship between transformers and recurrent neural networks (RNNs), focusing on the autoregressive behavior of transformer decoders and its alignment with the core principle of RNNs. The authors demonstrate that decoder-only transformers can be conceptualized as infinite MSRNNs and can be converted into finite MSRNNs by fixing the size of their hidden state. They introduce the TOVA compression policy, which selects the tokens to keep in the multi-state based on their attention scores, and evaluate it on long-range tasks. TOVA outperforms all other baseline policies and shows minimal performance degradation compared to the full (infinite) model, in some cases using only 1/8 of the original cache size. The study reveals that transformer decoder models often behave as finite MSRNNs in practice, which allows substantial reductions in memory consumption during inference.
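
To make the policy concrete, here is a minimal sketch of a TOVA-style eviction step for a single layer and sequence, assuming access to the softmax-normalized attention weights of the newly generated token over the cached positions; the function and argument names are illustrative and not taken from the paper's code.

    import numpy as np

    def tova_evict(keys, values, attn_weights, budget):
        """One TOVA-style eviction step for a single layer and sequence (sketch).

        keys, values : cached state, shape (num_heads, num_tokens, head_dim)
        attn_weights : attention of the newly generated token over the cached
                       tokens, shape (num_heads, num_tokens), softmax-normalized
        budget       : maximum number of tokens kept in the multi-state
        """
        num_tokens = keys.shape[1]
        if num_tokens <= budget:
            return keys, values
        # Average the new token's attention over heads and drop the cached token
        # with the lowest score; tokens that keep receiving high attention (the
        # paper highlights the very first token) tend to survive many steps.
        scores = attn_weights.mean(axis=0)
        keep = np.delete(np.arange(num_tokens), int(np.argmin(scores)))
        return keys[:, keep], values[:, keep]

In an actual decoder this step would be applied layer by layer at every generation step once the cache reaches the budget.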

The paper presents results on various long-range tasks, including language modeling, long-range understanding, and long text generation, and demonstrates the practical benefits of converting pretrained transformers into finite MSRNNs, most notably a significant reduction in memory consumption during inference. It also sheds light on the behavior of transformer decoder models, analyzing which tokens are important to keep in memory and which are frequently retained or dropped during decoding. These reductions in memory consumption offer potential benefits for users with limited hardware access.
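
To give a sense of scale, here is a back-of-the-envelope calculation of the cached key/value memory per sequence, using a hypothetical 7B-scale decoder configuration (32 layers, 32 heads, head dimension 128, fp16); the numbers are illustrative and not taken from the paper.

    # KV-cache memory per sequence for a hypothetical decoder configuration.
    layers, heads, head_dim, bytes_per_value = 32, 32, 128, 2   # fp16
    context_tokens, kept_fraction = 4096, 1 / 8                 # keep 1/8 of the cache

    per_token = 2 * layers * heads * head_dim * bytes_per_value  # K and V: 512 KiB/token
    full_cache = per_token * context_tokens                      # ~2 GiB per sequence
    compressed = int(full_cache * kept_fraction)                 # ~256 MiB per sequence

Freeing most of that per-sequence memory is what allows the larger inference batch sizes mentioned in the key points.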

Practical Value of Reducing Memory Footprint
The paper builds on prior research that has tried to bridge the gap between RNNs and transformers and highlights the practical value of reducing the memory footprint of transformer LLMs, which could increase their adoption by users with limited hardware access. It also acknowledges limitations, such as the computational expense of evaluating models on long text generation and the possibility that attention behaves differently for languages with more flexible word order. Overall, the research contributes a practical approach to reducing memory consumption in transformer models and a useful lens for understanding their behavior as finite MSRNNs in practice.

Reference: https://arxiv.org/abs/2401.06104