Key Points

1. The paper shows that attention can be viewed as a special Recurrent Neural Network (RNN) whose many-to-one RNN output can be computed efficiently.

2. The paper demonstrates that popular attention-based models such as Transformers and Perceivers can be viewed as RNN variants.

3. Unlike traditional RNNs (e.g., LSTMs), these attention-based models cannot be efficiently updated with new tokens, an ability that is important in sequence modeling.

4. The paper introduces a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm.

5. Building on the new attention formulation, the paper introduces Aaren, an attention-based module that can be trained in parallel (like Transformers) and efficiently updated with new tokens, requiring only constant memory for inference (like traditional RNNs).

6. Empirically, the paper shows that Aarens achieve performance comparable to Transformers on 38 datasets across four popular sequential problem settings (reinforcement learning, event forecasting, time series classification, and time series forecasting) while being more time- and memory-efficient.

7. The paper discusses the close relationship between Aaren and related efficient attention models such as RWKV, RetNet, and Linear Transformer, which aim to linearize standard softmax-based attention.

8. The paper highlights the use of the parallel prefix scan algorithm to efficiently compute attention's many-to-many RNN output, and notes that several efficient parallel algorithms exist for this problem.

9. The paper acknowledges that Aarens' attention queries are input-independent, unlike those of Transformers, which could be a limitation in settings that require large, highly expressive sequence models.

Summary

The Emergence of Transformers
The advent of Transformers marked a significant breakthrough in sequence modeling, providing a highly performant architecture capable of leveraging GPU parallelism. However, Transformers are computationally expensive at inference time, limiting their applications, particularly in low-resource settings. To address this, the researchers show that attention can be viewed as a special Recurrent Neural Network (RNN) whose many-to-one RNN output can be computed efficiently. They then demonstrate that popular attention-based models such as Transformers can be viewed as RNN variants.
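
To make the "attention as an RNN" view concrete, the following is a minimal sketch (Python/NumPy, with illustrative names not taken from the paper) of how softmax attention for a single query can be computed one token at a time, carrying only a running numerator, denominator, and max score for numerical stability. This corresponds to the many-to-one RNN output: the final result equals the usual softmax-weighted sum over all values, but no n-by-n attention matrix is ever materialized.

```python
import numpy as np

def attention_as_rnn(q, keys, values):
    # Softmax attention for a single query, computed as a recurrence over
    # (key, value) pairs: only a running numerator `a`, denominator `c`,
    # and max score `m` (for numerical stability) are carried forward.
    a = np.zeros_like(values[0], dtype=float)  # running numerator
    c = 0.0                                    # running denominator
    m = -np.inf                                # running max score
    for k, v in zip(keys, values):
        s = float(q @ k)                       # attention score q . k_i
        m_new = max(m, s)
        scale = np.exp(m - m_new)              # rescale old state to new max
        a = a * scale + v * np.exp(s - m_new)
        c = c * scale + np.exp(s - m_new)
        m = m_new
    return a / c  # equals softmax(q @ K.T) @ V, computed in constant memory
```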

Efficient Modeling with Attention-Based Models
However, unlike traditional RNNs (e.g., LSTMs), these attention-based models cannot be efficiently updated with new tokens, an ability that is important in sequence modeling. To tackle this, the researchers introduce a new efficient method of computing attention's many-to-many RNN output based on the parallel prefix scan algorithm. Building on this, they introduce Aaren, an attention-based module that can not only be trained in parallel (like Transformers) but also be updated efficiently with new tokens, requiring only constant memory for inference (like traditional RNNs).
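
The key observation is that the running (max, numerator, denominator) state can be merged with an associative operator, which is exactly the property a parallel prefix scan needs. The sketch below (Python/NumPy, illustrative names; a plain left-to-right loop stands in for an actual parallel scan implementation, and a single shared query is assumed, matching the input-independent-query setting noted in the key points) shows the merge operator and how all prefix outputs follow from it.

```python
import numpy as np

def combine(a, b):
    # Associative merge of two (max score, numerator, denominator) states.
    # Associativity is what allows a parallel prefix scan to compute all
    # prefix states in O(log n) parallel steps.
    m_a, u_a, w_a = a
    m_b, u_b, w_b = b
    m = max(m_a, m_b)
    u = u_a * np.exp(m_a - m) + u_b * np.exp(m_b - m)
    w = w_a * np.exp(m_a - m) + w_b * np.exp(m_b - m)
    return m, u, w

def prefix_attention_outputs(q, keys, values):
    # Many-to-many RNN view: return o_1, ..., o_n, where o_t attends over
    # the first t tokens. A sequential scan is used here for clarity;
    # a parallel prefix scan applies the same `combine` operator.
    leaves = [(float(q @ k), np.asarray(v, dtype=float), 1.0)
              for k, v in zip(keys, values)]
    acc = leaves[0]
    outputs = [acc[1] / acc[2]]
    for leaf in leaves[1:]:
        acc = combine(acc, leaf)
        outputs.append(acc[1] / acc[2])
    return outputs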

Performance and Efficiency of Aarens
Empirically, the researchers show that Aarens achieve performance comparable to Transformers on 38 datasets spread across four popular sequential problem settings: reinforcement learning, event forecasting, time series classification, and time series forecasting. Crucially, Aarens are more time- and memory-efficient than Transformers, which makes them particularly well-suited for low-resource domains, such as battery-powered devices, where computational efficiency is essential.

Addressing Computational Complexity with Aaren
The researchers begin by examining attention, the component that contributes to Transformers' quadratic computational complexity. They show that attention can be viewed as a special RNN, and that popular attention-based models can also be formulated as RNNs. However, unlike traditional RNNs, these attention-based models cannot be efficiently updated with new tokens, limiting their potential in sequential problem settings. To address this, the researchers introduce a new efficient method of computing attention's many-to-many RNN output using the parallel prefix scan algorithm. Building on this, they propose Aaren, a computationally efficient module that can be trained in parallel like Transformers and updated efficiently with new tokens like traditional RNNs.
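
As an illustration of the constant-memory update property, assuming an input-independent query (in Aaren the query would be a learned parameter), a streaming cell only needs a fixed-size (max, numerator, denominator) state regardless of sequence length. The class below is a hypothetical sketch of such a cell, not the paper's full Aaren module.

```python
import numpy as np

class StreamingAttentionCell:
    # Hypothetical constant-memory, RNN-style attention cell with an
    # input-independent query. Each new token updates a fixed-size
    # (max, numerator, denominator) state, so memory does not grow
    # with the number of tokens seen.

    def __init__(self, query):
        self.q = np.asarray(query, dtype=float)  # input-independent query
        self.m = -np.inf                         # running max score
        self.u = 0.0                             # running numerator
        self.w = 0.0                             # running denominator

    def update(self, key, value):
        # Incorporate one new (key, value) pair and return the updated output.
        s = float(self.q @ np.asarray(key, dtype=float))
        m_new = max(self.m, s)
        scale = np.exp(self.m - m_new)           # rescale old state to new max
        self.u = self.u * scale + np.asarray(value, dtype=float) * np.exp(s - m_new)
        self.w = self.w * scale + np.exp(s - m_new)
        self.m = m_new
        return self.u / self.w                   # attention output after this token
```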

Reference: https://arxiv.org/abs/2405.139...