Key Points

1. The scalability limitations of Transformers regarding sequence length have renewed interest in recurrent sequence models that are parallelizable during training.

2. Many new recurrent architectures, such as S4, Mamba, and Aaren, have been proposed and achieve performance comparable to Transformers.

3. The authors revisit traditional recurrent neural networks (RNNs) like LSTMs and GRUs, which were previously considered too slow due to requiring backpropagation through time (BPTT).

4. By removing the hidden state dependencies from the input, forget, and update gates, LSTMs and GRUs no longer need BPTT and can be trained efficiently in parallel (see the sketch after this list).

5. The authors introduce minimal versions of LSTMs and GRUs (minLSTMs and minGRUs) that use significantly fewer parameters and are fully parallelizable during training, achieving roughly 175x (minGRU) and 235x (minLSTM) speedups per training step for a sequence length of 512.

6. The stripped-down versions of decade-old RNNs match the empirical performance of recent sequence models on tasks like the Selective Copying task, reinforcement learning, and language modeling.

7. Many recent recurrent models, such as Mamba, as well as attention-based models like Aaren, can be trained efficiently using the parallel prefix scan algorithm, which the authors show also applies to their minimal RNN variants.

8. Compared with many other recurrent models, the minimal RNNs are stable to train and their outputs are time-independent in scale, since each step mixes the previous hidden state and the candidate state with gates that sum to one.

9. The strong empirical performance of these simplified RNNs, and their fundamental similarities with many recently proposed recurrent sequence methods, lead the authors to ask: "Were RNNs all we needed?"
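
To make points 4, 5, and 8 concrete, here is a minimal sketch of a single minGRU and minLSTM step in PyTorch, following the formulation summarized above. The class and layer names are illustrative choices for this summary, not the authors' reference code.

    import torch
    import torch.nn as nn

    class MinGRUCell(nn.Module):
        """One minGRU step: the update gate and the candidate depend only on x_t, never on h_{t-1}."""
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.linear_z = nn.Linear(input_size, hidden_size)  # update gate
            self.linear_h = nn.Linear(input_size, hidden_size)  # candidate hidden state

        def forward(self, x_t, h_prev):
            z_t = torch.sigmoid(self.linear_z(x_t))    # no recurrence inside the gate
            h_tilde = self.linear_h(x_t)               # no recurrence in the candidate
            return (1 - z_t) * h_prev + z_t * h_tilde  # convex mix of old state and candidate

    class MinLSTMCell(nn.Module):
        """One minLSTM step: the forget and input gates also depend only on x_t."""
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.linear_f = nn.Linear(input_size, hidden_size)  # forget gate
            self.linear_i = nn.Linear(input_size, hidden_size)  # input gate
            self.linear_h = nn.Linear(input_size, hidden_size)  # candidate hidden state

        def forward(self, x_t, h_prev):
            f_t = torch.sigmoid(self.linear_f(x_t))
            i_t = torch.sigmoid(self.linear_i(x_t))
            h_tilde = self.linear_h(x_t)
            # Normalizing the gates so they sum to one keeps the hidden state's scale
            # time-independent (point 8).
            denom = f_t + i_t
            return (f_t / denom) * h_prev + (i_t / denom) * h_tilde

Because h_{t-1} no longer appears inside any gate, each step is an elementwise linear function of the previous hidden state with coefficients computed purely from the inputs, which is what makes the parallel training in point 7 possible.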

Summary

The paper revisits traditional recurrent neural network (RNN) models such as LSTMs and GRUs, which had largely been set aside because backpropagation through time (BPTT) scales poorly with sequence length during training. The researchers show that by removing the hidden state dependencies from the input, forget, and update gates of LSTMs and GRUs, these models no longer need BPTT: the recurrence becomes an elementwise linear function of the previous hidden state whose coefficients depend only on the inputs, so all hidden states can be computed at once with a parallel prefix scan.

Introduction of Minimal Versions

Building on this, the researchers introduce minimal versions of LSTMs and GRUs, called minLSTMs and minGRUs. These stripped-down models use significantly fewer parameters than their traditional counterparts and are fully parallelizable during training. For example, the researchers found that minGRUs and minLSTMs were 175x and 235x faster per training step than regular GRUs and LSTMs on a T4 GPU for a sequence length of 512.
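
The speedup comes from replacing the step-by-step recurrence with a parallel prefix scan. Below is an illustrative sketch, not the authors' implementation: the function names, the choice of a Hillis-Steele style scan, and the assumption h_0 = 0 are made here for clarity, and a numerically more careful (e.g., log-space) scan would be preferable in practice.

    import torch
    import torch.nn.functional as F

    def parallel_linear_scan(a, b):
        """Solve h_t = a_t * h_{t-1} + b_t for t = 1..T with h_0 = 0, for all t at once.

        a, b: tensors of shape (batch, T, hidden). Uses a Hillis-Steele inclusive scan
        over the associative operator (A1, B1) o (A2, B2) = (A1 * A2, A2 * B1 + B2).
        """
        T = a.size(1)
        step = 1
        while step < T:
            # Shift the running (a, b) pairs right by `step`; positions with no left
            # neighbour are combined with the identity (1, 0), i.e. left unchanged.
            a_prev = F.pad(a[:, :-step], (0, 0, step, 0), value=1.0)
            b_prev = F.pad(b[:, :-step], (0, 0, step, 0), value=0.0)
            a_new = a_prev * a
            b_new = a * b_prev + b
            a, b = a_new, b_new
            step *= 2
        return b  # b[:, t] is the hidden state after consuming inputs 0..t

    def mingru_parallel_forward(x, linear_z, linear_h):
        """Training-mode minGRU pass over a whole sequence x of shape (batch, T, input_size)."""
        z = torch.sigmoid(linear_z(x))  # every gate computed in one batched pass
        h_tilde = linear_h(x)
        return parallel_linear_scan(1.0 - z, z * h_tilde)

Because the gate values never look at earlier hidden states, they can all be produced in a single batched pass and then combined by the scan in O(log T) parallel steps instead of T strictly sequential ones.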

Performance of Minimal Versions

Despite these efficiency gains, the paper shows that the minimal versions of LSTMs and GRUs can match the empirical performance of more recently proposed sequence models such as Mamba, S4, H3, and Hyena on tasks like the Selective Copying task and reinforcement learning on the D4RL benchmark. The authors find that minGRU and minLSTM solve the Selective Copying task, achieving results comparable to Mamba's S6 layer while outperforming the other baselines.

Performance on Reinforcement Learning Tasks

On the D4RL reinforcement learning tasks, minLSTM and minGRU outperform Decision S4 and perform comparably to Decision Transformer, Decision Aaren, and Decision Mamba. The authors also note that the core recurrent component of these recently proposed sequence models is remarkably similar, with many of them relying on parallel scan algorithms for efficient training.
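
As a rough illustration of that shared structure, based on the descriptions above rather than on any of the models' actual code, and reusing the hypothetical parallel_linear_scan from the earlier sketch, the recurrences can all be written as the same elementwise linear update driven by input-computed coefficients:

    import torch

    # Assume the gate tensors have already been computed from the inputs alone, as in the
    # earlier sketches; random values of shape (batch, T, hidden) stand in for them here.
    batch, T, hidden = 2, 16, 8
    z = torch.rand(batch, T, hidden)        # minGRU update gate
    f = torch.rand(batch, T, hidden)        # minLSTM forget gate
    i = torch.rand(batch, T, hidden)        # minLSTM input gate
    h_tilde = torch.rand(batch, T, hidden)  # candidate hidden state

    # All fit the common form h_t = a_t * h_{t-1} + b_t, so one scan routine serves them all:
    h_mingru = parallel_linear_scan(1 - z, z * h_tilde)
    h_minlstm = parallel_linear_scan(f / (f + i), (i / (f + i)) * h_tilde)
    # Mamba's S6 recurrence has the same shape, with a_t and b_t given by input-dependent
    # discretized state-space coefficients, which is why it too is trained with a parallel scan.

The models then differ mainly in how the coefficients a_t and b_t are produced from the inputs, not in the recurrence that the scan evaluates.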

Conclusion and Future Directions

The paper concludes by asking "Were RNNs all we needed?", given the strong empirical performance of these simplified, decade-old RNNs relative to more recently proposed sequence models. The authors suggest that the fundamental similarities between minimal RNNs and state-of-the-art methods warrant further investigation into the role of traditional RNNs in modern sequence modeling.

Reference: https://arxiv.org/abs/2410.01201