Key Points
1. The paper introduces two models built around a novel gated linear recurrent layer, the RG-LRU: Hawk, which uses the recurrent layer alone, and Griffin, which mixes the RG-LRU with local attention. Both demonstrate strong language-modeling performance with power-law scaling between loss and training compute, with Hawk exceeding the reported downstream performance of Mamba and Griffin matching Llama-2 despite being trained on far fewer tokens.
2. Hawk and Griffin are more hardware-efficient than Transformer baselines during inference, with lower latency and significantly higher throughput. They also extrapolate to sequences longer than those seen during training.
3. The models are capable of efficiently learning to copy and retrieve data over long horizons, with Griffin performing slightly better than Hawk and significantly better than Transformer baselines on these tasks.
4. The research addresses the challenge of training recurrent models efficiently at scale, leveraging model parallelism across devices and a specialized implementation of the linear recurrence on TPU-v3 (see the scan sketch after this list).
5. The models achieve competitive training efficiency and performance compared to state-of-the-art SSMs and linear attention algorithms, demonstrating their potential as a powerful and efficient alternative to Transformers with global attention.
6. The paper details the models' structure, including the residual block, the MLP block, and the temporal-mixing blocks, as well as the architecture and computational efficiency of the RG-LRU layer and its gating mechanism; a minimal sketch of the layer follows this list.
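For concreteness, here is a minimal JAX sketch of a gated diagonal linear recurrence in the spirit of the RG-LRU referenced in points 1 and 6. The parameter names, shapes, and the exact gating and normalization details are assumptions made for illustration and may differ from the authors' implementation; see the paper for the precise formulation.

# Illustrative RG-LRU-style gated linear recurrence (sketch, not the authors' code).
# Assumed shapes: x is (seq_len, dim); w_a, w_x are (dim, dim); b_a, b_x, log_lambda are (dim,).
import jax
import jax.numpy as jnp

def rg_lru_sketch(x, w_a, b_a, w_x, b_x, log_lambda, c=8.0):
    r = jax.nn.sigmoid(x @ w_a + b_a)                  # recurrence gate in (0, 1)
    i = jax.nn.sigmoid(x @ w_x + b_x)                  # input gate in (0, 1)
    # Per-channel decay a_t = sigmoid(log_lambda) ** (c * r_t), computed in log
    # space for numerical stability: log sigmoid(z) = -softplus(-z).
    log_a = -c * r * jax.nn.softplus(-log_lambda)
    a = jnp.exp(log_a)
    # Scale the gated input so the hidden state stays roughly unit-variance.
    u = jnp.sqrt(jnp.clip(1.0 - a**2, 0.0, 1.0)) * (i * x)

    def step(h_prev, inputs):
        a_t, u_t = inputs
        h_t = a_t * h_prev + u_t                       # diagonal linear recurrence
        return h_t, h_t

    _, h = jax.lax.scan(step, jnp.zeros(x.shape[-1]), (a, u))
    return h                                           # (seq_len, dim) hidden states

In Hawk and Griffin this kind of layer sits inside a residual block alongside an MLP block; Griffin's temporal-mixing blocks alternate between the recurrence and local attention.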
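Point 4's "specialized implementations for efficient linear recurrences" can be illustrated with the standard trick of rewriting the diagonal recurrence h_t = a_t * h_{t-1} + u_t as an associative scan, which parallelizes across the sequence on accelerators. The snippet below is a generic sketch of that technique, not the authors' TPU-v3 kernel.

# Parallel formulation of h_t = a_t * h_{t-1} + u_t via an associative scan.
# a and u have shape (seq_len, dim); the initial state h_{-1} is zero.
import jax

def combine(left, right):
    a_l, u_l = left
    a_r, u_r = right
    # Composing two recurrence steps: apply (a_l, u_l) first, then (a_r, u_r).
    return a_l * a_r, a_r * u_l + u_r

def linear_recurrence_parallel(a, u):
    # The second component of the scan output is exactly h_t at every position.
    _, h = jax.lax.associative_scan(combine, (a, u))
    return h

As a sanity check, for the first two steps the scan gives h_0 = u_0 and h_1 = a_1 * u_0 + u_1, which matches the sequential recurrence; the same identity holds by induction for longer sequences.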
Summary
The research paper introduces two novel recurrent models, Hawk and Griffin, which demonstrate strong language-modeling performance. Hawk, built on a new gated linear recurrent layer called the RG-LRU, surpasses the reported performance of Mamba on downstream tasks even when trained on fewer tokens. Griffin, a hybrid model combining the RG-LRU layer with local attention, matches the performance of Llama-2 despite being trained on significantly fewer tokens.
Both models exhibit power-law scaling between held-out loss and training FLOPs. They show hardware efficiency comparable to Transformers during training, and lower latency with significantly higher throughput during inference. Furthermore, Griffin scales up to 14B parameters with efficient distributed training, and both models extrapolate to sequences longer than those seen in training and efficiently learn copying and retrieval tasks. The paper also provides an empirical comparison of the models' training speed, inference behavior, and performance on tasks requiring copying and retrieval.
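"Power-law scaling" here means the held-out loss falls roughly as a power of training compute. The generic functional form is

L(C) \approx a \cdot C^{-b}, \qquad b > 0,

where C is training FLOPs and a, b are fitted constants; the specific fits are empirical results in the paper and are not reproduced here.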
Overall, the findings of the study suggest that Hawk and Griffin offer a powerful and efficient alternative to Transformers with global attention, showcasing their potential for advanced language modeling applications.
Reference: https://arxiv.org/abs/2402.194...