The authors of the paper propose a new architecture, Retentive Network (RetNet), for large language models. RetNet addresses the challenge of simultaneously achieving training parallelism, low-cost inference, and strong performance.

To achieve this, the authors introduce a retention mechanism that supports three computation paradigms: parallel, recurrent, and chunkwise recurrent. The parallel representation enables training parallelism and makes efficient use of GPU devices. The recurrent representation enables low-cost inference with O(1) per-step complexity, reducing memory consumption and inference latency without sacrificing performance. The chunkwise recurrent representation supports efficient long-sequence modeling by splitting the input sequence into chunks, computing within each chunk in parallel, and passing information across chunks recurrently.
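As a concrete illustration (not taken from the paper's code), the sketch below shows the core idea behind the parallel and recurrent forms of retention for a single head, ignoring the rotation, multi-scale decay, and normalization details used in the paper: the parallel form applies a causal decay mask to the query-key scores, while the recurrent form folds the same decay into a fixed-size state. The function names and toy dimensions here are hypothetical.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T * D) V, where D[n, m] = gamma**(n - m) if n >= m, else 0."""
    T = Q.shape[0]
    idx = np.arange(T)
    D = np.where(idx[:, None] >= idx[None, :],
                 gamma ** (idx[:, None] - idx[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n (O(1) state per step)."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))          # fixed-size recurrent state
    outputs = []
    for q_n, k_n, v_n in zip(Q, K, V):
        S = gamma * S + np.outer(k_n, v_n)
        outputs.append(q_n @ S)
    return np.stack(outputs)

# On toy data the two computation paths agree, which is what lets parallel
# training and recurrent (cache-free) inference coexist in one model.
rng = np.random.default_rng(0)
T, d = 8, 4
Q, K, V = rng.normal(size=(3, T, d))
gamma = 0.9
assert np.allclose(retention_parallel(Q, K, V, gamma),
                   retention_recurrent(Q, K, V, gamma))
```

The chunkwise recurrent form combines these two paths: tokens within a chunk are processed with the parallel form, while a state like S above carries information from one chunk to the next.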

Experimental results on language modeling show that RetNet achieves favorable scaling behavior, parallel training, low-cost deployment, and efficient inference. During inference, RetNet outperforms Transformers in memory consumption, throughput, and latency. Its memory cost stays roughly constant even for long sequences, requiring significantly less GPU memory than Transformers with key-value caches. Its throughput is higher and remains stable as the decoding length grows, thanks to the recurrent representation. Its decoding latency is also lower than that of Transformers and stays nearly constant across batch sizes and input lengths.
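To make the constant-memory point concrete, here is a back-of-the-envelope sketch (not from the paper) comparing the per-head decoding memory of a growing key-value cache with a fixed-size retention state. The dimensions are hypothetical, and a real comparison would also account for layers, heads, and batch size.

```python
# Hypothetical single-head sizes; real models multiply these by layers,
# heads, and batch size, which is omitted here for clarity.
d_k, d_v, bytes_per_float = 64, 64, 4

for seq_len in (1_024, 8_192, 65_536):
    kv_cache_bytes = seq_len * (d_k + d_v) * bytes_per_float  # grows with sequence length
    retention_state_bytes = d_k * d_v * bytes_per_float       # fixed size, independent of length
    print(f"len={seq_len:>6}: KV cache {kv_cache_bytes / 2**20:6.2f} MiB, "
          f"retention state {retention_state_bytes / 2**20:6.2f} MiB")
```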

The authors conclude that RetNet is a strong successor to the Transformer for large language models. Its combination of training parallelism, low-cost inference, and competitive performance makes it an appealing choice for large-scale language models, particularly given its deployment efficiency. The authors plan to further scale up RetNet in terms of model size and training steps.

Reference: Retentive Network: A Successor to Transformer for Large Language Models, https://arxiv.org/abs/2307.086...