Key Points
1. The paper introduces the Retrieval-Enhanced Transformer (Retro), a method for modelling arbitrary text sequences while retrieving from databases with trillions of tokens, scaling the data available to a model by an order of magnitude beyond what is typically consumed during training. Retro's gains do not diminish for models with up to at least 7B parameters, and on certain datasets they correspond to non-retrieval models with 10× more parameters.
2. Retro models are flexible: they can be evaluated without retrieval and still achieve performance comparable to baseline models. Conversely, baseline models can be rapidly fine-tuned into Retro models and obtain nearly the same performance as Retro models trained from scratch.
3. The performance of Retro models is dataset-dependent, with the largest gains observed on Wikitext103 and the Pile, where retrieval models outperform previous models trained on large-scale datasets.
4. Careful analysis shows that only a modest fraction of Retro's gains are due to test set leakage. The authors nonetheless caution against such leakage in large-scale language datasets and call for further work on understanding its role in the performance of large language models.
5. The paper demonstrates that semi-parametric approaches such as Retro offer an orthogonal and more efficient path than raw parameter scaling as we seek to build more powerful language models.
6. Compared with samples produced with retrieval disabled, retrieval reduces hallucinations and makes the model more knowledgeable. Retrieval mechanisms also offer a path to reducing the compute required to train and update language models to a given level of performance.
7. Retro models are competitive with previous approaches on retrieval-intensive downstream tasks such as question answering, though further work is needed for them to compete with T5-fine-tuned models.
8. The approach offers a potential path to mitigating privacy issues, for example by removing sensitive data from the retrieval database at inference time or by combining retrieval with differentially private training.
9. The paper raises potential privacy, safety, and fairness issues and suggests that retrieval models can add a further source of bias through the selection mechanism for retrieval documents.
Summary
The authors introduce Retro, a retrieval-enhanced autoregressive language model that improves language modelling by conditioning on document chunks retrieved from a large corpus based on local similarity with preceding tokens. With a 2 trillion token database, Retro achieves performance comparable to much larger models such as GPT-3 and Jurassic-1 on the Pile while using 25 times fewer parameters.
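To make the retrieval step concrete, below is a minimal sketch of chunk-wise nearest-neighbour retrieval. It replaces the paper's frozen BERT embedder and approximate nearest-neighbour index (built over the 2 trillion token database) with a random embedding table and brute-force L2 search over a toy in-memory corpus; the constants and function names are illustrative, not taken from the paper's implementation.

```python
import numpy as np

CHUNK_LEN = 64    # tokens per chunk, as in the paper
K_NEIGHBOURS = 2  # neighbours retrieved per chunk (the paper varies this)

# Hypothetical stand-in for the frozen BERT encoder: mean of fixed random
# token embeddings, so the sketch runs without any pre-trained weights.
_RNG = np.random.default_rng(0)
_EMB_TABLE = _RNG.standard_normal((32_000, 128))  # assumed vocab size / dim

def embed_chunk(chunk_tokens: np.ndarray) -> np.ndarray:
    return _EMB_TABLE[chunk_tokens].mean(axis=0)

def build_index(database_tokens: np.ndarray):
    """Split the retrieval corpus into fixed-size chunks and embed each one."""
    n = len(database_tokens) // CHUNK_LEN
    chunks = database_tokens[: n * CHUNK_LEN].reshape(n, CHUNK_LEN)
    keys = np.stack([embed_chunk(c) for c in chunks])
    return chunks, keys

def retrieve_neighbours(input_tokens: np.ndarray, chunks, keys):
    """For each input chunk, return its K nearest database chunks by L2
    distance between frozen embeddings. (The paper also appends each
    neighbour's continuation chunk; omitted here for brevity.)"""
    out = []
    for i in range(len(input_tokens) // CHUNK_LEN):
        query = embed_chunk(input_tokens[i * CHUNK_LEN:(i + 1) * CHUNK_LEN])
        dists = np.linalg.norm(keys - query, axis=1)
        out.append(chunks[np.argsort(dists)[:K_NEIGHBOURS]])
    return np.stack(out)  # [n_input_chunks, K_NEIGHBOURS, CHUNK_LEN]
```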
Retro combines a frozen BERT retriever, a differentiable encoder, and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. As a semi-parametric approach, Retro can also be incorporated into existing pre-trained models by fine-tuning only the new retrieval-specific weights, at a small fraction of the original training cost. The study demonstrates that the model scales well with both model size and database size, achieving state-of-the-art results on a range of evaluation datasets, including Wikitext103 and the Pile.
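The chunked cross-attention mechanism can be pictured with the simplified, single-head NumPy sketch below: decoder positions from the last token of chunk i up to (but excluding) the last token of chunk i+1 attend to the encoded neighbours retrieved for chunk i, so no token conditions on retrieval results derived from its own future. Shapes, the residual wiring, and the final-chunk edge case are deliberately simplified; this is an illustration of the idea, not the paper's implementation.

```python
import numpy as np

def attention(q, k, v):
    """Plain single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def chunked_cross_attention(hidden, neighbour_enc, chunk_len=64):
    """hidden: [seq, d] decoder states; neighbour_enc: [n_chunks, r, d]
    encoder outputs for the retrieved neighbours of each input chunk."""
    seq, _ = hidden.shape
    n_chunks = seq // chunk_len
    out = hidden.copy()
    for i in range(n_chunks - 1):          # final-chunk edge case omitted
        start = (i + 1) * chunk_len - 1    # last token of chunk i
        end = (i + 2) * chunk_len - 1      # up to the last token of chunk i+1
        kv = neighbour_enc[i]              # neighbours of chunk i, [r, d]
        out[start:end] = hidden[start:end] + attention(hidden[start:end], kv, kv)
    return out

# Usage with random tensors: 2 chunks of 64 tokens, 2 neighbours of 128 tokens each.
h = np.random.randn(128, 16)
e = np.random.randn(2, 2 * 128, 16)
y = chunked_cross_attention(h, e, chunk_len=64)
```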
The authors also propose an evaluation methodology that accounts for the proximity of test documents to the training set, addressing the problem of test set leakage. By examining the benefits and implications of retrieval models, the study suggests new avenues for improving language models through explicit memory at an unprecedented scale.
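A rough sketch of how such a leakage-aware evaluation can be implemented: for each evaluation chunk, measure its overlap with the retrieved training chunks and average the loss only over chunks whose overlap stays below a threshold alpha. The longest-common-substring metric and the helper names below are assumptions chosen for illustration; they follow the spirit of the paper's eval/train overlap filtering rather than its exact implementation.

```python
from difflib import SequenceMatcher

def overlap_ratio(eval_chunk: str, train_chunk: str) -> float:
    """Fraction of the evaluation chunk covered by its longest common
    substring with a retrieved training chunk (assumed overlap metric)."""
    match = SequenceMatcher(None, eval_chunk, train_chunk).find_longest_match(
        0, len(eval_chunk), 0, len(train_chunk))
    return match.size / max(len(eval_chunk), 1)

def filtered_loss(eval_chunks, losses, retrieved, alpha=0.125):
    """Average the per-chunk loss over evaluation chunks whose maximum
    overlap with their retrieved training neighbours is at most alpha."""
    kept = [loss for chunk, loss, neighbours in zip(eval_chunks, losses, retrieved)
            if max(overlap_ratio(chunk, n) for n in neighbours) <= alpha]
    return sum(kept) / len(kept) if kept else float("nan")

# Example: the first chunk is leaked verbatim and is therefore excluded.
chunks = ["the cat sat on the mat", "a completely novel sentence"]
losses = [0.1, 2.3]
retrieved = [["the cat sat on the mat"], ["unrelated training text here"]]
print(filtered_loss(chunks, losses, retrieved, alpha=0.5))  # -> 2.3
```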
The paper presents extensive experimentation and evaluation to validate the performance and scalability of the proposed Retro model, offering insights into its potential benefits and addressing privacy, safety, and fairness concerns associated with large language models.
Reference: https://arxiv.org/abs/2112.04426