Key Points

1. The paper introduces BABILong, a new benchmark for evaluating the performance of NLP models in processing long documents with distributed facts.

2. Common methods are effective only for sequences up to 10^4 elements, but fine-tuning GPT-2 with recurrent memory augmentations enables it to handle tasks involving up to 11 × 10^6 elements.

3. Recent progress in machine learning has extended the input size of commonly used models by three orders of magnitude, but further increases in input sequence length are limited by the quadratic scaling of the compute required for self-attention in transformers.

4. The paper extends the bAbI benchmark to much longer contexts and develops the recurrent memory approach further by adding in-context retrieval based on the recurrent memory embeddings of input segments.

5. The paper evaluates the performance of GPT-4 and RAG on 'needle in a haystack' question answering tasks up to millions of tokens and demonstrates that the recurrent memory transformer sets a new record for the sequence size processed by a single model, extending the known capabilities of neural networks.

6. The paper evaluates various models on the BABILong dataset and demonstrates that the recurrent models consistently outperform their larger counterparts that utilize retrieval-augmented generation.

7. The Recurrent Memory Transformer (RMT) and its retrieval-augmented variant RMT-R exhibit consistent performance on sequences up to 128k tokens, with only marginal quality degradation.

8. The paper proposes a new approach to building datasets and testing language models' abilities to find and reason about specific information within contexts of millions of tokens, resulting in the BABILong benchmark, whose contexts can be made arbitrarily long (see the sketch after this list).

9. Although recurrent approaches like RMT are hindered by their sequential nature, they compensate with constant memory requirements; the experiments also provide insight into the limitations of popular LLMs in effective long-context utilization.
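
As a rough illustration of the benchmark construction in point 8, the sketch below scatters the task-relevant facts among irrelevant background sentences to form one long 'needle in a haystack' sample. The function name, signature, and sampling strategy are assumptions made for illustration, not the authors' actual generator.

```python
import random

def build_babilong_sample(facts, question, background_sentences, target_sentences):
    """Hide task-relevant facts inside a long stretch of irrelevant text.
    Illustrative sketch only; not the paper's actual data generator."""
    # Fill the context with background sentences up to the target size.
    n_filler = max(target_sentences - len(facts), 0)
    sample = random.choices(background_sentences, k=n_filler)

    # Insert each fact at a random position so the facts end up
    # distributed across the whole context rather than clustered.
    for fact in facts:
        sample.insert(random.randint(0, len(sample)), fact)

    # Append the question, as in needle-in-a-haystack question answering.
    return " ".join(sample + [question])
```

For example, with the bAbI-style facts "Mary moved to the bathroom." and "John went to the hallway." and the question "Where is Mary?", increasing `target_sentences` makes the context arbitrarily long without changing the underlying reasoning problem.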

Summary

The paper introduces the BABILong benchmark for evaluating generative transformer models' performance in handling extensive texts. Common methods are found to be effective only for sequences up to 10^4 elements, but fine-tuning GPT-2 with recurrent memory augmentations enables it to handle tasks involving up to 11 × 10^6 elements - a substantial improvement in processing capabilities for long sequences.

The study compares GPT-4, RAG, and fine-tuned GPT-2 with recurrent memory augmentations, demonstrating significant advancements in handling long sequences. The authors propose a new approach to building datasets and testing language models' abilities to find and reason about specific information within contexts of millions of tokens. The recurrent models RMT and RMT-R consistently outperform LLMs on long sequences, suggesting their potential for handling long contexts.
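
To make the recurrent memory idea concrete, here is a minimal sketch, assuming a PyTorch-style transformer backbone, of segment-by-segment processing in which only a fixed-size memory state is carried between segments. The tensor layout and function names are simplifications, not the authors' implementation.

```python
import torch

def process_long_input(segments, backbone, init_memory, num_mem_tokens):
    """Sketch of recurrent memory processing: a long input is split into
    segments, and a fixed-size memory state is passed from one segment
    to the next. Illustrative only; not the paper's exact architecture."""
    memory = init_memory  # (batch, num_mem_tokens, hidden)
    for segment in segments:  # each segment: (batch, seg_len, hidden)
        # Memory tokens are attached to the segment; the backbone then
        # processes the concatenation like an ordinary short input.
        inputs = torch.cat([memory, segment, memory], dim=1)
        outputs = backbone(inputs)
        # The trailing memory positions become the memory for the next
        # segment, so the carried state stays constant in size.
        memory = outputs[:, -num_mem_tokens:, :]
    return memory
```

Because each step sees only one segment plus the memory, compute grows linearly with input length while memory requirements stay constant, which is the trade-off of recurrent approaches noted in the key points.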

Additionally, the paper discusses augmentation with recurrent memory as a promising approach to extending the context window of transformers and evaluates the performance of GPT-4, Mistral, and GPT-3.5 on the BABILong dataset. The study highlights the limitations of popular LLMs in effective long-context utilization and underscores the potential of recurrence paired with a trainable self-retrieval mechanism to enhance generative models' capabilities.
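
The self-retrieval component can be pictured as retrieving memory states of previously processed segments by similarity to the current memory state, roughly as sketched below. The pooling, scoring, and top-k selection are simplifying assumptions rather than the RMT-R implementation.

```python
import torch
import torch.nn.functional as F

def retrieve_past_memories(current_memory, memory_bank, top_k=1):
    """Sketch of in-context retrieval over stored segment memories:
    score previously stored memory states against the current one and
    return the best matches. Simplified; not the RMT-R implementation."""
    if not memory_bank:
        return None
    # Pool each memory state into one vector per batch element for scoring.
    query = current_memory.mean(dim=1)                         # (batch, hidden)
    keys = torch.stack([m.mean(dim=1) for m in memory_bank])   # (n, batch, hidden)
    scores = F.cosine_similarity(keys, query.unsqueeze(0), dim=-1)  # (n, batch)
    top = scores.mean(dim=-1).topk(min(top_k, len(memory_bank))).indices
    # Retrieved memories can then be exposed to the current segment
    # (e.g. via cross-attention) to recover information from far back.
    return torch.cat([memory_bank[int(i)] for i in top], dim=1)
```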

Reference: https://arxiv.org/abs/2402.10790