Synopsis

1. Focused Transformer (FOT) for overcoming context length limitation

Researchers have developed a technique called Focused Transformer (FOT) to overcome the effective context length limitation of large language models. FOT uses a training procedure inspired by contrastive learning to improve the structure of the (key, value) space and thereby extend the usable context length. Fine-tuning pre-existing models with FOT produces the LONGLLAMA models, which show notable gains on tasks requiring long context. These models have been evaluated on a range of tasks and datasets, showing improvements in accuracy and perplexity. The researchers emphasize the importance of keeping keys and values differentiable during training and the benefit of negative examples for handling distractions in the multi-document setting. FOT is also compared to the Memorizing Transformer, and it achieves similar performance without the need for a gating mechanism. Future research directions include scaling up memory and exploring other contrastive learning methods.


2. Comparing Focused Transformer (FOT) with Memorizing Transformer (MT)


The article delves into Focused Transformer (FOT), a memory attention mechanism, and compares it with the baseline Memorizing Transformer (MT). FOT maintains a memory of (key, value) pairs and retrieves the k most relevant entries when computing attention. Retrieved entries can be integrated either through the standard Transformer attention formula or through a gating mechanism. The experiments show that FOT with the standard formula is as effective as the gating approach used in MT while being simpler and requiring fewer parameters.
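To make the two integration options concrete, here is a minimal single-query, single-head sketch in PyTorch. It is not the authors' code: the function names, the fixed scalar gate, and the exact-top-k retrieval are illustrative assumptions; the point is only that the "standard formula" applies one softmax over local plus retrieved keys, while the gated variant attends to each set separately and blends the results.

```python
import math
import torch

def fot_attention(q, local_k, local_v, mem_k, mem_v, top_k=4):
    # q: [dim]; local_k/local_v: [L, dim]; mem_k/mem_v: [M, dim]
    # Retrieve the top_k most relevant memory entries by inner product,
    # then apply a single softmax over local + retrieved keys
    # (the "standard Transformer formula", no gating).
    scores = mem_k @ q
    idx = scores.topk(min(top_k, mem_k.shape[0])).indices
    k = torch.cat([local_k, mem_k[idx]], dim=0)
    v = torch.cat([local_v, mem_v[idx]], dim=0)
    w = torch.softmax(k @ q / math.sqrt(q.shape[-1]), dim=-1)
    return w @ v

def gated_attention(q, local_k, local_v, mem_k, mem_v, top_k=4, gate=0.5):
    # MT-style alternative: attend to local and retrieved entries separately,
    # then blend the two outputs with a gate (a fixed scalar here; the original
    # Memorizing Transformer learns it, which is the extra parameter cost).
    scores = mem_k @ q
    idx = scores.topk(min(top_k, mem_k.shape[0])).indices
    d = math.sqrt(q.shape[-1])
    local_out = torch.softmax(local_k @ q / d, dim=-1) @ local_v
    mem_out = torch.softmax(mem_k[idx] @ q / d, dim=-1) @ mem_v[idx]
    return gate * local_out + (1.0 - gate) * mem_out
```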


3. Utilizing k-nearest neighbors algorithm and memory storage in FOT


FOT utilizes the k-nearest neighbors (kNN) algorithm for memory lookup and does not rely on positional encodings for memory entries. The memory is populated with (key, value) pairs produced while processing earlier parts of the input, and in the single-doc setting the memory is cleared after each document.
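A minimal sketch of such a memory is shown below, again as an assumption rather than the paper's implementation: it stores raw (key, value) tensors without positions, performs exact inner-product kNN (a production system would typically use an approximate index), and exposes a `clear()` call for the per-document reset.

```python
import torch

class KVMemory:
    """Illustrative (key, value) memory with exact kNN lookup."""

    def __init__(self):
        self.keys, self.values = [], []

    def add(self, keys, values):
        # Store (key, value) pairs produced while processing the input;
        # no positional information is kept with the entries.
        self.keys.append(keys)
        self.values.append(values)

    def lookup(self, query, k=8):
        keys = torch.cat(self.keys, dim=0)
        values = torch.cat(self.values, dim=0)
        scores = keys @ query                          # inner-product similarity
        idx = scores.topk(min(k, keys.shape[0])).indices
        return keys[idx], values[idx]                  # k nearest entries

    def clear(self):
        # Single-doc setting: wipe the memory at each document boundary.
        self.keys, self.values = [], []
```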


4. Training procedure and handling global information in FOT


During training, FOT exposes chosen attention layers to a mixture of (key, value) pairs: those from the current local context, those from the previous local context of the same document, and those from previous local contexts of other documents, which act as negatives. The input pipeline is modified so that each document keeps a fixed batch index across steps. This procedure enables FOT to handle global information and differentiate through all (key, value) pairs, facilitating joint training of well-structured embeddings.
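The following is a hypothetical sketch of how that cross-batch construction could look; the function name, shapes, and the way negatives are picked are assumptions, not the authors' code. Because each batch slot holds the same document across steps, the previous step's keys and values for slot i are the positive context for document i, and other slots supply distractor negatives. Everything stays an ordinary tensor, so gradients flow through all (key, value) pairs.

```python
import torch

def crossbatch_kv(prev_keys, prev_values, num_neg=1):
    """Build per-document key/value sets from the previous step.

    prev_keys, prev_values: [batch, ctx_len, dim], where batch slot i holds
    the previous local context of document i (fixed batch index).
    Returns keys and values of shape [batch, (1 + num_neg) * ctx_len, dim].
    """
    batch = prev_keys.shape[0]
    ks, vs = [], []
    for i in range(batch):
        negatives = [j for j in range(batch) if j != i][:num_neg]
        take = [i] + negatives                  # own previous context + distractors
        ks.append(torch.cat([prev_keys[j] for j in take], dim=0))
        vs.append(torch.cat([prev_values[j] for j in take], dim=0))
    return torch.stack(ks), torch.stack(vs)
```

The chosen attention layers would then attend over these tensors together with the current local context, which is what lets the model learn to focus on its own document's entries and ignore the distractors.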


5. Performance of Focused Transformer (FOT) in various tasks


The experiments demonstrate that FOT performs well in various tasks, such as handling distractions in text and retrieving information from a large dictionary. The model achieves good performance even with short documents and surpasses the MT baseline in certain scenarios.


6. Distinctions and potential combination with Memorizing Transformer (MT)


The article also highlights the distinctions between FOT and MT, focusing on the training procedure, the use of memory during training, and the differentiation through retrieved tokens. The authors speculate about the potential benefits of combining the two approaches and provide a proof-of-concept experiment.


7. Experimental setup and datasets used


The experiments encompassed diverse datasets, including PG-19, arXiv, GitHub, and Isabelle. The models were trained on TPU virtual machines, and multiple runs were conducted to assess the significance of the results.
Overall, the results demonstrate the effectiveness of FOT in leveraging memory and its potential for various tasks, making it a promising approach for memory attention in Transformer models.

Reference: https://arxiv.org/abs/2307.031...