Key Points

1. The paper explores alternative architectural choices for language models, focusing on LLaMA and Mistral in comparison to GPT2. It highlights the key architectural differences between these models, such as MLP layers with gated activation, differences in weight tying, and how parameters are allocated between the attention and MLP layers (a minimal sketch contrasting the two MLP styles follows this list).

2. These architecture choices are evaluated in two regimes: a 1000-exposure setting, in which each piece of knowledge is seen 1000 times during training, and a 100-exposure setting. Architecture has a negligible impact on the scaling law in the 1000-exposure setting, while in the 100-exposure setting certain architectures lag behind GPT2's scaling law, indicating differences in learning speed and training stability.

3. It also explores reducing the size of GPT2's MLP layers and removing the MLP layers entirely, examining how the size and presence of MLP layers affect the model's capacity (the Summary below reports that the effect is surprisingly small).

4. The paper provides detailed comparisons between architectures, including LLaMA, Mistral, and the reduced GPT2 variants (quarter-size MLP and no MLP), and offers insights into how these architectural choices influence learning speed and training stability.

5. The paper documents the hyperparameter choices for the experiments, including the learning rates and batch sizes used for each model architecture in each setting.

6. The paper also quantizes models trained in mixed-precision fp16 to int8 and int4, showing that int8 quantization leaves the capacity essentially unchanged, while int4 quantization reduces the capacity ratio by more than 2x (see the quantization sketch after this list).

7. The paper gives an overview of the experimental setups and compares the scaling laws of the different model architectures across exposure settings through controlled, carefully parameterized experiments.

8. Finally, the paper discusses the findings and their implications, highlighting how different architectural choices affect learning speed, training stability, and model capacity, and offering insights into the relative performance of language model architectures under different training regimes.
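To make the architectural difference in point 1 concrete, below is a minimal PyTorch sketch contrasting a GPT2-style MLP (a single up-projection with GeLU) with a LLaMA/Mistral-style gated MLP (SwiGLU-like, with an elementwise gate). The dimensions, class names, and usage are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT2MLP(nn.Module):
    """GPT2-style MLP: one up-projection, GeLU, one down-projection."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))

class GatedMLP(nn.Module):
    """LLaMA/Mistral-style gated MLP (SwiGLU-like): two up-projections,
    one acting as an elementwise gate on the other."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

# Illustrative usage: same input and output shapes, different parameter layout.
x = torch.randn(2, 16, 768)
print(GPT2MLP(768, 4 * 768)(x).shape)   # torch.Size([2, 16, 768])
print(GatedMLP(768, 4 * 768)(x).shape)  # torch.Size([2, 16, 768])
```

Note that at equal hidden width the gated variant has roughly 1.5x the MLP parameters (three projections instead of two), which is one reason the two model families allocate parameters between attention and MLP differently.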
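Point 6's int8 quantization step can be illustrated with a generic sketch; this is not the paper's exact procedure. PyTorch's `torch.quantization.quantize_dynamic` (namespaced under `torch.ao.quantization` in newer releases) converts the weights of `nn.Linear` modules to int8 after training; int4 typically requires external tooling and is not shown. The toy model below stands in for a trained checkpoint.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained transformer MLP block; in practice this would be
# the full pretrained model loaded from a checkpoint.
model = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
).eval()

# Post-training dynamic quantization: Linear weights are stored in int8,
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
with torch.no_grad():
    # Small rounding error relative to the float model.
    print((model(x) - quantized(x)).abs().max())
```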

Summary

The study finds that the LLaMA architecture performs comparably to GPT2, with the remaining discrepancies mitigated by tying weights in both the LLaMA and Mistral architectures. Contrary to conventional belief, even reducing the MLP size or eliminating all MLP layers in GPT2 does not affect its capacity ratio, indicating that attention layers are also capable of storing knowledge. The results do show architectural differences in the insufficient-training (100-exposure) regime, where the LLaMA architecture's capacity ratio is 1.3x worse than GPT2's, even for large models.

Furthermore, the research explores the impact of data quality on the scaling laws for useful knowledge capacity. It finds that the presence of low-quality data significantly degrades knowledge capacity unless training time is substantially increased. Modifying the training data to add diversity, for example by marking where each document comes from, enhances the model's ability to extract knowledge and can increase its effective capacity.
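As a purely hypothetical illustration of that kind of diversity signal, the sketch below prepends a source tag to each training document so that a model trained on the mixture can learn which sources carry useful knowledge. The `documents` structure, tag format, and source names are invented for illustration and are not the paper's preprocessing.

```python
# Hypothetical preprocessing: mark each document with its source so a model
# trained on the mixture can learn which sources carry useful knowledge.
documents = [
    {"source": "wikipedia.org", "text": "Ada Lovelace was born in 1815."},
    {"source": "junkforum.example", "text": "asdf lol random chatter ..."},
]

def tag_with_source(doc: dict) -> str:
    # Prepend the source as a plain-text prefix before tokenization.
    return f"<source:{doc['source']}> {doc['text']}"

training_texts = [tag_with_source(d) for d in documents]
print(training_texts[0])
# <source:wikipedia.org> Ada Lovelace was born in 1815.
```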

The study also discusses the knowledge storage efficiency of Mixture-of-Experts (MoE) architectures, highlighting that they can still store knowledge effectively even though only a fraction of their parameters is active for each token. Additionally, it investigates the impact of fine-tuning on the extractability of knowledge, examining the accuracy of both memorized and extractable knowledge.
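For readers unfamiliar with MoE layers, here is a minimal top-1-routing sketch in PyTorch: a gating network sends each token to a single expert MLP, so only a fraction of the layer's parameters is used per token. This is a generic illustration under assumed dimensions, not the specific MoE configuration studied in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Minimal Mixture-of-Experts layer with top-1 routing (illustrative only)."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route each token to its highest-scoring expert.
        scores = F.softmax(self.router(x), dim=-1)   # (tokens, n_experts)
        weight, idx = scores.max(dim=-1)             # (tokens,)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # Scale by the router weight so gradients reach the router.
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = Top1MoE(d_model=64, d_hidden=256, n_experts=4)
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```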

In summary, the paper lays out a methodology for studying the scaling laws of language models and quantifies how model size, architecture, training parameters, and data characteristics affect knowledge storage capacity, offering comprehensive insights into how much knowledge language models can store and extract and how architectural and data choices shape that capacity.

Reference: https://arxiv.org/abs/2404.054...