Key Points

1. The paper presents a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation.

2. Two distinct pruning strategies are explored: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning. The results are evaluated on common benchmarks from the LM Evaluation Harness.

3. The pruned models are aligned with NeMo Aligner and tested in instruct-tuned versions. This approach produces a 4B model from Llama 3.1 8B and a state-of-the-art 8B model from Mistral NeMo 12B.

4. Lacking access to the original training data, the authors find it beneficial to lightly fine-tune the teacher models on the distillation dataset, which they refer to as "teacher correction". This helps address the data distribution mismatch and improves distillation performance (a minimal sketch of this step follows the list).

5. The width-pruned Llama-3.1-Minitron-4B model outperforms the depth-pruned variant across a variety of benchmarks, including MMLU, HumanEval, and reasoning tasks. However, the depth-pruned model provides higher inference speedup.

6. For depth pruning, the authors find that dropping contiguous layers is more effective than importance-based non-contiguous pruning for downstream task performance.

7. Compared to similarly-sized models, the MN-Minitron-8B demonstrates superior accuracy across the board, outperforming the recent Llama 3.1 8B model while using 40x fewer training tokens.

8. The instruction-tuned Llama-3.1-Minitron 4B variants show strong instruction-following and roleplay capabilities, only lagging behind Gemma2 on some benchmarks, while achieving state-of-the-art performance on retrieval-based question answering and function-calling.

9. The Llama-3.1-Minitron-4B models provide significant inference speedup, with the depth-pruned variant achieving an average throughput improvement of 2.7x over the original Llama 3.1 8B model.
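
To make the teacher-correction step in point 4 concrete, the sketch below shows one plausible implementation: the teacher is briefly fine-tuned with a standard language-modeling loss on the same corpus later used for distillation. The model identifier, hyperparameters, and data handling are illustrative assumptions, not the paper's NeMo-based training setup.

```python
# Hypothetical sketch of "teacher correction": lightly fine-tune the teacher
# on the distillation corpus before using it as the distillation target.
# Model name, learning rate, and batching are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token            # Llama has no pad token
optimizer = torch.optim.AdamW(teacher.parameters(), lr=1e-5)
teacher.train()

def correction_step(texts):
    """One standard causal-LM step on a batch of distillation-corpus text."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100      # ignore padding in the loss
    loss = teacher(**batch, labels=labels).loss      # cross-entropy over tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```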

Summary

Research Topic and Methodology
The research paper presents a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation techniques. Two distinct pruning strategies, depth pruning and joint hidden/attention/MLP (width) pruning, are explored and evaluated on common benchmarks from the LM Evaluation Harness.
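
As an illustration of the depth-pruning strategy, the sketch below removes a contiguous block of decoder layers from a Llama-style model. The layer indices and model identifier are illustrative assumptions; the paper selects which block to drop based on its measured effect on accuracy.

```python
# Minimal sketch of depth pruning: remove a contiguous block of decoder layers.
# Indices and the model identifier are illustrative; the choice of block is
# driven by its impact on downstream accuracy.
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

drop_start, drop_end = 16, 32      # drop a contiguous block of 16 layers
kept_layers = [layer for idx, layer in enumerate(model.model.layers)
               if not (drop_start <= idx < drop_end)]

model.model.layers = nn.ModuleList(kept_layers)    # Llama-style decoder stack
model.config.num_hidden_layers = len(kept_layers)  # keep the config consistent

# The shallower model is then retrained with distillation to recover accuracy.
```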

The models are then aligned with NeMo Aligner and evaluated in instruction-tuned versions, resulting in a compelling 4B model from Llama 3.1 8B and a state-of-the-art Mistral-NeMo-Minitron-8B (MN-Minitron-8B) model from Mistral NeMo 12B. Additionally, the researchers open-sourced their base model weights on Hugging Face with a permissive license, providing access to the models Mistral-NeMo-Minitron-8B-Base, Llama-3.1-Minitron-4B-Width-Base, and Llama-3.1-Minitron-4B-Depth-Base. The study reveals that combining weight pruning with knowledge distillation significantly reduces the cost of training large language model (LLM) families.

Compressed from Llama 3.1 8B and Mistral NeMo 12B down to 4B and 8B parameters, respectively, the resulting models outperform similarly sized models across common language modeling benchmarks. The authors also find that fine-tuning the teacher models on the distillation dataset is beneficial in the absence of access to the original training data. The paper discusses the pruning process, highlighting structured pruning techniques such as neuron, attention head, convolutional filter, and depth pruning, and it emphasizes the importance of teacher correction, in which the teacher model is fine-tuned on the target dataset that will be used for distillation.
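
To illustrate the width-pruning side, the simplified sketch below scores the intermediate neurons of a single MLP layer by their mean absolute activation on a small calibration batch and keeps only the highest-scoring ones. It is a stand-in for the paper's full importance-estimation procedure, which also covers attention heads and embedding channels; the function and its arguments are hypothetical.

```python
# Simplified sketch of activation-based width pruning for one MLP block:
# score each intermediate neuron on calibration data, keep the top fraction.
import torch
import torch.nn as nn

def prune_mlp_neurons(up_proj: nn.Linear, down_proj: nn.Linear,
                      calib_hidden: torch.Tensor, keep_ratio: float = 0.5):
    """up_proj: d_model -> d_ff, down_proj: d_ff -> d_model,
    calib_hidden: calibration activations of shape (num_tokens, d_model)."""
    with torch.no_grad():
        # Importance of each intermediate neuron = mean |activation|.
        acts = up_proj(calib_hidden)                    # (num_tokens, d_ff)
        importance = acts.abs().mean(dim=0)             # (d_ff,)

        n_keep = max(1, int(keep_ratio * importance.numel()))
        keep = importance.topk(n_keep).indices.sort().values

        # Slice the kept rows/columns out of both projections.
        new_up = nn.Linear(up_proj.in_features, n_keep,
                           bias=up_proj.bias is not None)
        new_up.weight.copy_(up_proj.weight[keep])
        if up_proj.bias is not None:
            new_up.bias.copy_(up_proj.bias[keep])

        new_down = nn.Linear(n_keep, down_proj.out_features,
                             bias=down_proj.bias is not None)
        new_down.weight.copy_(down_proj.weight[:, keep])
        if down_proj.bias is not None:
            new_down.bias.copy_(down_proj.bias)
    return new_up, new_down
```

The same scoring idea extends to attention heads and embedding channels, which the width-pruning configuration prunes jointly.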

The study explores retraining strategies, including conventional training and knowledge distillation with supervision from the unpruned teacher model. It also evaluates the instruction-following capabilities of the distilled models and provides performance comparisons for various models on specific benchmarks. Additionally, the paper presents observations on the impact of teacher correction, the effectiveness of width versus depth pruning, the throughput improvement of the compressed models, and the findings from a series of ablation studies.
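
Because the retraining relies on knowledge distillation from the unpruned teacher, a minimal sketch of a logit-distillation loss is included below: the pruned student is trained to match the teacher's next-token distribution through a KL-divergence term, with the teacher kept frozen. The temperature scaling and reduction are illustrative choices rather than the paper's exact loss configuration.

```python
# Minimal sketch of logit distillation: train the pruned student to match the
# (corrected) teacher's next-token distribution with a KL-divergence loss.
# Temperature scaling and the reduction are illustrative choices.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged over the batch.
    Both inputs have shape (batch, seq_len, vocab_size); the teacher's logits
    are computed under torch.no_grad() so only the student receives gradients."""
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    # F.kl_div takes the student's log-probs first and the teacher's
    # (log-)probs as the target.
    return F.kl_div(s_logprobs, t_logprobs, log_target=True,
                    reduction="batchmean") * temperature ** 2
```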

Results and Contributions
Overall, the research paper provides detailed insights into the compression of large language models using pruning and distillation techniques, highlighting the benefits, methodology, and performance improvements achieved with the compressed models. The findings are supported by thorough evaluations on various benchmarks, providing valuable contributions to the field of language model compression.

Reference: https://arxiv.org/abs/2408.11796