Key Points
1. The necessity of a large SFT corpus has been debated, with previous works arguing that fewer than 10K SFT instances are enough to produce satisfactory results. In the paper's experiments, however, performance on the IFEval benchmark declined significantly when fewer than 10K instances were used, underscoring the need for sufficient data to equip a model like DeepSeek-V2 with the desired capabilities. The quality of SFT data is also crucial, especially for writing tasks and open-ended questions.
2. Human preference alignment markedly improved performance on open-ended generation benchmarks, but it also introduced an "alignment tax" that hurt performance on some standard benchmarks. To alleviate this, the authors invested considerable effort in data processing and in improving the training strategies of the reinforcement learning stage, ultimately reaching a tolerable trade-off between performance on standard benchmarks and open-ended benchmarks.
3. The paper found that the online approach to preference alignment significantly outperforms the offline approach, which motivated the substantial effort spent implementing an online reinforcement learning framework for aligning DeepSeek-V2.
4. DeepSeek-V2, a large MoE language model supporting a 128K context length, is characterized by strong performance, economical training, and efficient inference. Compared with DeepSeek 67B, it achieves significantly stronger performance while saving training costs, reducing the KV cache, and boosting maximum generation throughput.
5. DeepSeek-V2 and its chat versions share limitations common to other large language models, such as the lack of ongoing knowledge updates after pre-training, the possibility of generating non-factual information, and limited proficiency in languages other than Chinese and English.
6. DeepSeek aims to invest in open-source large models with a long-term perspective; ongoing exploration focuses on methods to further scale up MoE models while keeping training and inference costs economical.
7. The alignment team continuously works to enhance the models, aiming to build a model that is not only helpful but also honest and safe for users worldwide, aligning the model's values with human values while minimizing the need for human supervision.
8. DeepSeek-V2 currently supports the text modality only, but there are plans to extend the model to multiple modalities, broadening its versatility and utility across a wider range of scenarios.
9. The conclusion outlines the innovative architecture of DeepSeek-V2, its performance improvements over previous models, its acknowledged limitations, and future research directions.
Summary
The research paper presents DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. The model comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. The paper introduces two innovative architectural components: Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA significantly reduces the Key-Value (KV) cache during inference by compressing it into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through fine-grained expert segmentation and shared expert isolation. The model is pre-trained on a high-quality, multi-source corpus and then undergoes Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential.
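To make the MLA idea more concrete, the following is a minimal PyTorch sketch of low-rank KV compression: each token's hidden state is down-projected into a small latent vector, only that latent is cached, and per-head keys and values are reconstructed from it when attention is computed. All module names and dimensions are illustrative assumptions, and details such as positional encoding and causal masking are omitted; this is a sketch of the idea, not the paper's implementation.

```python
import torch
import torch.nn as nn


class LatentKVAttention(nn.Module):
    """Sketch of MLA-style attention with low-rank KV compression.

    Instead of caching full per-head keys and values, each token's hidden
    state is down-projected to a small latent c_kv; only c_kv is cached,
    and keys/values are reconstructed from it at attention time.
    """

    def __init__(self, d_model=1024, n_heads=8, d_head=128, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.w_q = nn.Linear(d_model, n_heads * d_head, bias=False)
        # Down-projection to the shared KV latent (this is what gets cached).
        self.w_dkv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections reconstruct per-head keys and values from the latent.
        self.w_uk = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_uv = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        # x: (batch, new_tokens, d_model); kv_cache: (batch, past_tokens, d_latent)
        b, t, _ = x.shape
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        c_kv = self.w_dkv(x)                      # (b, t, d_latent)
        if kv_cache is not None:
            c_kv = torch.cat([kv_cache, c_kv], dim=1)
        k = self.w_uk(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_uv(c_kv).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        # Return the latent cache rather than k/v: storing d_latent floats per
        # token instead of 2 * n_heads * d_head is where the memory saving
        # behind the reported KV-cache reduction comes from.
        return self.w_o(out), c_kv
```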
The evaluation results show that DeepSeek-V2 achieves significantly stronger performance than its predecessor DeepSeek 67B while saving 42.5% of training costs, reducing the KV cache by 93.3%, and boosting the maximum generation throughput to 5.76 times. DeepSeek-V2 achieves top-tier performance among open-source models with only 21B activated parameters. Evaluation results are reported for English and Chinese open-ended conversation benchmarks as well as standard benchmarks and specialized tasks such as code and math evaluations.
The DeepSeek-V2 architecture includes the Multi-head Latent Attention design, which reduces the KV cache required for generation, and the DeepSeekMoE architecture, which adopts fine-grained expert segmentation and shared expert isolation for greater potential in expert specialization. Training uses device-limited routing and auxiliary losses to maintain load balance. The paper also describes the data construction, hyper-parameters, and infrastructure used for training, and illustrates the long-context extension of the model, which extends the context window from 4K to 128K tokens with robust performance across all context window lengths.
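To illustrate the routing structure described above, here is a minimal PyTorch sketch of a DeepSeekMoE-style layer: a few shared experts process every token, many small routed experts are selected per token by a top-k gate, and a simplified expert-level balance loss discourages uneven expert usage. Expert counts, dimensions, and the loss form are illustrative assumptions; device-limited routing and the paper's exact auxiliary-loss formulation are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoELayer(nn.Module):
    """Sketch of a DeepSeekMoE-style layer: always-on shared experts plus
    many fine-grained routed experts chosen per token by a top-k gate."""

    def __init__(self, d_model=1024, d_ff=256, n_shared=2, n_routed=64, top_k=6):
        super().__init__()

        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):
        # x: (tokens, d_model) -- batch and sequence dims flattened together.
        scores = F.softmax(self.gate(x), dim=-1)          # (tokens, n_routed)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)

        # Shared experts are isolated from routing and process every token.
        out = sum(expert(x) for expert in self.shared)

        # Routed experts process only the tokens whose top-k selected them.
        for j, expert in enumerate(self.routed):
            token_ids, slot = (top_idx == j).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            w = top_w[token_ids, slot].unsqueeze(-1)       # gate weight per token
            out = out.index_add(0, token_ids, w * expert(x[token_ids]))

        # Simplified expert-level balance loss: penalizes experts whose selection
        # frequency and mean gate probability are jointly high, pushing the gate
        # toward a more uniform load across routed experts.
        n_routed = len(self.routed)
        chosen = F.one_hot(top_idx, num_classes=n_routed).float()
        freq = chosen.sum(dim=(0, 1)) / (x.shape[0] * self.top_k)
        balance_loss = n_routed * (freq * scores.mean(dim=0)).sum()
        return out, balance_loss
```

The paper's device-limited routing additionally caps how many devices a token's selected experts may span; the sketch above does not model that constraint or any device-level balancing.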
In addition to the quantitative evaluation results, the paper discusses the training and inference efficiency of DeepSeek-V2: training costs are reduced and inference efficiency is significantly improved compared with previous models. The paper also covers data-bias mitigation, the representation of different Chinese tasks in the evaluation, and several ablation studies that support the model's strong performance across various tasks. Overall, DeepSeek-V2 demonstrates significant advances in performance and efficiency over existing language models, particularly on code and math evaluations and on Chinese benchmarks.
Reference: https://arxiv.org/abs/2405.044...