Key Points

1. The paper introduces the Qwen2.5-Coder series, a significant upgrade over its predecessor, CodeQwen1.5. The series comprises two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B.

2. Qwen2.5-Coder is a code-specific model series built on the Qwen2.5 architecture and further pretrained on a vast corpus of over 5.5 trillion tokens.

3. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general versatility.

4. The models have been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks covering code generation, completion, reasoning, and repair, and consistently outperforming even larger models.

5. The two models in the series, Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B, differ in architectural configuration, notably in the number of layers, the hidden size, and the number of attention heads.

6. The researchers have constructed a large-scale, coding-specific pretraining dataset comprising over 5.5 trillion tokens, which includes source code data, text-code grounding data, synthetic data, math data, and text data.

7. The researchers have implemented a three-stage training approach for Qwen2.5-Coder: file-level pretraining, repo-level pretraining, and instruction tuning.

8. To transform Qwen2.5-Coder into a coding assistant, the researchers have developed a carefully designed instruction-tuning dataset covering a wide range of coding-related problems and solutions (a brief usage sketch follows this list).

9. The release of the Qwen2.5-Coder series aims to advance code intelligence research and promote widespread adoption in real-world applications, facilitated by permissive licensing.
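
As a rough illustration of the coding-assistant use case from point 8, here is a minimal sketch that loads the instruction-tuned 7B model through the Hugging Face transformers library and asks it a coding question. The Hub model ID and the generation settings are assumptions made for illustration; check the official Qwen2.5-Coder release for the exact repository names.

```python
# Minimal usage sketch, assuming the instruct variant is published on the
# Hugging Face Hub under the ID below (verify the exact name before use).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed Hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
]

# apply_chat_template formats the conversation with the model's chat markup
# and returns the prompt as token IDs.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Strip the prompt tokens and print only the newly generated answer.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```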

Summary

Model Overview

The Qwen2.5-Coder series is a significant upgrade to the previous CodeQwen1.5 model, featuring two new versions: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. These models are built on the Qwen2.5 architecture and pretrained on an extensive corpus of over 5.5 trillion tokens comprising code, mathematics, and general text data. Through careful data cleaning, synthetic data generation, and balanced data mixing, the Qwen2.5-Coder models demonstrate impressive code generation capabilities while retaining general versatility.
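
The report attributes much of this quality to how the data sources are balanced. As a minimal sketch of what mixture-weighted sampling looks like, the snippet below computes how strongly each source would need to be up- or down-sampled to hit a target composition. All figures in it, including the 70/20/10 code/text/math split, are illustrative assumptions rather than values quoted from the paper.

```python
# Sketch of mixture-weighted sampling across heterogeneous pretraining sources.
# Token counts and the target split below are illustrative placeholders.

# Hypothetical raw token counts per source, in billions of tokens.
source_tokens = {"code": 3800.0, "text": 1200.0, "math": 500.0}

# Hypothetical target share of each source in the final training stream.
target_mix = {"code": 0.70, "text": 0.20, "math": 0.10}

def mixture_factors(tokens: dict[str, float], mix: dict[str, float]) -> dict[str, float]:
    """Return how much each source must be up- or down-sampled (factor > 1 means
    repeating data, factor < 1 means subsampling) so that the token-level
    composition of the training stream matches the target mixture."""
    total = sum(tokens.values())
    natural = {s: tokens[s] / total for s in tokens}
    return {s: mix[s] / natural[s] for s in tokens}

for source, factor in mixture_factors(source_tokens, target_mix).items():
    print(f"{source:>4}: sample at {factor:.2f}x its natural rate")
```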

The Qwen2.5-Coder series has been rigorously evaluated on a wide range of code-related tasks, including code generation, completion, reasoning, and repair. The models have achieved state-of-the-art performance across more than 10 benchmarks, consistently outperforming even larger models. Notably, the Qwen2.5-Coder-7B model surpasses much larger models of over 20 billion parameters in several evaluations.
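
Code-generation benchmarks of this kind (HumanEval- and MBPP-style suites) are typically scored with pass@k. The snippet below is a small sketch of the standard unbiased pass@k estimator used throughout the field; it is generic evaluation methodology, not a procedure specific to this report.

```python
# Unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
# where n samples are drawn per problem and c of them pass the unit tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated per problem, c: samples that pass, k: evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset is guaranteed to contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with made-up counts: 200 samples per problem, 37 of which pass.
print(f"pass@1  = {pass_at_k(200, 37, 1):.3f}")
print(f"pass@10 = {pass_at_k(200, 37, 10):.3f}")
```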

Beyond code tasks, the Qwen2.5-Coder series also demonstrates strong mathematical reasoning and general natural language understanding, underscoring its overall versatility. The researchers further developed techniques such as multilingual synthetic data generation and long-context modeling to enhance the models' capabilities.
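
One concrete piece of the completion and long-context training alluded to here is the fill-in-the-middle (FIM) and repository-level packing format used during pretraining. The sketch below shows what constructing such prompts could look like; the special-token strings follow the report's description but should be verified against the released tokenizer configuration, and the helper functions are hypothetical.

```python
# Sketch of FIM and repo-level packing prompts. Token strings are taken from
# the report's description and should be checked against the actual tokenizer.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"
REPO_NAME, FILE_SEP = "<|repo_name|>", "<|file_sep|>"

def fim_prompt(prefix: str, suffix: str) -> str:
    """File-level FIM: the model is asked to generate the missing middle span."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

def repo_context(repo: str, files: dict[str, str]) -> str:
    """Repo-level packing: concatenate a repository's files with separators to
    form a single long-context training sample."""
    parts = [f"{REPO_NAME}{repo}"]
    for path, content in files.items():
        parts.append(f"{FILE_SEP}{path}\n{content}")
    return "".join(parts)

print(fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))"))
```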

Future Directions

The release of the Qwen2.5-Coder series aims to push the boundaries of research in code intelligence and encourage broader adoption by developers in real-world applications, facilitated by the models' permissive licensing. By continuing to scale up the data and model size, the researchers plan to further enhance the reasoning abilities of these code-specific language models and explore their full potential.

Reference: https://arxiv.org/abs/2409.12186