Key Points

1. The paper explores 8-bit (FP8) low-precision training for large language models (LLMs), aiming to reduce computational costs and memory usage and thereby make LLM training more efficient.

2. Most existing LLM training systems use FP32 full precision or FP16/BF16 mixed precision; however, the paper argues that training LLMs in reduced-precision FP8 poses new challenges because of FP8's narrower dynamic range and lower representation precision.

3. The proposed FP8 mixed-precision training framework incorporates 8-bit collective communication, an 8-bit optimizer, and 8-bit distributed parallel training in an incremental manner, making it, according to the authors, the first work to infuse FP8 compute, storage, and communication into the entire LLM training process.

4. Experimental results on the Nvidia H100 GPU platform demonstrate that the proposed FP8 methodology reduces communication overhead, cuts memory usage, and improves system utilization, yielding a 75% training speed-up for the GPT-175B model.

5. The paper compares models trained with FP8 mixed precision against those trained with higher-precision BF16 and shows that the FP8 pre-trained models match their BF16 counterparts in both training performance and zero-shot evaluation results.

6. Additionally, the proposed FP8 framework is applied to fine-tuning LLMs for instruction following and to reinforcement learning with human feedback, showing performance comparable to higher-precision training while delivering significant improvements in training speed.

7. The work identifies challenges associated with reduced precision for the optimizer's variables and with underflow and overflow during FP8 gradient all-reduce communication, and proposes techniques such as precision decoupling and automatic scaling to address them effectively during training (a sketch of the automatic-scaling idea follows this list).

8. The authors also acknowledge the limitations of previous low-precision training schemes, such as FP16 and BF16, and highlight the emerging importance of FP8 for reducing training costs and improving the efficiency of large language model training.

9. The paper lists the contributions of the various authors involved in the project, attributing key roles to individuals in developing and implementing the FP8 mixed-precision training framework, conducting experiments, and contributing to the advancements in low-precision training for LLMs.
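
As a concrete illustration of the automatic scaling mentioned in point 7, below is a minimal sketch in PyTorch. It assumes a recent PyTorch build that exposes torch.float8_e4m3fn and an initialized process group; the class name, thresholds, and update rule are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.distributed as dist

FP8_E4M3_MAX = 448.0  # largest finite value representable in the E4M3 format


class AutoScaler:
    """Maintains a scaling factor mu that is applied to gradients before the
    FP8 all-reduce and removed afterwards (illustrative sketch only)."""

    def __init__(self, mu: float = 1.0, overflow_threshold: float = 0.001):
        self.mu = mu
        self.overflow_threshold = overflow_threshold

    def all_reduce_fp8(self, grad: torch.Tensor, world_size: int) -> torch.Tensor:
        mu = self.mu
        scaled = grad * mu

        # Adjust mu for the next step: shrink it when too many values would
        # saturate the FP8 range (overflow), grow it slowly otherwise so that
        # small gradients do not underflow to zero.
        overflow_ratio = (scaled.abs() >= FP8_E4M3_MAX).float().mean().item()
        if overflow_ratio > self.overflow_threshold:
            self.mu = mu * 0.5
        else:
            self.mu = mu * 2.0 ** (1.0 / 1000.0)

        # Cast to FP8 for communication; casting back to FP32 here only
        # simulates the precision loss, since real FP8 collectives need
        # dedicated kernel support.
        fp8_grad = scaled.to(torch.float8_e4m3fn).to(torch.float32)
        dist.all_reduce(fp8_grad, op=dist.ReduceOp.SUM)
        return fp8_grad / (mu * world_size)
```

In a real multi-rank system the overflow statistic itself would be all-reduced so that every worker updates mu identically and stays in sync.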

Summary

FP8 Low-Precision Training Framework for Large Language Models
The paper proposes a new FP8 low-precision training framework for large language models. The main goal is to reduce the computational and memory costs of training large models by using low-precision 8-bit data formats. The authors introduce an automatic mixed-precision framework that incorporates three incremental levels of FP8 utilization (FP8 gradients and collective communication, an FP8 optimizer, and FP8 distributed parallel training) to streamline mixed-precision and distributed parallel training for LLMs.
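
As a small illustration of how these three levels build on one another, here is a configuration sketch; the field and function names are invented for illustration, since the paper does not prescribe this exact interface.

```python
from dataclasses import dataclass


@dataclass
class FP8TrainingConfig:
    fp8_gradients_and_allreduce: bool = False  # level 1: FP8 gradients and collective communication
    fp8_optimizer_states: bool = False         # level 2: low-precision optimizer states and master weights
    fp8_distributed_parallelism: bool = False  # level 3: FP8 in distributed parallel training


def config_for_level(level: int) -> FP8TrainingConfig:
    """Each level enables everything below it, mirroring the incremental way
    the paper brings FP8 into the training stack."""
    return FP8TrainingConfig(
        fp8_gradients_and_allreduce=level >= 1,
        fp8_optimizer_states=level >= 2,
        fp8_distributed_parallelism=level >= 3,
    )


print(config_for_level(2))  # level 2 enables FP8 communication plus the FP8 optimizer
```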

The experimental results show that the proposed FP8 mixed-precision training framework reduces both memory usage and training time, achieving a 39% reduction in real memory usage and 75% faster training for a GPT-175B model on the Nvidia H100 GPU platform compared to the widely adopted BF16 framework (Megatron-LM). Furthermore, the FP8 low-precision training methodology can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning costs.

The authors also provide an analysis comparing the maximum model sizes attainable with either the prevalent BF16 scheme or their FP8 mixed-precision training approach on a cluster of Nvidia H100 GPUs with 80 GB of memory. The paper's findings suggest that the proposed FP8 low-precision training framework can be a significant contribution to the design of next-generation low-precision training systems dedicated to large foundation models.
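
A back-of-the-envelope sketch of that kind of memory comparison is shown below; the per-parameter byte counts are common mixed-precision accounting assumptions, not figures quoted from the paper.

```python
GIB = 1024 ** 3


def model_state_bytes_per_param(weights, grads, master_weights, adam_m, adam_v):
    """Bytes of model state kept per parameter (weights, gradients, master
    weights, and the two Adam moments)."""
    return weights + grads + master_weights + adam_m + adam_v


# A typical BF16 mixed-precision recipe: BF16 weights and gradients,
# FP32 master weights and FP32 Adam moments.
bf16_recipe = model_state_bytes_per_param(2, 2, 4, 4, 4)  # 16 bytes per parameter

# An FP8 recipe with precision decoupling: FP8 weights, gradients, and first
# moment; FP16 master weights and FP16 second moment (assumed breakdown).
fp8_recipe = model_state_bytes_per_param(1, 1, 2, 1, 2)   # 7 bytes per parameter

params = 175e9  # GPT-175B
print(f"BF16 model states: {params * bf16_recipe / GIB:,.0f} GiB")
print(f"FP8  model states: {params * fp8_recipe / GIB:,.0f} GiB")
```

Even this rough accounting shows why the maximum trainable model size per GPU grows substantially once gradients and optimizer states move to lower precision.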

The authors also discuss various aspects including the challenges of training LLMs with FP8, experiments validating the proposed FP8 low-precision framework, and performance comparisons with existing mixed-precision schemes. The paper concludes by acknowledging the contributions of the co-authors and expressing expectations for the impact of the proposed FP8 framework in the field of large language models.

AI Industry Shift towards FP8 Specification
The research paper discusses the shift in the AI industry from high-precision to low-precision training, focusing specifically on the adoption of the FP8 specification for AI model training. The FP8 specification defines two sub-formats, E5M2 and E4M3, which trade off dynamic range against precision: E5M2 offers a wider range, while E4M3 offers higher precision. However, compared to higher-precision data formats such as FP16 and FP32, FP8 has a narrower representable range and lower precision. To address these challenges, the paper explores tensor scaling techniques, specifically global, block-wise, and layer-wise gradient scaling, which aim to mitigate range and precision degradation when using FP8 to train large models.
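
A minimal sketch of per-tensor scaling into FP8 follows, assuming a PyTorch build (2.1 or later) that exposes torch.float8_e4m3fn and torch.float8_e5m2; the values 448 and 57344 are the largest finite magnitudes of E4M3 and E5M2, and the helper names are illustrative.

```python
import torch

FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}  # largest finite values of the two sub-formats


def quantize_fp8(x: torch.Tensor, fmt: str = "e4m3"):
    """Scale a tensor so its largest magnitude lands near the FP8 maximum,
    cast it to FP8, and return (fp8_tensor, scale) for later dequantization."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = FP8_MAX[fmt] / amax
    fp8_dtype = torch.float8_e4m3fn if fmt == "e4m3" else torch.float8_e5m2
    return (x * scale).to(fp8_dtype), scale


def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) / scale


# Values this small would lose most of their precision in raw FP8 without scaling.
x = torch.randn(4, 4) * 1e-3
x_fp8, s = quantize_fp8(x, "e4m3")
print((x - dequantize_fp8(x_fp8, s)).abs().max())  # quantization error stays small
```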

Pre-Training Data Collection Processes
The paper also details the pre-training data collection process. Data are collected from various sources such as RedPajama, The Pile, Hugging Face, and CommonCrawl snapshots, with processing steps that include language identification, deduplication, and fuzzy deduplication. Additionally, Python code data is collected from GitHub using a stringent cleaning process to ensure high-quality data suitable for academic research.
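
A simplified, self-contained sketch of the fuzzy-deduplication step (MinHash over word shingles) follows; the hashing scheme, shingle size, and similarity threshold are assumptions for illustration, not the paper's actual pipeline.

```python
import hashlib


def shingles(text: str, n: int = 5):
    """Set of n-word shingles used as the document's feature set."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(text: str, num_hashes: int = 64):
    """MinHash signature: for each seeded hash, keep the minimum over shingles."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles(text))
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Near-duplicate documents get a high estimated similarity and would be
# collapsed; a threshold around 0.8 is a typical (assumed) choice.
a = minhash_signature("the quick brown fox jumps over the lazy dog near the river bank")
b = minhash_signature("a quick brown fox jumps over the lazy dog near the river bank")
print(estimated_jaccard(a, b))
```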

Overall, the paper provides a comprehensive understanding of the challenges and techniques associated with implementing low-precision training in AI model development, specifically focusing on the FP8 specification, as well as the complexities involved in the pre-training data collection process.

Reference: https://arxiv.org/abs/2310.18313