Key Points

1. Adam-mini is a new optimizer that achieves on-par or better performance than AdamW while using 45% to 50% less memory.

2. Adam-mini reduces memory by cutting down the number of learning rates used in the standard Adam optimizer, which assigns an individual learning rate (via its second-moment estimate v) to every parameter.

3. The key insight is that for Transformer models the Hessian has a near block-diagonal structure, and for each of these dense sub-blocks there exists a single high-quality learning rate that can outperform Adam's individual per-parameter rates, provided enough resources are spent to find it.

4. Adam-mini partitions the model parameters into blocks based on the structure of the Hessian, and assigns a single learning rate to each block, rather than an individual rate for each parameter.

5. This partitioning strategy, which assigns a single rate per Hessian sub-block, is shown to perform on-par or better than Adam while using 90% fewer learning rate resources.

6. Empirically, Adam-mini matches or outperforms AdamW on various language models sized from 125M to 7B parameters for pre-training, supervised fine-tuning, and reinforcement learning from human feedback.

7. The reduced memory footprint of Adam-mini also leads to higher training throughput, with 49.6% higher throughput than AdamW when pre-training the 7B Llama2 model.

8. The authors propose a general principle for partitioning parameters based on the smallest dense sub-blocks of the Hessian, which can be applied to neural network architectures beyond Transformers (illustrated in the sketch after this list).

9. While the current design of learning-rate assignment in Adam-mini is sufficient to match Adam's performance, the authors believe there is room for further improvement through more sophisticated learning-rate designs for each Hessian sub-block.
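
As a rough illustration of the partition principle in points 4 and 8, the sketch below groups a Transformer's parameters into blocks that would each receive one learning rate. It is my reading of the principle, not the authors' reference implementation: the module-name checks (q_proj, k_proj, embed, lm_head) assume Llama-style naming, and the paper's special handling of embedding and output layers is only hinted at.

```python
import torch.nn as nn

def partition_for_adam_mini(model: nn.Module, num_heads: int):
    """Group parameters into blocks, one shared learning rate per block.

    Illustrative only: module names are Llama-style, and the
    embedding/output-layer treatment is simplified.
    """
    blocks = {}
    for name, p in model.named_parameters():
        if "q_proj" in name or "k_proj" in name:
            # Query/Key weights: the Hessian sub-blocks are per attention
            # head, so split head-wise -> one block (one rate) per head.
            rows_per_head = p.shape[0] // num_heads
            for h in range(num_heads):
                blocks[f"{name}.head{h}"] = p[h * rows_per_head:(h + 1) * rows_per_head]
        elif "embed" in name or "lm_head" in name:
            # Embedding/output layers are treated separately in the paper
            # (kept closer to vanilla Adam); here we just tag them.
            blocks[f"{name}.per_coord"] = p
        else:
            # Remaining parameter tensors: one block each, matching the
            # default PyTorch partition.
            blocks[name] = p
    return blocks
```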

Summary

The paper introduces a new optimizer called Adam-mini, which aims to achieve similar or better performance than AdamW while requiring 45% to 50% less memory. This reduction is achieved by cutting down the learning-rate resources in Adam, namely the per-parameter terms 1/√v. The key ideas behind Adam-mini are to carefully partition the parameters into blocks based on the Hessian structure, to assign a single good learning rate to each parameter block, and to provide a cost-effective way of finding such learning rates.
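
To make the per-block learning-rate idea concrete, here is a minimal sketch contrasting a standard Adam step with an Adam-mini-style step on a single parameter block. It is a simplification under my own assumptions (bias correction and weight decay omitted, illustrative function names, the block-level v kept as a 0-dim tensor), not the paper's reference code.

```python
import torch

def adam_update(p, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam: v has one entry per parameter, so each coordinate
    # effectively gets its own learning rate lr / (sqrt(v_i) + eps).
    m.mul_(beta1).add_(g, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    p.add_(-lr * m / (v.sqrt() + eps))

def adam_mini_update(p, g, m, v_block, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam-mini style: the whole block shares one scalar v, tracked via the
    # mean of squared gradients over the block, so a single learning rate
    # lr / (sqrt(v_block) + eps) is applied to every coordinate in the block.
    m.mul_(beta1).add_(g, alpha=1 - beta1)          # momentum stays per-parameter
    v_block = beta2 * v_block + (1 - beta2) * g.pow(2).mean()
    p.add_(-lr * m / (v_block.sqrt() + eps))
    return v_block                                  # one scalar per block
```

Keeping a single scalar v per block instead of one entry per parameter is where the memory saving comes from.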

Empirical findings demonstrate that Adam-mini achieves comparable or better performance than AdamW on various language models across tasks such as pre-training, supervised fine-tuning, and reinforcement learning from human feedback. Adam-mini also achieves higher throughput and reduces communication overheads among GPUs, ultimately saving time for pre-training.

The paper highlights the memory burden Adam imposes when training large language models, and explains how reducing memory benefits CPU offloading, sharding, and communication among GPUs and CPUs, leading to higher throughput as well as cost and energy savings. It also discusses why modifying Adam without sacrificing performance is challenging: the roles of Adam's m and v components are not fully understood, and it is unclear which parts of v can safely be cut.

The proposal of Adam-mini is motivated by the finding that similar or better performance can be achieved with far fewer learning rates than Adam uses. The paper gives a detailed description of the proposed optimizer, including the Hessian-based partitioning strategy, the resulting memory savings, and the higher throughput achieved compared to AdamW.
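
As a rough back-of-the-envelope check of the headline memory figure (my own arithmetic under an fp32-optimizer-state assumption, not a measurement from the paper):

```python
# Optimizer-state accounting for a model with N parameters, fp32 states.
N = 7_000_000_000                 # e.g. a 7B-parameter model
adamw_bytes = 2 * N * 4           # m and v, 4 bytes each  -> ~56 GB
adam_mini_bytes = 1 * N * 4       # m only; v collapses to a few scalars per block
print(1 - adam_mini_bytes / adamw_bytes)   # ~0.5, i.e. roughly half the optimizer state
# Roughly consistent with the reported 45%-50% figure; the gap presumably comes
# from the second-moment entries that Adam-mini still retains.
```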

The paper provides a comprehensive evaluation of Adam-mini across various language models and tasks, demonstrating its efficacy and efficiency in comparison to AdamW and other memory-efficient methods such as Adafactor, CAME, and SM3. Additionally, it discusses the potential for further optimization of the learning-rate design, leaving this as an important future direction. Overall, the paper presents Adam-mini as a promising optimizer for reducing memory consumption without compromising performance in training large language models.

Reference: https://arxiv.org/abs/2406.167...