Key Points

- The paper addresses scaling model capacity through increased width to improve the performance of neural networks for language and vision.

- It discusses the importance of adapting network initialization and learning strategies as the model width increases to prevent numerical instabilities and ensure efficient learning.

- The paper reviews the Maximal Update Parameterization (µP): a set of scaling rules for the weight initialization scheme, learning rates, and network architecture that ensure stability and maximal feature learning in the infinite-width limit, in contrast to standard and kernel parameterizations.

- It presents a non-rigorous but intuitive argument for efficient fine-tuning with LoRA, and validates it empirically by fine-tuning GPT-2, RoBERTa-base, and Llama-7b models across a range of tasks, hyperparameters, and configurations, evaluating accuracy and loss.

- The paper compares the learning rates that are optimal for test accuracy with those optimal for train accuracy, and examines how fine-tuning precision and model configuration affect both.

- The study presents detailed experimental setups, including model specifications, datasets, training methods, and hyperparameters, for each model and task.

- The paper provides empirical results and visualizations to complement the findings and discussions presented in the main text, shedding light on the behavior of the models in various scenarios and configurations.
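To make the µP bullet above concrete, here is a minimal, hypothetical sketch of µP-style width scaling. The function names, base width, and base learning rate are illustrative assumptions, not the paper's exact prescriptions, which depend on the layer type and the optimizer:

```python
import numpy as np

# Toy illustration of µP-style scaling rules for a hidden layer:
# init variance shrinks like 1/fan_in, and the learning rate shrinks
# like 1/width as the model is widened (names here are hypothetical).

def mup_hidden_init(fan_in, fan_out, rng):
    # Initialize a hidden weight matrix with variance 1/fan_in
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))

def mup_hidden_lr(base_lr, width, base_width=256):
    # Scale the hidden-layer learning rate like 1/width
    return base_lr * base_width / width

rng = np.random.default_rng(0)
W = mup_hidden_init(1024, 1024, rng)   # std ~ 1/32
lr = mup_hidden_lr(1e-3, width=1024)   # 4x wider than base -> 4x smaller lr
```

The point of such rules is that hyperparameters tuned at a small base width transfer to wider models without retuning.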

Summary

The paper argues that standard Low-Rank Adaptation (LoRA) fine-tunes large-width models suboptimally, and compares it with the proposed algorithm LoRA+, which uses different learning rates for the LoRA adapter matrices A and B. When models are scaled in width to increase capacity, the network initialization and learning rates should be adapted to avoid numerical instabilities and ensure efficient learning.
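As a minimal sketch of the LoRA+ idea, here is one gradient step on a toy squared loss with a frozen weight W and trainable adapters A and B. The variable names, the loss, and the particular ratio eta_B = 16 * eta_A are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2             # model width and LoRA rank
x = rng.normal(size=d)  # one input vector

W = rng.normal(size=(d, d)) / np.sqrt(d)  # frozen pretrained weight
A = rng.normal(size=(r, d)) / np.sqrt(d)  # A: Gaussian init
B = np.zeros((d, r))                      # B: zero init (standard LoRA)

def forward(x):
    # h = W x + B A x ; only A and B are trained
    return W @ x + B @ (A @ x)

target = rng.normal(size=d)

# One gradient step on L = 0.5 * ||h - target||^2
h = forward(x)
g = h - target                  # dL/dh
grad_B = np.outer(g, A @ x)     # dL/dB
grad_A = B.T @ np.outer(g, x)   # dL/dA (zero on step 1, since B = 0)

eta_A = 1e-2
eta_B = 16 * eta_A  # LoRA+: train B with a larger learning rate than A
A -= eta_A * grad_A
B -= eta_B * grad_B
```

In a full implementation this would typically be done by placing A and B in separate optimizer parameter groups with different learning rates.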

The paper introduces the Maximal Update Parameterization (µP), a set of scaling rules for the initialization scheme, learning rate, and network architecture that ensure stability and feature learning in the infinite-width limit. Building on this, it mathematically proves that LoRA fine-tuning is inefficient when the adapter weights are initialized in the standard way and trained by gradient descent with the same learning rate for A and B, and demonstrates that efficiency is recovered by using different learning rates for A and B.

The results are supported by experiments on MLPs and by fine-tuning GPT-2, RoBERTa-base, and Llama-7b models on various tasks and datasets. Efficient fine-tuning is shown to be achieved with specific learning-rate configurations, and the paper provides detailed experimental setups and evaluations for these models. It presents empirical results, including accuracy and loss heatmaps for different configurations, and discusses the learning rates that are optimal for test versus train accuracy.

Furthermore, it evaluates average accuracy on the MMLU benchmark using 5-shot prompting and reports results for full-precision fine-tuning and for different values of the LoRA rank. The experiments also demonstrate the model's potential for overfitting and the impact of learning-rate configurations on train and test loss.

Reference: https://arxiv.org/abs/2402.12354