Key Points
1. The study investigates the inductive biases and scaling behaviors of different finetuning methods for large language models (LLMs) in a data-limited regime.
2. Systematic experiments were conducted to study the impact of LLM model size, pretraining data size, new finetuning parameter size, and finetuning data size on finetuning performance.
3. The study proposes a multiplicative joint scaling law for LLM finetuning, which generalizes across settings and describes how finetuning performance scales jointly with finetuning data size and each of the other scaling factors (an illustrative sketch of this form appears after this list).
4. It is found that increasing LLM model size benefits LLM finetuning more than scaling pretraining data, and that scaling PET (parameter-efficient tuning) parameter sizes is generally ineffective.
5. The study indicates that the optimal finetuning method is highly task- and finetuning data-dependent.
6. Machine translation and multilingual summarization are used as downstream finetuning tasks, as both require cross-lingual understanding and generation.
7. Analysis suggests that finetuning data size has a more pronounced influence on full-model tuning (FMT) than on parameter-efficient tuning (PET), and that PET relies more on LLM model and pretraining data scaling than on finetuning data scaling.
8. The study examines the critical finetuning data size at which different finetuning methods cross over in performance, and evaluates finetuning's impact on the generalization capability of the base LLM.
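To make the joint scaling law concrete, the snippet below sketches one plausible functional form consistent with the description above. This is an assumption for illustration only: the paper's exact parameterization may differ, and the symbols A, E, alpha, and beta are placeholder fitted constants.

    # Illustrative (assumed) form of a power-based multiplicative joint scaling law.
    # X is the second scaling factor (LLM model size, pretraining data size, or
    # PET parameter size), D_f is the finetuning data size, and A, E, alpha, beta
    # are constants that would be fitted per task and finetuning method.
    def joint_scaling_loss(X, D_f, A, E, alpha, beta):
        """L_hat(X, D_f) = A / (X**alpha * D_f**beta) + E"""
        return A / (X ** alpha * D_f ** beta) + E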
Summary
Research Focus and Key Findings
The research paper investigates the inductive biases and scaling properties of different finetuning methods for large language models (LLMs). The study systematically examines the impact of scaling factors such as LLM model size, pretraining data size, new finetuning parameter size, and finetuning data size on finetuning performance. The research considers two types of finetuning, full-model tuning (FMT) and parameter-efficient tuning (PET), and explores their scaling behaviors in the data-limited regime. The findings reveal a power-based multiplicative joint scaling law between finetuning data size and each of the other scaling factors, show that LLM model scaling and pretraining data scaling affect finetuning differently, and indicate that the optimal finetuning method is task- and finetuning data-dependent.
Experimental Results and Analysis
The study includes experiments on bilingual machine translation and multilingual summarization benchmarks using pretrained bilingual LLMs ranging from 1B to 16B parameters. The results indicate that LLM finetuning follows a power-based multiplicative joint scaling law between finetuning data size and each of the other scaling factors. The study suggests that LLM finetuning benefits more from LLM model scaling than from pretraining data scaling, and that PET parameter scaling is generally ineffective. It also highlights the task- and finetuning data-dependent nature of the optimal finetuning method, offering insights into the understanding, selection, and development of LLM finetuning methods. Additionally, the research addresses the impact of finetuning on the generalization capability of the base LLM, showing that while finetuning on task-specific data improves task-specific performance, it may specialize the base LLM towards the task and hurt the model's generalization. Furthermore, the study examines the effect of finetuning on the few-shot capability of the base LLM, indicating that FMT may reduce the few-shot performance of LLMs, while PET behaves more robustly and retains most of the LLM's few-shot capability regardless of model size and pretraining data size.
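As a rough illustration of how such a joint law could be fit to empirical measurements, the sketch below applies SciPy's curve_fit to synthetic placeholder data generated from the assumed functional form; the form itself is an assumption based on the description above, and none of the numbers are results from the paper.

    # Minimal fitting sketch, assuming the multiplicative form described above.
    # All sizes and losses are synthetic placeholders, not the paper's data.
    import numpy as np
    from scipy.optimize import curve_fit

    def joint_law(xd, A, E, alpha, beta):
        """Assumed form: L_hat = A / (X**alpha * D_f**beta) + E."""
        X, D_f = xd
        return A / (X ** alpha * D_f ** beta) + E

    # Synthetic grid: model size factor (e.g., billions of parameters) crossed
    # with finetuning data factor (e.g., tens of thousands of examples).
    X = np.repeat([1.0, 8.0, 16.0], 4)
    D_f = np.tile([1.0, 4.0, 16.0, 64.0], 3)
    rng = np.random.default_rng(0)
    loss = joint_law((X, D_f), A=2.0, E=1.5, alpha=0.3, beta=0.15)
    loss = loss + rng.normal(0.0, 0.01, size=loss.shape)

    # Recover the constants from the synthetic observations.
    params, _ = curve_fit(joint_law, (X, D_f), loss, p0=[1.0, 1.0, 0.2, 0.2])
    print(dict(zip(["A", "E", "alpha", "beta"], np.round(params, 3).tolist())))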
Insights from Experiments and Conclusions
Taken together, the experiments characterize the interplay among LLM model size, pretraining data size, finetuning parameter size, and finetuning data size in the data-limited regime. The fitted multiplicative joint scaling law, the finding that LLM model scaling benefits finetuning more than pretraining data scaling, and the task- and finetuning data-dependent nature of the optimal method provide practical guidance for understanding, selecting, and developing finetuning methods for large language models.
Reference: https://arxiv.org/abs/2402.171...