Key Points

- Introduction to the high cost of full-parameter fine-tuning of Large Language Models (LLMs) and the development of parameter-efficient fine-tuning methods.

- Introduction of ASTRAIOS, a suite of 28 instruction-tuned OctoCoder models using 7 tuning methods and 4 model sizes up to 16 billion parameters.

- Findings that full-parameter fine-tuning generally leads to the best downstream performance across all scales, and parameter-efficient fine-tuning methods differ significantly in their efficacy based on the model scale.

- LoRA usually offers the most favorable trade-off between cost and performance for instruction-tuned models (see the sketch after this list).

- Exploration of the effects of these methods on model robustness and code security, revealing that larger models tend to be less robust to perturbed inputs and more prone to generating insecure code.

- Relationships among updated parameters, cross-entropy loss, and task performance are explored; tuning effectiveness observed in small models is found to generalize well to larger models.

- The best PEFT methods for Code LLMs are identified, and open directions are outlined: moving beyond task-specific LLMs, covering more diverse domains, broadening evaluation, and improving scalability.

- Detailed model choices, training configurations, and evaluations are provided for easy reproduction of the experimental results.

- The study emphasizes the importance of understanding the behavior of instruction-tuned Code LLMs, promoting comprehensive evaluation and inspiring further research.
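
As a concrete illustration of the LoRA cost/performance point above, here is a minimal sketch of LoRA-style instruction tuning set up with the Hugging Face `peft` library. The checkpoint name, rank, and target modules are illustrative assumptions, not necessarily the paper's exact ASTRAIOS configuration.

```python
# Minimal LoRA setup sketch using Hugging Face `transformers` and `peft`.
# Hyperparameters and checkpoint below are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "bigcode/starcoderbase-1b"  # assumed StarCoder base checkpoint
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA freezes the base weights and injects trainable low-rank matrices
# into selected projection layers.
config = LoraConfig(
    r=8,                        # low-rank dimension (hypothetical value)
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in StarCoder-style blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

The resulting model can then be passed to a standard `transformers` training loop; only the low-rank adapter weights receive gradient updates, which is what keeps the tuning cost low.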

Summary

The paper investigates parameter-efficient fine-tuning (PEFT) methods for Large Language Models trained on code (Code LLMs) under the instruction-tuning paradigm. It introduces ASTRAIOS, a suite of 28 instruction-tuned Code LLMs produced by applying 7 tuning methods to 4 StarCoder base model sizes of up to 16 billion parameters. The experiments compare full-parameter fine-tuning (FFT) with PEFT methods in terms of downstream performance, model robustness, and code security, and highlight opportunities for further research.

The paper finds that FFT generally leads to the best downstream performance across all scales, while PEFT methods differ significantly in efficacy depending on model scale; LoRA usually offers the most favorable trade-off between cost and performance. The study also examines how the different tuning methods scale and identifies the need for further exploration of diverse domains and more inclusive PEFT evaluation. On code comprehension and generation tasks, larger tuned models demonstrate reduced robustness and weaker security. Moreover, tuning effectiveness observed in small models generalizes well to larger models, and validation loss during instruction tuning can serve as a reliable indicator of overall downstream performance.
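
To make the validation-loss finding concrete, the sketch below computes held-out cross-entropy loss for a causal code LM; the checkpoint and the held-out example are assumptions chosen for illustration, not the paper's evaluation setup.

```python
# Sketch: measuring validation cross-entropy loss for a causal code LM.
# Checkpoint and held-out example are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigcode/starcoderbase-1b"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

held_out = (
    "Question: Write a function that reverses a string.\n"
    "Answer: def reverse(s):\n    return s[::-1]\n"
)
inputs = tokenizer(held_out, return_tensors="pt")

with torch.no_grad():
    # For causal LMs, passing labels=input_ids returns the mean
    # token-level cross-entropy loss over the sequence.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"validation cross-entropy: {loss.item():.3f}")
```

Tracking this quantity on a held-out instruction set during tuning gives the kind of proxy signal the paper describes for anticipating downstream performance.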

In addition, the paper evaluates the robustness of code generation models and the security of the code they produce: larger PEFT-tuned Code LLMs perform better on code generation tasks but are more vulnerable to adversarial examples and more biased towards insecure code. The paper underscores the importance of understanding these models through comprehensive evaluation and discusses the relationships among updated parameters, cross-entropy loss, and task performance. These findings contribute to understanding how tuning methods work and are intended to inspire follow-up research.
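
To illustrate the kind of adversarial evaluation described here, the sketch below applies a simple character-swap perturbation to a prompt. This transformation is an assumed stand-in for the perturbations used in robustness benchmarks for code generation, not the paper's exact procedure.

```python
# Illustrative prompt perturbation for robustness testing of code LMs.
# The character-swap transformation is an assumption; real benchmarks
# apply a broader set of natural-looking perturbations.
import random

def perturb_docstring(prompt: str, n_swaps: int = 2, seed: int = 0) -> str:
    """Swap adjacent characters in the prompt to simulate natural typos."""
    rng = random.Random(seed)
    chars = list(prompt)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

original = '"""Return the sum of two integers."""'
perturbed = perturb_docstring(original)
print(perturbed)
# A robust model should produce equivalent code for both prompts;
# comparing pass rates on original vs. perturbed prompts quantifies robustness.
```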

Overall, the study provides insights into the best PEFT methods for Code LLMs, scalability of different tuning methods, model robustness, and code security, offering valuable contributions to the field of large language models for code understanding and generation.

Reference: https://arxiv.org/abs/2401.00788