Key Points
- Training large language models (LLMs) from scratch requires significant resources; merging existing pre-trained LLMs can be a more cost-effective approach.
- The paper introduces the concept of knowledge fusion for LLMs, aiming to combine the capabilities of existing LLMs and transfer them into a single, more powerful LLM.
- The proposed knowledge fusion approach leverages the generative distributions of source LLMs to externalize their collective knowledge and unique strengths, transferring them to the target LLM through lightweight continual training (a training-objective sketch follows this list).
- The paper experiments with three popular open-source LLMs with distinct architectures and functionalities (Llama-2, OpenLLaMA, and MPT) and validates the effectiveness of the fusion approach across various benchmarks and tasks, including reasoning, commonsense, and code generation.
- The knowledge fusion approach improves the target model's performance across a range of capabilities, outperforming each source LLM and the baseline method on most tasks.
- The approach prioritizes the fusion of multiple LLMs through knowledge externalization and transfer, distinguishing itself from traditional ensemble and weight merging techniques.
- The paper also discusses prior work on integrating capabilities from diverse models, with traditional approaches falling into two categories: model ensemble and weight merging.
- Knowledge distillation, commonly used for model compression, trains a student model under the guidance of one or more teacher models. The knowledge fusion framework in this paper resembles multi-teacher knowledge distillation but differs from it in significant ways.
- The paper details the token alignment and fusion strategies used to fuse different LLMs in the proposed FUSELLM method, and evaluates the method on various benchmarks to demonstrate its effectiveness.
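To make the continual-training step concrete, here is a minimal PyTorch-style sketch of a combined objective in the spirit of the paper: a standard causal LM loss on the training corpus plus a fusion term that pulls the target model's next-token distribution toward the fused distribution aggregated from the source LLMs. The function name, tensor layout, and mixing weight `lam` are illustrative assumptions, not the paper's exact implementation.

```python
import torch.nn.functional as F

def fusellm_style_loss(target_logits, target_ids, fused_probs, lam=0.9):
    """Combined objective sketch: causal LM loss + fusion loss.

    target_logits: (batch, seq, vocab) logits from the target model
    target_ids:    (batch, seq) gold token ids of the training text
    fused_probs:   (batch, seq, vocab) fused source distribution,
                   already aligned to the target model's vocabulary
    lam:           mixing weight between the two terms (illustrative)
    """
    vocab = target_logits.size(-1)

    # Standard causal LM loss: position t predicts token t+1.
    clm_loss = F.cross_entropy(
        target_logits[:, :-1].reshape(-1, vocab),
        target_ids[:, 1:].reshape(-1),
    )

    # Fusion loss: cross-entropy between the fused source distribution
    # and the target model's predicted distribution at each position.
    log_probs = F.log_softmax(target_logits, dim=-1)
    fusion_loss = -(fused_probs * log_probs).sum(dim=-1).mean()

    return lam * clm_loss + (1.0 - lam) * fusion_loss
```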
Summary
The paper explores the challenge of integrating capabilities from different large language models (LLMs) to create a unified model. The proposed method, FUSELLM, leverages the generative distributions of source LLMs to transfer their collective knowledge and individual strengths to a target LLM through lightweight continual training. The paper discusses aligning tokenizations and fusing probability distributions from diverse LLMs, demonstrating the superior potential of FUSELLM in combining the capabilities of structurally distinct LLMs compared to traditional methods.
The paper emphasizes the importance of the fusion function in effectively combining the strengths of source LLMs and highlights LLM fusion as a promising avenue for exploration, especially given the diverse structures and substantial sizes of current LLMs. It also compares FUSELLM with traditional ensemble and weight merging methods, and presents empirical evidence that FUSELLM improves a target model's performance across capabilities such as reasoning, commonsense, and code generation.
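As an illustration of what a fusion function can look like, the sketch below keeps, for each training text, the source distribution matrix that best predicts the gold tokens (i.e., the one with the lowest cross-entropy). This is a simplified reading of one possible strategy; the function name, tensor layout, and the assumption that the source distributions are already mapped onto a common vocabulary are all illustrative.

```python
import torch
import torch.nn.functional as F

def fuse_min_cross_entropy(source_prob_list, gold_ids):
    """Keep the source distribution with the lowest cross-entropy
    against the gold tokens of one training text.

    source_prob_list: list of (seq, vocab) probability matrices, one
                      per source LLM, aligned to a common vocabulary
    gold_ids:         (seq,) gold token ids for the same text
    """
    best_probs, best_ce = None, float("inf")
    for probs in source_prob_list:
        # Cross-entropy of this source's predictions on the text.
        ce = F.nll_loss(torch.log(probs + 1e-12), gold_ids).item()
        if ce < best_ce:
            best_probs, best_ce = probs, ce
    return best_probs
```

An alternative to selecting a single source is to average the source distributions, for example weighted by how well each one predicts the text.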
Additionally, the paper provides implementation details of the FUSELLM method, including its token alignment and fusion strategies. It also presents evaluation results on benchmarks covering reasoning, commonsense, and code generation, showing consistent improvements over the original LLMs and demonstrating the potential of FUSELLM to combine the collective knowledge and strengths of diverse LLMs.
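Because the source LLMs use different tokenizers, their probability distributions are defined over different token sequences and vocabularies, so some form of token alignment is needed before fusion. The toy helper below only illustrates the core idea by mapping each target token to the source token that covers its starting character; it assumes both tokenizations reproduce the same raw text, and it ignores subword markers and the more careful matching the paper describes. All names here are hypothetical.

```python
def align_tokens(src_tokens, tgt_tokens):
    """Map each target token to the index of the source token that
    covers its starting character. Toy illustration only: assumes the
    two token lists concatenate to exactly the same string.
    """
    mapping = []                  # mapping[i] = aligned source token index
    src_idx = 0
    src_end = len(src_tokens[0])  # char offset where source token 0 ends
    tgt_start = 0                 # char offset where the next target token starts
    for tgt_tok in tgt_tokens:
        # Advance through source tokens until one covers tgt_start.
        while tgt_start >= src_end and src_idx + 1 < len(src_tokens):
            src_idx += 1
            src_end += len(src_tokens[src_idx])
        mapping.append(src_idx)
        tgt_start += len(tgt_tok)
    return mapping
```

For example, `align_tokens(["Hel", "lo", " world"], ["Hello", " wor", "ld"])` returns `[0, 2, 2]`: the first target token is aligned to the first source token, and the last two target tokens are aligned to the third.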
Reference: https://arxiv.org/abs/2401.10491