Key Points

1. Large Language Models (LLMs) such as ChatGPT and LLaMA have demonstrated proficiency in reasoning, planning, and learning from experience, but their capabilities in non-English languages are limited because they are pretrained primarily on English-dominant corpora.

2. An empirical study based on LLaMA was conducted to investigate language capability transfer to non-English languages. Factors analyzed include vocabulary extension, further pretraining, and instruction tuning.

3. The impact of vocabulary extension was evaluated: further pretraining with 0.5 billion Chinese tokens on the original vocabulary significantly outperforms the same pretraining on an extended vocabulary, suggesting that vocabulary extension may not be suitable for small-scale incremental pretraining (see the tokenizer sketch after this list).

4. It was observed that enhancing LLaMA's response quality requires instruction tuning with hundreds of thousands of instruction examples rather than large-scale further pretraining.

5. Exclusive reliance on Chinese corpora for transfer training compromises LLaMA's original English proficiency; this can be alleviated through multilingual joint training (see the data-mixing sketch after this list).

6. The evaluation results demonstrate knowledge level and response quality comparable to state-of-the-art transfer models while using less than 1% of the pretraining data, with similar trends observed across thirteen low-resource languages.

7. The study identified code-switching instances during transfer training, indicating that cross-lingual alignment is internalized within the model.

8. Multilingual language models, such as mBERT and XLM-R, have demonstrated high cross-lingual transferability.

9. The study offers guidance to the community for developing non-English LLMs, helping to address the resource gap and improve cross-lingual transferability.
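
Key point 3 concerns extending LLaMA's tokenizer with Chinese tokens before further pretraining. The snippet below is a minimal sketch of what vocabulary extension looks like with a Hugging Face transformers LLaMA checkpoint; the checkpoint path and the list of added tokens are illustrative placeholders, not the paper's actual configuration.

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Hypothetical local checkpoint path, not from the paper.
model_name = "path/to/llama-checkpoint"
tokenizer = LlamaTokenizer.from_pretrained(model_name)
model = LlamaForCausalLM.from_pretrained(model_name)

# Hypothetical Chinese tokens, e.g. learned by a separately trained
# SentencePiece model on a Chinese corpus.
new_chinese_tokens = ["你好", "世界", "语言", "模型"]
num_added = tokenizer.add_tokens(new_chinese_tokens)
print(f"Added {num_added} tokens; new vocabulary size = {len(tokenizer)}")

# Grow the input/output embedding matrices to match the extended vocabulary.
# The new rows are randomly initialized and must be learned during further
# pretraining, which is one reason a small token budget may favor keeping
# the original vocabulary instead.
model.resize_token_embeddings(len(tokenizer))
```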

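Key point 5 mentions multilingual joint training as a way to preserve English proficiency during transfer. The sketch below shows one simple way to interleave Chinese and English documents at a fixed sampling ratio; the ratio and the toy corpora are assumptions for illustration, not values from the paper.

```python
import random

def mix_corpora(chinese_docs, english_docs, chinese_ratio=0.8, seed=0):
    """Yield a training stream mixing Chinese and English documents.

    chinese_ratio is an illustrative mixing weight, not taken from the paper.
    """
    rng = random.Random(seed)
    zh_iter, en_iter = iter(chinese_docs), iter(english_docs)
    while True:
        source = zh_iter if rng.random() < chinese_ratio else en_iter
        try:
            yield next(source)
        except StopIteration:
            return  # stop once either corpus is exhausted

# Toy usage with in-memory corpora.
zh = ["中文文档一", "中文文档二", "中文文档三"]
en = ["english doc one", "english doc two"]
for doc in mix_corpora(zh, en, chinese_ratio=0.6):
    print(doc)
```
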
Summary

The research paper investigates the transfer of language capabilities in large language models (LLMs), focusing on LLaMA and its transfer to non-English languages. The study analyzes the impact of key factors, namely vocabulary extension, pretraining scale, and instruction tuning, on transfer, and uses standardized testing benchmarks to evaluate the model's knowledge level and response quality. The findings indicate that vocabulary extension may not be favorable for small-scale incremental pretraining and that effective transfer training requires balancing further pretraining with instruction tuning. The study also reveals that exclusive reliance on Chinese corpora for transfer training compromises LLaMA's original English proficiency.

It also demonstrates that comparable transfer performance to state-of-the-art models can be achieved with less than 1% of the pretraining data. Additionally, the study extends the transfer of LLaMA's capabilities to 13 low-resource languages, showing significant improvements in response quality and cross-lingual alignment. The results also reveal instances of code-switching during transfer training, suggesting that cross-lingual alignment is internalized within the model. The paper offers insights and guidance for developing non-English LLMs, addressing the resource gap and the cross-lingual effectiveness of LLMs, with implications for bridging the language capability gap in natural language processing and new perspectives for research on multilingual language models and code-switching.
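
The code-switching observation can be quantified with a simple script-level heuristic: flag generated outputs that contain both Chinese characters and Latin-script words. This is a rough illustrative check, not the detection method used in the paper.

```python
import re

CJK_RE = re.compile(r"[\u4e00-\u9fff]")   # CJK Unified Ideographs
LATIN_RE = re.compile(r"[A-Za-z]{2,}")    # runs of Latin letters

def is_code_switched(text: str) -> bool:
    """Rough heuristic: True if the text mixes Chinese characters and Latin words."""
    return bool(CJK_RE.search(text)) and bool(LATIN_RE.search(text))

outputs = [
    "这个模型的 attention 机制很强。",  # mixed Chinese/English -> True
    "The model performs well.",        # English only -> False
]
for text in outputs:
    print(is_code_switched(text), text)
```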

Reference: https://arxiv.org/abs/2401.01055