Key Points
1. Introduction and Objective: The technical report presents Aya 23, a family of multilingual language models developed to extend state-of-the-art natural language processing capabilities to roughly half of the world’s population. Aya 23 targets 23 languages, aiming for stronger per-language performance by allocating more model capacity to fewer languages than its predecessor, Aya 101, which supported 101 languages.
2. Model Architecture and Design: Aya 23 models are based on the Cohere Command series and incorporate architectural techniques such as parallel attention and feed-forward (FFN) layers, SwiGLU activation, rotary positional embeddings (RoPE), and grouped-query attention (GQA). These choices aim to enhance training efficiency, performance, and stability.
3. Pre-training and Fine-tuning: Aya 23 models were pre-trained using a data mixture that included texts from 23 languages and adopted a decoder-only Transformer architecture. Fine-tuning utilized a variety of multilingual datasets, combining human-annotated data, translated data, and synthetic data generated by other models.
4. Evaluation Framework: The performance of Aya 23 was evaluated using a comprehensive multilingual evaluation framework spanning discriminative tasks (e.g., XWinograd, XCOPA), general-purpose language understanding (multilingual MMLU), mathematical reasoning (MGSM), and generative tasks such as machine translation (FLORES) and summarization (XLSum).
5. Performance and Results: Aya 23 models, particularly the 8B and 35B parameter versions, demonstrated significant performance improvements over previous models like Aya 101 and other contemporaries such as Mistral and Gemma. For instance, they showed up to 41.6% better performance in multilingual MMLU and a 6.6x improvement in multilingual mathematical reasoning tasks.
6. Model Sizes and Usability: Aya 23 was released in two sizes: 8-billion (8B) and 35-billion (35B) parameters. The 8B model offers strong multilingual performance while remaining usable on consumer-grade hardware, while the 35B model achieves the highest overall results across the evaluation tasks.
7. Human and LLM Evaluations: In evaluation scenarios involving human judgments and LLM-simulated win-rates using GPT-4, the Aya 23 models consistently outperformed baselines like Aya 101. This was particularly evident in non-European languages, where models showed superior win rates in languages such as Turkish, Hindi, and Japanese.
8. Safety, Toxicity, and Bias Assessments: Aya 23 models were evaluated for safety using adversarial prompts from the multilingual AdvBench and for toxicity towards identity groups. Results showed a notable reduction in harmful responses and generally lower toxicity levels compared to Aya 101, although improvements are still needed for specific identity groups and languages.
9. Future Work and Ethical Considerations: While Aya 23 has significantly advanced the performance for 23 languages, the report acknowledges the existing limitations, especially in underrepresented languages. It underscores the need for continued efforts to improve language coverage and inclusivity, particularly in regions like Asia and Africa. The release aims to inspire more research and development in multilingual language technologies.
Summary
The research paper "Aya 23: Open Weight Releases to Further Multilingual Progress" (arXiv:2405.15032) introduces the Aya 23 family of multilingual language models, which complements the original Aya effort by pairing a highly performant pre-trained model with the recently released Aya collection of multilingual instruction data. Aya 23 covers 23 languages and aims to improve state-of-the-art language modeling capabilities for a significant portion of the world's population. The Aya 23 model is an experiment in allocating more capacity to fewer languages during pre-training, in contrast to the Aya 101 model, which covered 101 languages but suffered from the "curse of multilinguality."
Aya 23 Model Features and Enhancements
The researchers report that Aya 23 outperforms previous massively multilingual models like Aya 101, as well as widely used models like Gemma, Mistral, and Mixtral, across a range of discriminative and generative tasks. They release the open weights for both the 8B and 35B models as part of their commitment to expanding access to multilingual progress. The Aya 23 model family is based on the Cohere Command series models and uses a standard decoder-only Transformer architecture with a range of enhancements, such as SwiGLU activation, RoPE, and parallel attention and FFN layers.
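To make the architectural vocabulary above concrete, here is a minimal, self-contained PyTorch sketch of a decoder block that combines the named components: a shared pre-norm feeding parallel attention and SwiGLU feed-forward branches, rotary position embeddings applied to queries and keys, and grouped-query attention in which several query heads share one key/value head. The dimensions, head counts, and class names are illustrative assumptions for exposition only; they are not the actual Cohere Command configuration or the hyperparameters reported for Aya 23.

```python
# Illustrative sketch of a parallel-attention/FFN decoder block with SwiGLU,
# RoPE, and grouped-query attention. All sizes are placeholders, not Aya 23's.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rotate_half(x):
    """Pair the two halves of the last dimension: (a, b) -> (-b, a)."""
    a, b = x.chunk(2, dim=-1)
    return torch.cat((-b, a), dim=-1)


def apply_rope(q, k, base=10000.0):
    """Apply rotary position embeddings to (batch, heads, seq, head_dim) tensors."""
    seq_len, head_dim = q.shape[-2], q.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.einsum("s,d->sd", pos, inv_freq)   # (seq, head_dim/2)
    angles = torch.cat((angles, angles), dim=-1)       # (seq, head_dim)
    cos, sin = angles.cos(), angles.sin()
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class GroupedQueryAttention(nn.Module):
    """Causal self-attention where several query heads share one KV head."""
    def __init__(self, dim, n_heads, n_kv_heads):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, s, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = apply_rope(q, k)
        # Each KV head serves n_heads / n_kv_heads query heads.
        repeat = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(repeat, dim=1), v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, s, -1))


class ParallelDecoderBlock(nn.Module):
    """Parallel attention + FFN: both branches read the same normalized input."""
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, ffn_hidden=1536):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = GroupedQueryAttention(dim, n_heads, n_kv_heads)
        self.ffn = SwiGLU(dim, ffn_hidden)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h) + self.ffn(h)


if __name__ == "__main__":
    block = ParallelDecoderBlock()
    tokens = torch.randn(2, 16, 512)   # (batch, sequence, model dim)
    print(block(tokens).shape)          # torch.Size([2, 16, 512])
```

The parallel formulation computes the attention and feed-forward branches from the same normalized input and sums both into the residual stream, which is what gives it its training-efficiency appeal relative to the sequential attention-then-FFN arrangement.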
The paper presents detailed analyses of the Aya 23 model family's performance across various tasks, including discriminative tasks, general-purpose language understanding, multilingual mathematical reasoning, machine translation, summarization, preference evaluation, safety, toxicity, and bias. The evaluation framework combines zero-shot and 5-shot settings and includes comparisons with other massively multilingual models. The paper also emphasizes the importance of advancing multilingual technologies to better reflect the world's linguistic diversity.
The human evaluation results demonstrate that the Aya 23 family consistently outperforms the Aya-101-13B model, with Aya-23-8B achieving an average win rate of 50.8% against Aya-101-13B across languages. Additionally, the paper discusses the safety, toxicity, and bias analyses, highlighting the lower expected toxicity and reduced harmful responses of the Aya 23 models compared to the Aya-101-13B model.
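As a concrete illustration of how win rates of this kind are typically computed, the sketch below aggregates pairwise preference verdicts (from human annotators or an LLM judge such as GPT-4) into per-language win rates. The verdict records, the tie-splitting convention, and the `win_rates` helper are assumptions for illustration only; they are not the report's evaluation code or its data.

```python
# Sketch: turn pairwise preference verdicts into per-language win rates.
# The sample judgments below are invented placeholders, not results from the report.
from collections import defaultdict


def win_rates(judgments):
    """judgments: iterable of (language, verdict) pairs, where verdict is
    'win', 'loss', or 'tie' for the model being evaluated. Ties are counted
    as half a win here; that convention is an assumption."""
    counts = defaultdict(lambda: {"win": 0.0, "total": 0.0})
    for lang, verdict in judgments:
        counts[lang]["total"] += 1
        if verdict == "win":
            counts[lang]["win"] += 1
        elif verdict == "tie":
            counts[lang]["win"] += 0.5
    return {lang: c["win"] / c["total"] for lang, c in counts.items()}


# Hypothetical judgments comparing one model's completions against a baseline.
sample = [
    ("Turkish", "win"), ("Turkish", "win"), ("Turkish", "loss"),
    ("Hindi", "win"), ("Hindi", "tie"),
    ("Japanese", "win"), ("Japanese", "loss"),
]
per_language = win_rates(sample)
average = sum(per_language.values()) / len(per_language)
print(per_language, f"average win rate: {average:.1%}")
```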
Limitations and Future Directions
The paper acknowledges the limitations of the Aya 23 model family, particularly in coverage and performance for underrepresented languages and regions. It also outlines future directions for improving language inclusivity and addressing cultural and linguistic nuances so that language technologies are equitable and effective for all. The researchers express their commitment to supporting future research and to the broader mission of advancing multilingual technologies.
In summary, the paper provides a comprehensive overview of the development, performance, and considerations of the Aya 23 family of multilingual language models, emphasizing their significant impact on improving multilingual language modeling capabilities and addressing the need for more equitable language technologies.
Reference: https://arxiv.org/abs/2405.15032