Key Points
- OpenELM outperforms comparable-sized LLMs pretrained on publicly available datasets; in particular, it surpasses the recent open LLM OLMo by 2.36% in average accuracy while requiring 2× fewer pre-training tokens.
- It uses layer-wise scaling to allocate parameters non-uniformly across the layers of the transformer, improving accuracy for a given parameter budget.
- The release includes the complete framework for training and evaluating the model on publicly available datasets, with training logs, multiple checkpoints, and pre-training configurations, aiming to strengthen the open research community.
- Pre-training is conducted using public datasets, including RefinedWeb, deduplicated PILE, a subset of RedPajama, and a subset of Dolma v1.6, totaling approximately 1.8 trillion tokens.
- OpenELM adopts a decoder-only transformer architecture with pre-normalization (RMSNorm), rotary positional embeddings (RoPE), grouped-query attention, SwiGLU FFNs, flash attention, and the same tokenizer as LLaMA (a SwiGLU sketch appears after this list).
- It allocates parameters non-uniformly across layers via layer-wise scaling, adjusting the number of attention heads and the FFN multiplier in each transformer layer (see the layer-wise scaling sketch after this list).
- Accuracy increases with longer training across most tasks, and a checkpoint obtained by averaging the last five checkpoints is comparable to or slightly better than the final checkpoint after 350k iterations (a checkpoint-averaging sketch follows this list).
- Instruction tuning consistently improves OpenELM's average accuracy by 1-2% across different evaluation frameworks, and parameter-efficient fine-tuning methods such as LoRA and DoRA can be applied to OpenELM with comparable accuracy (see the LoRA sketch after this list).
- Benchmarking shows that OpenELM's inference is currently slower than OLMo's, leaving substantial room for future optimization of inference efficiency.
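
The SwiGLU FFN mentioned in the architecture bullet can be summarized in a few lines of PyTorch. This is a minimal, generic sketch of a SwiGLU feed-forward block, not OpenELM's exact implementation; dimensions are illustrative, and the surrounding components (pre-normalization, RoPE, grouped-query attention) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Generic SwiGLU feed-forward block: SiLU-gated linear unit followed by a down projection."""
    def __init__(self, d_model, ffn_dim):
        super().__init__()
        self.w_gate = nn.Linear(d_model, ffn_dim, bias=False)  # gating branch
        self.w_up = nn.Linear(d_model, ffn_dim, bias=False)    # value branch
        self.w_down = nn.Linear(ffn_dim, d_model, bias=False)  # projection back to d_model

    def forward(self, x):
        # SiLU(x W_gate) elementwise-multiplied with (x W_up), then projected down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 8, 512)          # (batch, sequence, d_model), illustrative sizes
print(SwiGLU(512, 1376)(x).shape)   # torch.Size([2, 8, 512])
```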
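The layer-wise scaling bullet can also be made concrete. The sketch below shows the core idea under simple assumptions: the number of attention heads and the FFN width are interpolated linearly from the first to the last transformer layer instead of being kept uniform. The scaling bounds and model dimensions here are illustrative, not the paper's exact hyperparameters.

```python
# Minimal sketch of layer-wise scaling: per-layer head counts and FFN widths
# grow linearly with depth (illustrative values, not OpenELM's exact settings).

def layerwise_config(num_layers, d_model, head_dim,
                     alpha_min=0.5, alpha_max=1.0,   # scales the number of attention heads
                     beta_min=0.5, beta_max=4.0):    # scales the FFN hidden multiplier
    configs = []
    for i in range(num_layers):
        t = i / (num_layers - 1)                     # 0.0 for the first layer, 1.0 for the last
        alpha = alpha_min + (alpha_max - alpha_min) * t
        beta = beta_min + (beta_max - beta_min) * t
        num_heads = max(1, int(alpha * d_model / head_dim))
        ffn_dim = int(beta * d_model)
        configs.append({"layer": i, "num_heads": num_heads, "ffn_dim": ffn_dim})
    return configs

# Example: a 16-layer model with d_model=1280 and 64-dim heads.
for cfg in layerwise_config(16, 1280, 64):
    print(cfg)
```

Shallow layers get fewer heads and narrower FFNs, while deeper layers get more capacity, which is how the same total parameter budget is redistributed across depth.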
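The checkpoint-averaging result can be reproduced with a simple element-wise average of saved weights. This is a hedged sketch assuming checkpoints are plain PyTorch state dicts with hypothetical file names; it is not the paper's exact tooling.

```python
import torch

def average_checkpoints(paths):
    """Element-wise average of the tensors in several saved state_dicts."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical usage: average the last five checkpoints saved every 1k iterations.
# merged = average_checkpoints([f"ckpt_{i}.pt" for i in range(346000, 351000, 1000)])
# torch.save(merged, "ckpt_averaged.pt")
```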
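For the parameter-efficient fine-tuning bullet, the sketch below shows how LoRA can be attached to a causal LM with the Hugging Face `peft` library. The model identifier and `target_modules` names are placeholders/assumptions that must be adapted to the actual checkpoint's layer names; this illustrates the general PEFT workflow, not the paper's exact fine-tuning recipe.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder model id; substitute the actual OpenELM checkpoint path or hub id.
model = AutoModelForCausalLM.from_pretrained(
    "path-or-hub-id-of-an-openelm-checkpoint", trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=8,                          # low-rank adapter dimension
    lora_alpha=16,                # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["qkv_proj"],  # assumption: adjust to the model's attention projection names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter weights remain trainable
```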
Summary
The research paper presents OpenELM, a new language model designed for efficient parameter allocation within the transformer. Compared with other publicly available language models (LLMs), OpenELM outperforms comparable-sized models pretrained on publicly available datasets, achieving an average accuracy improvement of 2.36% over OLMo while requiring 2× fewer pre-training tokens. The paper also reports results on the tasks of the OpenLLM leaderboard, where OpenELM performs strongly despite being pretrained on less data.
The release of OpenELM is positioned as an effort to advance open research and transparency in language models. OpenELM's open-source framework includes training logs, multiple checkpoints, pre-training configurations, and MLX inference code. The researchers argue that the reproducibility and transparency of large language models are crucial for advancing open research, ensuring trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks.
The paper details the framework and methodology used in developing OpenELM, including its layer-wise scaling strategy for efficient parameter allocation across layers. The authors also describe the training process on public datasets and the decoder-only transformer architecture. Additionally, the paper evaluates OpenELM across various tasks and benchmarks, demonstrating improved accuracy over existing models. The study also benchmarks inference on different hardware and software stacks, including GPU inference compiled with Torch Inductor and Apple devices using the Apple MLX library.
Overall, the research paper highlights the significance of OpenELM in advancing open research and emphasizes the need for thorough safety testing and appropriate filtering mechanisms when using language models for specific applications.
Reference: https://arxiv.org/abs/2404.146...