Key Points

1. The paper introduces OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion total parameters but activates only 1 billion of them per input token.

2. OLMoE-1B-7B outperforms all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B.

3. The paper presents controlled experiments on MoE training choices, analyzes routing in the trained model, and shows that the experts are highly specialized.

4. All aspects of the work are open-sourced: model weights, training data, code, and logs.

5. The paper finds that MoEs train around 2x faster than dense LMs with equivalent active parameters.

6. A critical design decision for making MoEs performant is fine-grained routing over granular experts: each MoE layer uses 64 small experts, of which 8 are activated per token (see the routing sketch after this list).

7. The authors find that dropless token-choice routing outperforms expert-choice routing, and that adding a shared, always-active expert is less effective than using only routed experts.

8. Analysis shows that routing saturates early in pretraining, experts are rarely co-activated, and experts exhibit domain and vocabulary specialization.

9. The fully open release of OLMoE aims to facilitate more research and analysis to improve understanding of MoE-based language models.
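To make the fine-grained routing in point 6 concrete, here is a minimal sketch of dropless top-k token-choice routing over 64 granular experts with 8 activated per token. It is written in PyTorch with placeholder layer sizes and is not the authors' implementation; OLMoE's actual feed-forward design and probability handling may differ.

```python
# Minimal sketch of dropless top-k token-choice routing (not the OLMoE code).
# d_model and d_ff are hypothetical placeholders, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Many small ("granular") feed-forward experts instead of a few large ones.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                    # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)            # (n_tokens, n_experts)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Dropless: every token is processed by all k of its selected experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoELayer()(tokens).shape)  # torch.Size([16, 512])
```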

Summary

OLMoE's Architecture and Training
The paper introduces OLMoE, a fully open and state-of-the-art language model that leverages a sparse Mixture-of-Experts (MoE) architecture. OLMoE-1B-7B has 7 billion total parameters but uses only 1 billion active parameters per input token, making it much more efficient than dense models of similar size. The researchers pretrain OLMoE-1B-7B on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct, which outperforms larger models like Llama2-13B-Chat and DeepSeekMoE-16B on various benchmarks.
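As a rough illustration of the total-versus-active parameter distinction, the back-of-the-envelope sketch below counts the feed-forward parameters of a single MoE layer under assumed sizes (d_model, d_ff, and the per-expert layout are placeholders, not OLMoE-1B-7B's actual configuration). Attention and embedding parameters are dense and always active, so the model-wide ratio is smaller than this per-layer FFN ratio.

```python
# Back-of-the-envelope count of total vs. active parameters in one MoE FFN layer.
# The sizes below are illustrative placeholders, not OLMoE-1B-7B's actual config.
d_model, d_ff = 2048, 1024        # hidden size and per-expert FFN width (assumed)
n_experts, top_k = 64, 8          # experts per layer; experts activated per token

params_per_expert = 2 * d_model * d_ff           # up- and down-projection weights
total_ffn_params = n_experts * params_per_expert
active_ffn_params = top_k * params_per_expert    # only the routed experts run per token

print(f"total FFN params per layer:  {total_ffn_params:,}")   # 268,435,456
print(f"active FFN params per token: {active_ffn_params:,}")  # 33,554,432 (8x fewer)
```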

The paper provides details on the MoE training process, including the use of fine-grained experts, dropless token-choice routing, a load balancing loss, and a router z-loss. The researchers experiment with alternative design choices for MoEs, such as the number of experts, whether to use a shared expert, and different routing algorithms. They find that using many small experts with token-choice routing outperforms fewer, larger experts and expert-choice routing.
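The snippet below sketches the two auxiliary router losses named above, following the commonly used formulations (Switch-Transformer-style load balancing and an ST-MoE-style router z-loss). It is an illustrative implementation under those assumptions, not the paper's code, and OLMoE's exact scaling factors and loss weights may differ.

```python
# Sketch of the two auxiliary router losses mentioned above (PyTorch).
# Formulations follow common practice (Switch-Transformer-style load balancing,
# ST-MoE-style router z-loss); OLMoE's exact details and weights may differ.
import torch
import torch.nn.functional as F

def router_aux_losses(router_logits, top_k=8):
    """router_logits: (n_tokens, n_experts) pre-softmax scores from the router."""
    n_tokens, n_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)

    # Load balancing: fraction of routing slots per expert (f) times the mean
    # router probability per expert (p), scaled by the number of experts.
    topk_idx = probs.topk(top_k, dim=-1).indices                  # (n_tokens, top_k)
    dispatch = F.one_hot(topk_idx, n_experts).float().sum(dim=1)  # (n_tokens, n_experts)
    f = dispatch.mean(dim=0) / top_k
    p = probs.mean(dim=0)
    load_balance_loss = n_experts * (f * p).sum()

    # Router z-loss: penalizes large router logits to keep the softmax stable.
    z_loss = torch.logsumexp(router_logits, dim=-1).pow(2).mean()
    return load_balance_loss, z_loss

logits = torch.randn(16, 64)
print(router_aux_losses(logits))
```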

Analysis of OLMoE's Behavior
The authors also analyze the behavior of OLMoE, investigating router saturation, expert co-activation, domain specialization, and vocabulary specialization. They find that routing in OLMoE saturates early in training, with the same 8 experts being used for a given input across most of pretraining. The experts exhibit strong specialization, with certain experts focusing on particular domains like GitHub and arXiv, as well as on specific vocabulary terms.
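As an example of the kind of analysis described here, the sketch below computes an expert co-activation matrix from top-8 routing decisions. The routing indices are random placeholders standing in for a trained router's output, and the normalization is one reasonable choice rather than the paper's exact definition.

```python
# Sketch of an expert co-activation analysis over a batch of routing decisions.
# topk_idx would come from a trained router; here it is random, for illustration only.
import torch

n_tokens, n_experts, top_k = 10_000, 64, 8
topk_idx = torch.randint(0, n_experts, (n_tokens, top_k))   # placeholder routing

# One-hot activation per token, then count how often pairs of experts fire together.
active = torch.zeros(n_tokens, n_experts)
active.scatter_(1, topk_idx, 1.0)
coactivation = active.T @ active            # (n_experts, n_experts) pair counts
coactivation.fill_diagonal_(0)              # ignore an expert co-firing with itself

# Normalize column j by how often expert j is active to get a co-activation rate.
rate = coactivation / active.sum(dim=0, keepdim=True).clamp(min=1)
print(rate.max().item())   # uniformly low values mean experts rarely co-activate
```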

Implications of OLMoE-1B-7B
Overall, the paper demonstrates that OLMoE-1B-7B, a model with only 1 billion active parameters, can outperform much larger dense models, highlighting the potential of sparse MoE architectures. The researchers open-source the model weights, training data, code, and logs, enabling further research into improving cost-efficient and high-performing language models.

Reference: https://arxiv.org/abs/2409.020...