Key Points

1. The paper presents the first comprehensive survey of small language models (SLMs) with 100M-5B parameters, analyzing their capabilities, runtime costs, and technical innovations.

2. SLMs have been attracting growing attention from both research and industry, with a marked surge in the number of released models since late 2023.

3. Analysis of SLM architectures shows a trend toward grouped-query attention, gated FFNs with SiLU activation, and RMS normalization, reflecting gradual, incremental architectural innovation.

4. The quality of the pre-training dataset plays a crucial role in SLM capabilities, with models trained on recent datasets such as DCLM and FineWeb-Edu showing superior performance.

5. SLMs are often "over-trained" on large amounts of data (typically over 1.5 trillion tokens), exceeding the optimal ratio suggested by the Chinchilla law.

6. SLMs have demonstrated significant performance improvements across various language tasks from 2022 to 2024, outpacing the progress of the LLaMA-7B series.

7. In-context learning capability varies across different tasks, with larger SLMs generally exhibiting stronger in-context learning abilities.

8. Model architecture, quantization methods, and hardware all have significant impacts on SLM runtime latency and memory usage, highlighting the importance of co-design.

9. The paper discusses potential research directions, including co-design of SLM architecture and hardware, high-quality synthetic dataset construction, deployment-aware model scaling, and device-cloud collaboration.

Summary

This paper presents a comprehensive survey and analysis of small language models (SLMs) with 100M-5B parameters. SLMs have become widely adopted in smart devices, despite receiving far less academic attention than their large language model (LLM) counterparts. The paper examines 59 state-of-the-art open-source SLMs, analyzing their technical innovations across architectures, training datasets, and training algorithms.

SLM Architectures, Datasets, and Training

The analysis of SLM architectures reveals several clear trends. As of 2024, typical SLM configurations use grouped-query attention, gated feed-forward networks with SiLU activation, and RMS normalization. The choice of these settings is largely empirical, without rigorous validation of their superiority, and architectural innovations beyond the vanilla transformer remain limited; their significance is still to be explored.

On training datasets, the paper observes that while the Pile was initially the most widely used corpus, more recently proposed datasets such as RefinedWeb and RedPajama have gained prominence. Assessing dataset quality by the performance of SLMs trained on each corpus shows that model-based data filtering, as used in DCLM and FineWeb-Edu, can significantly improve SLM capabilities. The paper also finds that SLMs are often "over-trained" on massive amounts of data (more than 1.5 trillion tokens) relative to the compute-optimal ratio suggested by the Chinchilla law.
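
To put the scale of this over-training in perspective, the short calculation below applies the commonly cited Chinchilla heuristic of roughly 20 training tokens per parameter; the model size and token budget are illustrative assumptions, not figures taken from the paper.

```python
# Back-of-the-envelope check of the "over-training" observation, assuming the
# commonly cited Chinchilla heuristic of ~20 training tokens per parameter.
# The model size and token budget below are hypothetical, not from the survey.
params = 1.5e9                       # a hypothetical 1.5B-parameter SLM
chinchilla_tokens = 20 * params      # ~30B tokens would be compute-optimal
actual_tokens = 1.5e12               # >1.5T tokens, a typical SLM training budget

print(f"Chinchilla-optimal budget: ~{chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Typical SLM budget:        {actual_tokens / 1e12:.1f}T tokens "
      f"(~{actual_tokens / chinchilla_tokens:.0f}x compute-optimal)")
```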
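
Returning to the architectural trends noted at the start of this section, the sketch below shows how grouped-query attention, a SiLU-gated feed-forward network, and RMS normalization fit together in a single pre-norm decoder block. It is a minimal illustration of these common design choices, not code from the paper or any particular model; the dimensions are hypothetical, and rotary position embeddings, KV caching, and dropout are omitted for brevity.

```python
# Minimal sketch of a "typical" 2024-era SLM decoder block: RMS normalization,
# grouped-query attention (GQA), and a SiLU-gated FFN. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight


class GroupedQueryAttention(nn.Module):
    """Causal self-attention where groups of query heads share one K/V head."""
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Duplicate each K/V head across its query-head group (the KV saving of GQA).
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))


class GatedFFN(nn.Module):
    """Gated feed-forward network with SiLU activation (SwiGLU-style)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class SLMDecoderBlock(nn.Module):
    """Pre-norm residual block combining the three components above."""
    def __init__(self, dim=2048, n_heads=16, n_kv_heads=4, ffn_hidden=5632):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = GroupedQueryAttention(dim, n_heads, n_kv_heads)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = GatedFFN(dim, ffn_hidden)

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))   # attention sub-layer
        return x + self.ffn(self.ffn_norm(x))  # feed-forward sub-layer


if __name__ == "__main__":
    block = SLMDecoderBlock()
    x = torch.randn(1, 16, 2048)   # (batch, sequence length, hidden size)
    print(block(x).shape)          # torch.Size([1, 16, 2048])
```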

Capabilities of SLMs

The paper then evaluates the capabilities of SLMs across various tasks, including commonsense reasoning, problem-solving, and mathematics. The results show substantial performance improvements in SLMs from 2022 to 2024, outpacing the progress of the LLaMA-7B series. While models such as the Phi family, trained on proprietary and undisclosed datasets, demonstrate state-of-the-art performance, fully open-source SLMs are also closing the gap, particularly in commonsense reasoning tasks.

Runtime Costs

The paper also provides a detailed analysis of the runtime costs of SLMs, including inference latency and memory usage, on edge devices such as the NVIDIA Jetson Orin module and smartphones. The results highlight the importance of co-designing SLM architectures with the target hardware, as factors such as the attention mechanism, feed-forward network configuration, and vocabulary size can significantly affect runtime performance. The paper also examines various quantization methods and their trade-offs in reducing latency and memory footprint; a simplified illustration of weight quantization appears at the end of this section.

In conclusion, the paper offers valuable insights into the evolving landscape of SLMs and identifies several promising research directions, including co-design of SLM architectures and device processors, construction of high-quality synthetic datasets, development of deployment-aware model scaling strategies, and exploration of sparse SLM architectures.
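
As a concrete but deliberately simplified illustration of the memory side of this trade-off, the sketch below applies symmetric per-tensor int8 quantization to a hypothetical weight matrix. It is not the paper's measurement setup or any particular quantization library, and it ignores per-channel scales, activation quantization, and the hardware-dependent latency effects the paper measures.

```python
# Minimal illustration of why low-bit weight quantization shrinks memory:
# symmetric per-tensor int8 quantization of one hypothetical weight matrix.
import numpy as np


def quantize_int8(w: np.ndarray):
    """Map float32 weights to int8 with a single shared scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


w = np.random.randn(2048, 2048).astype(np.float32)   # hypothetical projection weight
q, scale = quantize_int8(w)

print(f"fp32 size: {w.nbytes / 2**20:.1f} MiB")       # ~16 MiB
print(f"int8 size: {q.nbytes / 2**20:.1f} MiB")       # ~4 MiB (4x smaller)
print(f"mean abs error: {np.abs(w - dequantize(q, scale)).mean():.5f}")
```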

Reference: https://arxiv.org/abs/2409.15790