Key Points

1. The paper addresses the challenge of efficiently running large language models (LLMs) that exceed the available DRAM capacity by storing the model parameters on flash memory and loading them into DRAM on demand.

2. The proposed method involves constructing an inference cost model consistent with flash memory behavior, which guides optimization in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.

3. The paper introduces two principal techniques within the flash memory-informed framework: "windowing" to strategically reduce data transfer by reusing previously activated neurons, and "row-column bundling" to increase the size of data chunks read from flash memory.

4. The methods collectively enable running models up to twice the size of the available DRAM and achieve a 4-5x and 20-25x increase in inference speed compared to naive loading approaches on CPU and GPU, respectively.

5. The paper also explores the characteristics of memory storage systems and their implications for LLM inference, elucidating the challenges and hardware-specific considerations essential for algorithm design, particularly in optimizing inference when working with flash memory.

6. The study presents a detailed analysis of predictors, window configurations, latency, and related methodological and hardware optimizations to achieve efficient inference for large language models on devices with limited memory.

Summary

Introduction and Key Contributions
The paper addresses the challenge of efficiently running large language models (LLMs) whose parameters exceed the available DRAM capacity by storing the model parameters on flash memory and loading them into DRAM on demand during inference. The proposed method constructs an inference cost model consistent with flash memory behavior and optimizes in two key areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Two principal techniques, "windowing" and "row-column bundling", are introduced to achieve these optimizations. The paper demonstrates that these methods enable running models up to twice the size of the available DRAM, with a significant increase in inference speed compared to naive loading on CPU and GPU. The integration of sparsity awareness, context-adaptive loading, and hardware-oriented design is shown to pave the way for effective inference of LLMs on devices with limited memory.
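
The cost model itself is not reproduced in this summary, but the two levers it exposes can be sketched as follows. The throughput curve, chunk sizes, and byte counts below are illustrative assumptions, not figures from the paper; the sketch only shows why shrinking the transferred volume and enlarging the chunk size both reduce load latency.

```python
# A minimal sketch of a flash-aware load-latency model.
# All bandwidth and size numbers are illustrative assumptions,
# not measurements from the paper.

def flash_read_throughput_gbps(chunk_kib):
    """Assumed throughput curve: flash reads get faster with larger,
    more contiguous chunks, saturating for very large chunks."""
    peak = 6.0          # assumed peak sequential throughput (GB/s)
    half_point = 128.0  # assumed chunk size (KiB) at half of peak
    return peak * chunk_kib / (chunk_kib + half_point)

def load_latency_ms(bytes_to_load, chunk_kib):
    """Latency to bring `bytes_to_load` bytes from flash into DRAM
    when data is read in chunks of `chunk_kib` KiB."""
    throughput = flash_read_throughput_gbps(chunk_kib) * 1e9  # bytes/s
    return bytes_to_load / throughput * 1e3

# The two optimization levers, in cost-model terms:
# (1) reduce bytes_to_load (windowing / sparsity-aware loading),
# (2) increase chunk_kib (row-column bundling).
naive = load_latency_ms(bytes_to_load=2e9, chunk_kib=16)       # all weights, small chunks
optimized = load_latency_ms(bytes_to_load=2e8, chunk_kib=256)  # ~10% of weights, bundled chunks
print(f"naive: {naive:.1f} ms, optimized: {optimized:.1f} ms")
```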


Leveraging Sparsity and Selective Loading

The method leverages the sparsity observed in the FeedForward Network (FFN) layers of LLMs to load from flash only those parameters that correspond to non-zero inputs or are predicted to produce non-zero outputs. The "windowing" technique strategically reduces data transfer by reusing neurons activated for recent tokens, while "row-column bundling" increases the size of data chunks read from flash memory, playing to flash memory's strength in sequential access.
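
A minimal sketch of the windowing idea, assuming a per-token predictor that supplies the set of FFN neurons expected to be active; the class, window size, and helper names are illustrative, not the paper's implementation:

```python
# A sketch of "windowing": keep the FFN neurons used for the last
# `window_size` tokens resident in DRAM, and fetch from flash only the
# neurons that are newly predicted active for the current token.
from collections import deque

class NeuronWindow:
    def __init__(self, window_size):
        self.window_size = window_size
        self.history = deque()   # active-neuron sets for recent tokens
        self.resident = set()    # neuron ids currently held in DRAM

    def step(self, predicted_active):
        """Given the neurons predicted active for the current token,
        return (to_load, to_evict): ids to fetch from flash and ids to drop."""
        to_load = predicted_active - self.resident
        self.history.append(set(predicted_active))
        if len(self.history) > self.window_size:
            self.history.popleft()
        needed = set().union(*self.history)            # union over the window
        to_evict = (self.resident | to_load) - needed
        self.resident = needed
        return to_load, to_evict

# Toy usage: only the delta is loaded at each step.
window = NeuronWindow(window_size=2)
for active in [{1, 2, 3}, {2, 3, 7}, {3, 7, 9}]:
    load, evict = window.step(active)
    print(f"load {sorted(load)}, evict {sorted(evict)}")
```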


Hardware-Specific Considerations

The paper also discusses the hardware-specific considerations essential for algorithm design, particularly when optimizing inference against flash memory. Flash memory performs best with large sequential reads, and data already loaded into DRAM must be managed efficiently to minimize overhead. Strategies for reducing latency under memory constraints are accordingly grouped into three areas: reducing the amount of data loaded, optimizing the size of data chunks, and managing loaded data efficiently.
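
Row-column bundling can be pictured as a storage layout in which, for each FFN neuron, one column of the up-projection and the corresponding row of the down-projection are stored contiguously, so a single sequential read fetches both. The toy dimensions, dtype, file format, and function names below are assumptions for illustration, not the paper's on-disk layout:

```python
# A sketch of "row-column bundling": for each FFN neuron i, the i-th column
# of the up-projection and the i-th row of the down-projection are stored
# contiguously, so one sequential read fetches everything needed for that neuron.
import numpy as np

d_model, d_ffn = 64, 256        # toy sizes, far smaller than a real model
dtype = np.float16
bundle_elems = 2 * d_model      # one up-projection column + one down-projection row

def write_bundled(path, w_up, w_down):
    """w_up: (d_model, d_ffn), w_down: (d_ffn, d_model) -> one record per neuron."""
    bundles = np.concatenate([w_up.T, w_down], axis=1)    # shape (d_ffn, 2*d_model)
    bundles.astype(dtype).tofile(path)

def read_neurons(path, neuron_ids):
    """Fetch only the requested neurons, each with a single contiguous read."""
    mm = np.memmap(path, dtype=dtype, mode="r", shape=(d_ffn, bundle_elems))
    out = {}
    for i in neuron_ids:
        bundle = np.array(mm[i])                          # contiguous 2*d_model values
        out[i] = (bundle[:d_model], bundle[d_model:])     # (up column, down row)
    return out

rng = np.random.default_rng(0)
w_up = rng.standard_normal((d_model, d_ffn))
w_down = rng.standard_normal((d_ffn, d_model))
write_bundled("ffn_bundles.bin", w_up, w_down)
loaded = read_neurons("ffn_bundles.bin", [3, 17, 42])
print({i: (u.shape, d.shape) for i, (u, d) in loaded.items()})
```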


Experimental Validation

The authors experimentally demonstrate the effectiveness of their method on the OPT 6.7B and Falcon 7B models, showing substantial improvements in latency and efficiency over baseline approaches. They analyze the impact of predictors, window configuration, and selective weight loading on inference performance across different hardware setups.
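
The predictors themselves are not described in this summary; one plausible form, a small low-rank classifier over the layer input that flags which FFN neurons are likely to be non-zero, is sketched below. The rank, threshold, and random initialization are illustrative assumptions rather than the paper's exact design:

```python
# A sketch of a low-rank activation-sparsity predictor: two small matrices
# score each FFN neuron from the layer input, and neurons scoring above a
# threshold are treated as "likely active" and loaded from flash.
import numpy as np

class LowRankPredictor:
    def __init__(self, d_model, d_ffn, rank=32, threshold=0.5, seed=0):
        rng = np.random.default_rng(seed)
        # In practice these factors would be trained to mimic the true
        # activation pattern of the FFN; random weights here only show the data flow.
        self.a = rng.standard_normal((d_model, rank)) / np.sqrt(d_model)
        self.b = rng.standard_normal((rank, d_ffn)) / np.sqrt(rank)
        self.threshold = threshold

    def predict_active(self, x):
        """x: (d_model,) layer input. Returns ids of neurons predicted active."""
        scores = 1.0 / (1.0 + np.exp(-(x @ self.a @ self.b)))  # sigmoid score per neuron
        return set(np.flatnonzero(scores > self.threshold).tolist())

d_model, d_ffn = 64, 256   # toy sizes
pred = LowRankPredictor(d_model, d_ffn)
x = np.random.default_rng(1).standard_normal(d_model)
active = pred.predict_active(x)
print(f"{len(active)} of {d_ffn} neurons predicted active")
```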


In conclusion, the paper proposes algorithmic techniques to minimize weight loading from flash memory during LLM inference, demonstrating substantial speedups on CPU and GPU. The authors emphasize the significance of their work in enabling efficient inference of LLMs on devices with limited memory and highlight the potential for further improvements in this area.

Reference: https://arxiv.org/abs/2312.11514