Key Points

- Auto-regressive decoding is the de facto standard for large language models (LLMs) but is slow and costly. Speculative sampling methods accelerate it by splitting generation into a draft stage and a verification stage, significantly improving speed while preserving the target model's output distribution.
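The draft/verify split can be illustrated with a toy greedy sketch (the `draft_next` and `target_next` functions below are hypothetical stand-ins, not the paper's actual models): a cheap model proposes several tokens, the target model checks them, and the longest agreeing prefix is accepted, so the output matches what the target alone would have produced.

```python
def draft_next(ctx):
    # Hypothetical cheap draft model: next token is last token + 1 (mod 10).
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Hypothetical expensive target model: same rule, except after a 4 it emits 0.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    # Draft stage: propose k tokens auto-regressively with the cheap model.
    draft, work = [], list(ctx)
    for _ in range(k):
        t = draft_next(work)
        draft.append(t)
        work.append(t)
    # Verification stage: the target scores each position (in practice one
    # parallel forward pass) and accepts the longest matching prefix.
    work = list(ctx)
    for t in draft:
        want = target_next(work)
        work.append(want)
        if t != want:      # first mismatch: keep the target's token, stop
            break
    else:
        # All drafts accepted: the same pass yields one bonus target token.
        work.append(target_next(work))
    return work

print(speculative_step([1]))  # -> [1, 2, 3, 4, 0]
```

Multiple tokens per target forward pass is the source of the speedup; the correction at the first mismatch is what keeps the output identical to plain decoding in this greedy setting.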

- EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a simple framework for lossless acceleration: it runs the drafting process auto-regressively at the second-top-layer feature level and resolves sampling uncertainty by feeding in the token sampled one time step ahead. It requires no fine-tuning of the target LLM and preserves the same output distribution as vanilla auto-regressive decoding.
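A minimal sketch of this drafting input, with illustrative dimensions and a toy linear "draft head" (assumptions for exposition, not the paper's architecture or weights): the draft model consumes the current second-top-layer feature f_t concatenated with the embedding of the token already sampled for position t+1, and extrapolates the next feature f_{t+1}. Conditioning on the sampled token tells the draft model which branch sampling actually took.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                                 # toy feature/embedding width
W = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)  # toy linear draft head

def draft_feature(f_t, next_token_emb):
    """Extrapolate f_{t+1} from the concatenation [f_t ; emb(token_{t+1})]."""
    x = np.concatenate([f_t, next_token_emb])  # shape (2d,)
    return x @ W                               # shape (d,): predicted f_{t+1}

f_t = rng.standard_normal(d)      # current second-top-layer feature
emb = rng.standard_normal(d)      # embedding of the token sampled one step ahead
f_next = draft_feature(f_t, emb)
print(f_next.shape)  # (8,)
```

The predicted feature can then be passed through the target model's (frozen) LM head to sample the next draft token, and fed back in to continue drafting auto-regressively.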

- EAGLE is the fastest known framework in the speculative sampling family, achieving roughly 3x the speed of vanilla decoding, 2x that of Lookahead, and 1.6x that of Medusa.

- The key to acceleration in speculative sampling is reducing drafting overhead while raising the rate at which the target LLM accepts the draft. EAGLE keeps training costs low: its draft model, a single decoder layer, trains in 1-2 days on 4x A100 GPUs.

- EAGLE's central innovation is predicting features rather than tokens, which yields higher draft accuracy and better speedup ratios than token-level drafting.

- Tree attention increases the average acceptance length and the speedup ratio, and the acceleration carries over to batch sizes greater than 1.
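A sketch of the tree-attention idea on a toy draft tree (the tree shape and sizes here are illustrative, not EAGLE's actual configuration): several draft branches are packed into one sequence, and an ancestor mask lets each node attend only to its own path back to the root, so the target LLM can verify all branches in a single forward pass.

```python
def tree_mask(parents):
    """Build an attention mask for a packed draft tree.

    parents[i] is the parent index of node i (-1 for the root).
    mask[i][j] is True iff node i may attend to node j, i.e. j is an
    ancestor of i (or i itself).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root
            mask[i][j] = True
            j = parents[j]
    return mask

# Toy tree: root -> {A, B}; A -> {C}. Two branches share the root token.
parents = [-1, 0, 0, 1]
for row in tree_mask(parents):
    print("".join("x" if v else "." for v in row))
```

Because sibling nodes cannot attend to each other, each root-to-leaf path behaves like an independent draft sequence, and the longest accepted path is kept after verification.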

- EAGLE is robust to feature errors, with minimal impact on acceptance rates, and is insensitive to the choice of training data, which keeps training overhead low.

- EAGLE's compatibility with other acceleration technologies, such as gpt-fast, further enhances its speed and efficiency, showcasing potential for integration with additional acceleration methods.

- EAGLE preserves the output distribution without any relaxations, setting it apart from speculative sampling methods that trade exactness for speed via relaxations or thresholds.

Summary

The paper introduces a new framework, EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), which aims to accelerate auto-regressive decoding for large language models (LLMs). It discusses the drawbacks of auto-regressive decoding and presents the concept of speculative sampling-based methods, which divide the generation process of LLMs into a draft stage and a verification stage.

EAGLE is evaluated on MT-bench and presented as a simple, reliable, and fast framework for speculative sampling. The paper analyzes the factors behind its effectiveness, emphasizing the use of second-top-layer features and the incorporation of sampling results into the draft model's input.

The framework is shown to be compatible with other acceleration technologies, such as quantization and compilation, and exhibits increased throughput for batch sizes greater than 1. EAGLE is compared with other speculative sampling-based methods and is found to be the fastest framework within this family. It is highlighted that EAGLE does not involve any fine-tuning of the original LLM and effectively maintains the output distribution, making it a promising approach for accelerating auto-regressive decoding for LLMs.

The paper also discusses the potential applications and advantages of EAGLE, demonstrating its robustness to feature errors, low sensitivity to training data, and potential compatibility with other acceleration technologies.

Reference: https://arxiv.org/abs/2401.15077