Key Points
1. Modern Large Language Models (LLMs) have become very large, even exceeding hundreds of billions of parameters, making their inference slow and expensive.
2. Speculative sampling methods have been proposed to address this issue by rapidly generating draft tokens and then verifying them in parallel, significantly reducing inference latency.
3. Most speculative sampling methods, such as EAGLE, use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. However, the authors found that the acceptance rate of draft tokens is also context-dependent.
5. To address this, the authors propose EAGLE-2, which introduces context-aware dynamic draft trees into the drafting stage. EAGLE-2 exploits the fact that EAGLE's draft model is well-calibrated: its confidence scores approximate the acceptance rates of draft tokens with only small errors.
6. Extensive evaluations on three series of LLMs and six tasks show that EAGLE-2 achieves speedup ratios of 3.05x-4.26x, which is 20%-40% faster than EAGLE-1.
6. EAGLE-2 ensures that the distribution of the generated text remains unchanged, making it a lossless acceleration algorithm.
7. EAGLE-2 retains practical advantages such as out-of-the-box usability (it reuses EAGLE's draft model and requires no additional training) and reliability (it does not fine-tune or update the original LLM).
8. The core idea of EAGLE-2 is to dynamically adjust the draft tree structure based on the confidence scores from the draft model, which approximates the acceptance rates of draft tokens.
9. Experiments show that expanding the draft tree based on "value" (the product of confidence scores along the path from the root) outperforms expanding based on the confidence score of the current token alone, supporting the rationale behind the EAGLE-2 approach.
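The expand-by-value idea in points 8-9 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `DraftNode` class, the `propose_children` callback, and the parameter names (`top_k`, `total_tokens`) are hypothetical stand-ins for the draft model's interface.

```python
class DraftNode:
    """One draft token in the tree, with its path 'value'."""

    def __init__(self, token, confidence, parent=None):
        self.token = token
        self.confidence = confidence  # draft model's confidence for this token
        self.parent = parent
        # "Value" = product of confidences along the path from the root,
        # used as an estimate of the whole path's acceptance probability.
        self.value = confidence if parent is None else parent.value * confidence


def expand_layer(frontier, propose_children, top_k=2):
    """Grow the tree one layer, expanding only the top-k frontier nodes by value.

    propose_children(node) should return (token, confidence) pairs from the
    draft model; here it is an abstract callback.
    """
    best = sorted(frontier, key=lambda n: n.value, reverse=True)[:top_k]
    next_frontier = []
    for node in best:
        for tok, conf in propose_children(node):
            next_frontier.append(DraftNode(tok, conf, parent=node))
    return next_frontier


def rerank(all_nodes, total_tokens=8):
    """Keep the top-m draft tokens overall by value for parallel verification."""
    return sorted(all_nodes, key=lambda n: n.value, reverse=True)[:total_tokens]
```

Because the value multiplies confidences along the path, a token with high local confidence on an unlikely branch can still rank below a moderately confident token on a likely branch, which is exactly why value-based expansion beats expanding on the current token's confidence alone.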
Summary
The research paper "EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees" addresses the expensive and time-consuming inference of Large Language Models (LLMs) by proposing an acceleration algorithm called EAGLE-2. The paper builds on the EAGLE framework, introducing context-aware dynamic draft trees into the drafting stage to achieve faster inference. The authors evaluated EAGLE-2 against other speculative sampling methods on three series of LLMs and six tasks, demonstrating improved speedup and lossless acceleration compared to previous methods such as EAGLE-1.
The paper starts by highlighting the challenges of inference with modern LLMs: their substantial parameter counts make autoregressive generation slow and expensive. Speculative sampling methods such as EAGLE reduce inference latency by rapidly generating draft tokens and verifying them in parallel. However, previous methods like EAGLE used a static draft tree structure, which assumes the acceptance rate of draft tokens depends only on their position. The authors found that the acceptance rate is also context-dependent. They therefore propose EAGLE-2, which dynamically adjusts the draft tree structure based on the confidence scores of the draft model, achieving faster inference while leaving the distribution of the generated text unchanged.
The authors conducted extensive evaluations on six tasks, including multi-turn conversation, code generation, mathematical reasoning, instruction following, summarization, and question answering, using various LLMs and datasets. EAGLE-2 demonstrated significant speedup ratios, outperforming other speculative sampling methods across different tasks and datasets. The speedup ratios and average acceptance lengths for EAGLE-2 were consistently higher than other methods, indicating its effectiveness in improving inference efficiency without compromising the distribution of the generated text.
Furthermore, the paper discusses the advantages of EAGLE-2, such as out-of-the-box usability, reliability, and lossless acceleration. The authors also conducted an ablation study on the expansion and reranking phases of EAGLE-2, demonstrating the rationale behind its approach and providing insight into the effectiveness of dynamically adjusting the draft tree structure.
In conclusion, the paper introduces EAGLE-2 as an efficient and lossless speculative sampling method, leveraging the confidence scores from the draft model to approximate acceptance rates and dynamically adjusting the draft tree structure. The extensive evaluations on various tasks and LLMs showcase the superiority of EAGLE-2 in achieving faster inference speeds while preserving the distribution of the generated text, making it a significant contribution to improving inference efficiency in large language models.
Reference: https://arxiv.org/abs/2406.168...