The article introduces the "Skeleton-of-Thought" (SoT) approach as a way to reduce the high generation latency of large language models (LLMs). Most state-of-the-art LLMs generate answers slowly because they decode tokens sequentially. SoT addresses this by guiding the LLM to first generate a skeleton of the answer and then fill in the details of each skeleton point in parallel. This reduces generation latency and has the potential to improve answer quality.
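To make the two-stage idea concrete, here is a minimal sketch of a skeleton-then-expand pipeline. It assumes a hypothetical `call_llm(prompt) -> str` completion function standing in for any local model or API; the prompt wording paraphrases the idea rather than reproducing the paper's templates, and threads stand in for the batched decoding a local deployment would use.

```python
# Minimal Skeleton-of-Thought sketch: stage 1 asks for a skeleton,
# stage 2 expands each skeleton point independently and in parallel.
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def skeleton_of_thought(question: str, call_llm: Callable[[str], str],
                        max_workers: int = 8) -> str:
    # Stage 1: request a short numbered skeleton of the answer.
    skeleton_prompt = (
        "Provide a concise skeleton (a numbered list of 3-10 points, "
        f"a few words each) for answering the question:\n{question}"
    )
    skeleton = call_llm(skeleton_prompt)
    points: List[str] = re.findall(r"^\s*\d+\.\s*(.+)$", skeleton, flags=re.MULTILINE)

    # Stage 2: each point is expanded with its own prompt, so the calls
    # are independent and can run concurrently.
    def expand(point: str) -> str:
        expand_prompt = (
            f"Question: {question}\nSkeleton:\n{skeleton}\n"
            f"Write 1-2 sentences expanding only this point: {point}"
        )
        return call_llm(expand_prompt)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        expansions = list(pool.map(expand, points))

    # Stitch the skeleton points and their expansions into the final answer.
    return "\n".join(
        f"{i + 1}. {point} {expansion}"
        for i, (point, expansion) in enumerate(zip(points, expansions))
    )
```

Because the expansion calls do not depend on one another, the end-to-end latency is roughly one skeleton pass plus the longest single expansion, rather than the sum of all expansions.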


The article identifies three main causes of slow generation in LLMs: large model size, the core attention operation with its quadratic complexity, and sequential decoding. While existing research has focused on compressing or redesigning models and optimizing hardware, SoT tackles the third cause by questioning the assumption that LLMs must decode their answers fully sequentially.


Evaluation of SoT on 11 LLMs and its limitations

To evaluate SoT, the authors test it on 11 recently released LLMs. The results show significant speed-ups of up to 2.39× across the models. SoT also shows potential to improve answer quality in terms of diversity and relevance. However, it may be less suitable for question categories that require sequential thinking, such as math and coding.


Future directions of SoT and potential improvements

The article also discusses future directions for SoT. It suggests extending SoT to more complex thinking structures, similar to a "Graph-of-Thought." It also highlights data-centric optimization, in which efficiency is improved by organizing the content of the generation rather than changing the model or hardware, as a promising area for further research.


Overall effectiveness and benefits of SoT approach

Overall, SoT is a promising approach for reducing generation latency in LLMs while maintaining, and in some cases improving, answer quality. It opens new avenues for optimizing how LLMs organize their answers and for exploring data-centric approaches to efficiency. The experiments evaluate SoT in terms of answer quality, throughput, latency, and peak memory usage, and show that it delivers competitive answers while significantly improving efficiency, making it a valuable tool for speeding up LLM generation.

Reference: https://arxiv.org/abs/2307.153...