The article introduces the "Skeleton-of-Thought" (SoT) approach as a way to reduce the high generation latency of large language models (LLMs). Most state-of-the-art LLMs generate answers slowly because they decode tokens sequentially. SoT addresses this by guiding the LLM to first generate a skeleton of the answer and then fill in the details of each skeleton point in parallel. This reduces generation latency and has the potential to improve answer quality.
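To make the two-stage idea concrete, here is a minimal sketch of a skeleton-then-expand pipeline. It assumes a hypothetical `call_llm(prompt) -> str` completion function standing in for any local model or API; the prompt wording paraphrases the idea rather than reproducing the paper's templates, and threads stand in for the batched decoding a local deployment would use.

```python
# Minimal Skeleton-of-Thought sketch: stage 1 asks for a skeleton,
# stage 2 expands each skeleton point independently and in parallel.
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def skeleton_of_thought(question: str, call_llm: Callable[[str], str],
                        max_workers: int = 8) -> str:
    # Stage 1: request a short numbered skeleton of the answer.
    skeleton_prompt = (
        "Provide a concise skeleton (a numbered list of 3-10 points, "
        f"a few words each) for answering the question:\n{question}"
    )
    skeleton = call_llm(skeleton_prompt)
    points: List[str] = re.findall(r"^\s*\d+\.\s*(.+)$", skeleton, flags=re.MULTILINE)

    # Stage 2: each point is expanded with its own prompt, so the calls
    # are independent and can run concurrently.
    def expand(point: str) -> str:
        expand_prompt = (
            f"Question: {question}\nSkeleton:\n{skeleton}\n"
            f"Write 1-2 sentences expanding only this point: {point}"
        )
        return call_llm(expand_prompt)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        expansions = list(pool.map(expand, points))

    # Stitch the skeleton points and their expansions into the final answer.
    return "\n".join(
        f"{i + 1}. {point} {expansion}"
        for i, (point, expansion) in enumerate(zip(points, expansions))
    )
```

Because the expansion calls do not depend on one another, the end-to-end latency is roughly one skeleton pass plus the longest single expansion, rather than the sum of all expansions.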


The article identifies three main causes of slow generation in LLMs: large model size, the core attention operation with its quadratic complexity, and sequential decoding. While existing research has focused on compressing or redesigning models and optimizing hardware, SoT tackles the third cause by questioning the assumption that LLMs must decode their answers fully sequentially.


Evaluation of SoT on 11 LLMs and its limitations

To evaluate SoT, the authors test it on 11 recently released LLMs. The results show significant speed-ups of up to 2.39× across the models. SoT also shows potential to improve answer quality in terms of diversity and relevance. However, it may be less suitable for question categories that require sequential thinking, such as math and coding.


Future directions of SoT and potential improvements

The article also discusses future directions for SoT. It suggests extending SoT to more complex thinking structures, similar to a "Graph-of-Thought." It also highlights data-centric optimization, in which efficiency is improved by organizing the content of the generation rather than changing the model or hardware, as a promising area for further research.


Overall effectiveness and benefits of SoT approach

Overall, SoT is a promising approach for reducing generation latency in LLMs while maintaining, and in some cases improving, answer quality. It opens new avenues for optimizing how LLMs organize their answers and for exploring data-centric approaches to efficiency. The experiments evaluate SoT in terms of answer quality, throughput, latency, and peak memory usage, and show that it delivers competitive answers while significantly improving efficiency, making it a valuable tool for speeding up LLM generation.

Reference: https://arxiv.org/abs/2307.153...