Key Points
1. Large Language Models (LLMs) are essentially n-gram models on steroids: trained on web-scale language corpora, they exhibit linguistic behaviors that were unexpected from text-completion systems.
2. LLMs can be viewed as giant non-veridical memories, akin to an external System 1. They are good at approximate retrieval, probabilistically reconstructing plausible completions for a prompt.
3. The research evaluated GPT3, GPT3.5, and GPT4 on planning instances and found results contrary to anecdotal claims about the planning abilities of LLMs, with GPT4 reaching only about 30% empirical accuracy in the Blocks World domain.
4. Fine-tuning LLMs on planning problems did not yield significant improvement; it essentially converts the planning task into memory-based approximate retrieval.
5. LLMs can generate candidate answers for planning problems, but those candidates must be checked by external verifiers; the models' ability to critique their own guesses is questionable.
6. Claims about the planning capabilities of LLMs often mistake general planning knowledge extracted from them for executable plans; the generated plans may fail to account for subgoal interactions, leading to errors.
7. LLMs serve as a rich source of approximate models of world/domain dynamics and user preferences, which can be leveraged in LLM-Modulo frameworks that pair them with model-based solvers and human verification (see the sketch after this list).
8. The paper argues that LLMs do not possess reasoning/planning capabilities as normally understood; what they excel at is idea generation, which can be effectively harnessed to support reasoning/planning within LLM-Modulo frameworks.
9. In short, LLMs' remarkable approximate retrieval abilities can be gainfully leveraged, but there is no compelling reason to believe that they reason or plan in their own right.
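The LLM-Modulo idea behind points 7-9 amounts to a generate-test loop: the LLM proposes candidate plans, an external model-based verifier certifies or critiques them, and the critique is fed back into the next prompt. The following is a minimal sketch in Python, where the propose and verify callables are hypothetical stand-ins (not defined in the paper) for an LLM API call and a model-based verifier:

    from typing import Callable, Optional, Tuple

    def llm_modulo_plan(
        problem: str,
        propose: Callable[[str, str], str],              # LLM call: (problem, critique) -> candidate plan
        verify: Callable[[str, str], Tuple[bool, str]],  # verifier: (problem, plan) -> (ok, critique)
        max_rounds: int = 5,
    ) -> Optional[str]:
        # Generate-test loop: the LLM only guesses; soundness comes
        # entirely from the external, model-based verifier.
        critique = ""
        for _ in range(max_rounds):
            candidate = propose(problem, critique)  # approximate retrieval / idea generation
            ok, critique = verify(problem, candidate)
            if ok:
                return candidate                    # plan certified by the verifier
        return None                                 # no verified plan within the budget

Because correctness is established outside the LLM, the loop inherits whatever guarantees the verifier provides while still exploiting the LLM's strength at idea generation.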
Summary
Capabilities of Large Language Models (LLMs)
The research paper investigates the capabilities of Large Language Models (LLMs) in planning and reasoning tasks. LLMs are described as n-gram models on steroids trained on web-scale language corpora, essentially acting as giant non-veridical memories. The paper questions whether LLMs are capable of principled reasoning, arguing that they excel at approximate retrieval rather than the guaranteed exact recall of stored answers.
Evaluation of GPT3.5 and GPT4
The study evaluates the planning abilities of GPT3.5 and GPT4, with results indicating only a modest improvement in plan-generation accuracy from GPT3 to GPT4. Fine-tuning LLMs on planning problems did not significantly improve performance; it merely converted the planning task into memory-based approximate retrieval. Claims that LLMs are zero-shot reasoners are analyzed, and the evidence suggests that LLMs struggle to plan autonomously without some form of nudging.
The paper explores the difficulty of telling whether LLMs are memorizing or actually solving problems, and discusses methodologies for improving planning and reasoning performance, such as external model-based plan verification (illustrated below). It emphasizes that LLMs are proficient at idea generation and approximate retrieval, but cautions against ascribing to them reasoning/planning capabilities they have not demonstrated.
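To make "external model-based plan verification" concrete, here is a toy Blocks World checker in the spirit of plan validators such as VAL. This is an illustrative sketch, not the paper's evaluation harness: it simulates each move against explicit preconditions and effects and returns a critique on failure.

    # Toy Blocks World plan checker, a stand-in for a model-based verifier.
    # State maps each block to what it rests on ("table" or another block);
    # a block is clear when nothing rests on it.

    def is_clear(state, block):
        return all(under != block for under in state.values())

    def check_plan(state, plan, goal):
        """Simulate plan (a list of (block, dest) moves) from state.
        Returns (ok, critique), mirroring an external verifier's output."""
        state = dict(state)
        for i, (block, dest) in enumerate(plan):
            # Preconditions: the moved block must be clear, and so must
            # the destination unless it is the table.
            if not is_clear(state, block):
                return False, f"step {i}: {block} is not clear"
            if dest != "table" and not is_clear(state, dest):
                return False, f"step {i}: {dest} is not clear"
            state[block] = dest  # effect: block now rests on dest
        unmet = {b: d for b, d in goal.items() if state.get(b) != d}
        return not unmet, (f"unmet goals: {unmet}" if unmet else "plan valid")

    # Example: stacking A on B from a table-only start is a valid plan.
    ok, msg = check_plan({"A": "table", "B": "table"}, [("A", "B")], {"A": "B"})
    print(ok, msg)  # True plan valid

A checker like this is sound by construction with respect to the domain model, which is exactly the property the LLM itself cannot be relied on to provide.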
Scrutiny of Previous Claims and Advocacy for LLM-Modulo Frameworks
The paper scrutinizes previous claims about the planning capabilities of LLMs, suggesting that many evaluations confound general planning knowledge extracted from LLMs with complete executable plans, and often rely on humans to correct the plans or on domains where subgoal interactions can be ignored. It also highlights the limitations of LLMs in self-verification, and discusses their potential use as approximate models of world/domain dynamics and user preferences, provided those models are verified and refined by humans. The study ultimately advocates LLM-Modulo frameworks as a way to leverage LLMs' approximate retrieval abilities without attributing questionable reasoning/planning capabilities to them.
Reference: https://arxiv.org/abs/2403.041...