Key Points

1. The paper evaluates the planning abilities of large language models (LLMs) and the newer Large Reasoning Models (LRMs), specifically OpenAI's o1 model, using PlanBench, a benchmark introduced in 2022 to assess planning and reasoning capabilities. The evaluation is motivated by the claim that o1 is an LRM with a fundamentally different architecture and mode of operation than LLMs.

2. The authors found that LLMs, despite their scale and the investment behind them, have made slow and unsatisfactory progress on PlanBench. The o1 (Strawberry) model, positioned as an LRM, showed impressive improvements over LLMs but still fell short of saturating the benchmark, raising concerns about its accuracy, efficiency, and guarantees.

3. The study revealed that o1 performed exceptionally well on the original test set, achieving high accuracy on Blocksworld instances. However, its performance dropped markedly on the obfuscated Mystery Blocksworld domain. The results also suggested that one-shot prompting does not necessarily improve on zero-shot prompting, and for certain models it can hurt.

4. Operational differences between LRMs and LLMs were emphasized, in particular LRMs' apparent ability to generate and select appropriate Chain-of-Thought (CoT) moves during reasoning tasks.

5. Performance assessment of o1 on larger problem sizes showed a decline in accuracy, indicating a lack of robustness in LRMs' planning capabilities. In addition, o1 showcased limitations in recognizing unsolvable problems, often generating incorrect or nonsensical plans.

6. The study discussed the trade-offs between accuracy, efficiency, cost, and guarantees associated with LRMs like o1 when compared to classical planners like Fast Downward, which provide correctness guarantees at considerably lower costs and computation times. The cost per instance was substantially higher for LRMs, presenting concerns about cost-effectiveness and predictability.

7. The paper pointed out the lack of interpretability and trust in o1, as its architecture and reasoning traces are concealed, making it a black box system with limited transparency. The model's reliance on opaque reasoning and its tendency to provide creative, but nonsensical justifications for incorrect decisions raised concerns about its reliability and interpretability.

8. The authors called for realistic evaluations of LLMs and LRMs, emphasizing the need for comprehensive assessments that consider accuracy, efficiency, cost, and guarantees while highlighting alternative approaches, such as LLM-Modulo systems, and classical planners that offer correctness guarantees at lower costs.

9. The research was supported by ONR grant N00014-23-1-2409 and gifts from Qualcomm and Amazon, with input from fellow lab members and discussions exploring the performance of o1 on other benchmarks such as TravelPlan and Natural Plan.

Summary

The paper examines the planning abilities of large language models (LLMs) and a new class of models called Large Reasoning Models (LRMs), with a focus on OpenAI's o1 (Strawberry) model and its performance on the PlanBench benchmark.

Struggles with Planning Tasks
The results show that state-of-the-art LLMs still struggle with planning tasks, even on simple Blocksworld problems. While the best LLM, LLaMA 3.1 405B, achieved 62.6% accuracy on the standard Blocksworld domain, its performance dropped dramatically on the obfuscated "Mystery Blocksworld" version, with no LLM achieving even 5% accuracy. This suggests that LLMs rely primarily on approximate retrieval rather than true reasoning capabilities.
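The "Mystery Blocksworld" idea can be illustrated with a small sketch: the planning problem is left unchanged, but every domain term is systematically renamed to a semantically unrelated token, stripping away the surface vocabulary an LLM could match against its training data. The mapping below is an illustrative placeholder in the spirit of the benchmark, not the paper's actual obfuscation lexicon.

```python
# Illustrative sketch of Mystery-Blocksworld-style obfuscation: same problem
# structure, but domain phrases are mapped to unrelated tokens. The mapping
# here is assumed for illustration, not the benchmark's actual one.
OBFUSCATION = {
    "put down": "succumb",
    "pick up": "attack",
    "unstack": "feast",     # listed before "stack" so "unstack" isn't mangled
    "stack": "overcome",
    "on the table": "in a province",
    "block": "object",
}

def obfuscate(prompt: str, mapping: dict[str, str]) -> str:
    """Replace each domain phrase with its obfuscated counterpart, in order."""
    for plain, hidden in mapping.items():
        prompt = prompt.replace(plain, hidden)
    return prompt

plan_step = "unstack the red block, then put down the red block"
print(obfuscate(plan_step, OBFUSCATION))
# -> "feast the red object, then succumb the red object"
```

Because the obfuscated instance is logically identical to the original, a system that truly reasons over the problem structure should lose little accuracy; the large drop the authors observe is what motivates the approximate-retrieval interpretation.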

Improvement in Planning Performance
In contrast, the new o1 model, which the authors categorize as an LRM, shows a significant improvement in planning performance. On the original Blocksworld test set, o1-preview achieved 97.8% accuracy, far surpassing previous LLMs. On the more challenging Mystery Blocksworld, o1-preview managed 52.8% correct answers, again a substantial leap over LLMs.

Limitations of o1 Model
However, the authors note that o1's performance is still not robust, as its accuracy quickly degrades on larger problems requiring longer plans. On a set of 110 Blocksworld instances with 20-40 step plans, o1-preview's accuracy dropped to only 23.63%. Additionally, the model struggles to reliably identify unsolvable problems, with only 27% of such instances correctly identified on Blocksworld and just 16% on Randomized Mystery Blocksworld.
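Verifying a candidate plan, by contrast, is mechanical and cheap: a validator simulates each action's preconditions and effects and then checks the goal. The paper relies on standard PDDL tooling for this; the sketch below is a minimal, self-contained Blocksworld validator with my own state encoding, shown only to make the check concrete.

```python
# Minimal Blocksworld plan validator (illustrative sketch; the benchmark uses
# standard PDDL validation tools, not this code). A state is a set of facts:
# ("on", x, y), ("ontable", x), ("clear", x), ("holding", x), ("handempty",).

def apply(state, action):
    """Return the successor state, or None if the action's preconditions fail."""
    op, *args = action
    s = set(state)
    if op == "pickup":
        (x,) = args
        need = {("ontable", x), ("clear", x), ("handempty",)}
        return (s - need) | {("holding", x)} if need <= s else None
    if op == "putdown":
        (x,) = args
        if ("holding", x) not in s:
            return None
        return (s - {("holding", x)}) | {("ontable", x), ("clear", x), ("handempty",)}
    if op == "unstack":
        x, y = args
        need = {("on", x, y), ("clear", x), ("handempty",)}
        return (s - need) | {("holding", x), ("clear", y)} if need <= s else None
    if op == "stack":
        x, y = args
        need = {("holding", x), ("clear", y)}
        return (s - need) | {("on", x, y), ("clear", x), ("handempty",)} if need <= s else None
    return None  # unknown action name

def validate(init, plan, goal):
    """Simulate the plan from init; True iff every step applies and goal holds."""
    state = set(init)
    for action in plan:
        state = apply(state, action)
        if state is None:
            return False
    return goal <= state

init = {("ontable", "a"), ("on", "b", "a"), ("clear", "b"), ("handempty",)}
plan = [("unstack", "b", "a"), ("putdown", "b")]
goal = {("ontable", "b")}
print(validate(init, plan, goal))  # -> True
```

The asymmetry this highlights is central to the paper's argument: checking a plan takes microseconds, so the bottleneck is generating correct plans, and a model that emits confident but invalid plans (or "plans" for unsolvable instances) offers no internal signal of failure.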

Efficiency and Cost Implications
The paper also discusses the efficiency and cost implications of using LRMs like o1. Unlike previous LLMs, whose costs depend only on the number of input and output tokens, o1 additionally bills for the hidden "reasoning tokens" it generates during inference, which can make per-instance costs significantly higher and harder to predict. The authors compare o1's performance and cost to classical planners like Fast Downward, which solve the benchmark instances far more efficiently and with guaranteed correctness.
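To make the cost concern concrete, per-instance cost can be estimated from token counts. All prices and token counts in this sketch are placeholder assumptions for illustration, not the paper's measurements or any provider's actual pricing.

```python
# Back-of-the-envelope per-instance cost estimate. Every number below is an
# illustrative assumption, not a measured or official figure.
PRICE_PER_1M = {"input": 15.00, "output": 60.00}  # assumed USD per 1M tokens

def o1_instance_cost(input_tokens, visible_output_tokens, reasoning_tokens):
    """Hidden reasoning tokens are billed at the output-token rate."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * PRICE_PER_1M["input"]
            + billed_output * PRICE_PER_1M["output"]) / 1_000_000

# e.g. a 1,500-token prompt, 300 visible output tokens,
# and 8,000 hidden reasoning tokens (all assumed):
cost = o1_instance_cost(1_500, 300, 8_000)
print(f"${cost:.4f} per instance")  # dominated by the hidden reasoning tokens
```

Because the number of reasoning tokens is neither controllable nor visible in advance, the cost per instance is not just high but unpredictable, whereas a classical planner like Fast Downward solves instances of this size in fractions of a second at effectively zero marginal cost, which is the comparison the authors draw.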

Overall, the paper provides a comprehensive evaluation of the planning capabilities of LLMs and the new LRM models, highlighting the progress made by o1 while also identifying its limitations and the need for further development to achieve robust, efficient, and trustworthy planning abilities.

Reference: https://arxiv.org/abs/2409.133...