Key Points

1. The paper proposes Compositional Foundation Models for Hierarchical Planning (HiP) for making effective decisions in novel environments with long-horizon goals, emphasizing hierarchical reasoning across spatial and temporal scales.

2. HiP leverages multiple expert foundation models, each trained individually on language, vision, or action data, to solve long-horizon tasks. It plans abstract subgoal sequences, reasons visually about the resulting plans, and executes actions consistent with the devised plan through visuomotor control.

3. By composing separate expert models trained on different modalities, HiP reduces the data requirements for building the overall foundation model and avoids collecting expensive paired data across modalities.

4. HiP enforces consistency between the models via iterative refinement: feedback from downstream models keeps the levels of the hierarchy in agreement, yielding hierarchically consistent and executable plans.

5. The paper also describes the model architecture and provides pseudocode for the planning process, in which HiP factors the goal-conditioned plan distribution into a task distribution, a visual distribution, and an action distribution (see the sketch after this list).

6. The paper evaluates HiP in three environments against several baselines, showing that it significantly outperforms them on both seen and unseen long-horizon tasks and underscoring the importance of hierarchy, task planning, and visual planning.

7. The study also investigates the benefit of pretraining the video diffusion model on the Ego4D dataset, showing that pretraining yields higher success rates and lower Fréchet Video Distance (FVD) scores and demonstrating the value of pretraining on Internet-scale video data.

8. By composing pretrained expert models rather than training a monolithic model from scratch, the approach sidesteps the costly and tedious collection of paired data across language, vision, and action for long-horizon planning.

9. The findings show that HiP offers promising results on long-horizon tasks requiring hierarchical reasoning and decision-making, suggesting potential for practical applications across domains.
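
To make point 5 concrete, the following is a minimal Python sketch of how a factored planner of this kind could be organized: the task model (an LLM) proposes subgoals, the visual model (a video diffusion model) renders each subgoal as a short video plan, and the action model (an inverse dynamics model) recovers the actions between consecutive frames. The function and method names here are illustrative placeholders, not the authors' API.

```python
# Minimal sketch of a factored hierarchical planner in the spirit of HiP.
# All names (hip_plan, propose_subgoals, generate, infer_action) are
# illustrative placeholders, not the paper's actual interfaces.

def hip_plan(goal, observation, task_model, video_model, action_model):
    """Return a flat action sequence for a language-specified goal.

    task_model   -- samples a subgoal sequence from the task distribution (LLM)
    video_model  -- samples a frame sequence per subgoal (video diffusion)
    action_model -- maps consecutive frames to actions (inverse dynamics)
    """
    actions = []
    subgoals = task_model.propose_subgoals(goal, observation)
    for subgoal in subgoals:
        frames = video_model.generate(subgoal, observation)
        for frame, next_frame in zip(frames[:-1], frames[1:]):
            actions.append(action_model.infer_action(frame, next_frame))
        # The last generated frame becomes the starting observation
        # for planning the next subgoal.
        observation = frames[-1]
    return actions
```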

Summary

Model Overview and Performance
The paper proposes a Compositional Foundation Model for Hierarchical Planning (HiP) to address the challenge of generating action trajectories for long-horizon tasks specified by language goals. HiP constructs long-horizon plans through three levels of hierarchy: task planning, visual planning, and action planning. It uses a large language model (LLM) for task planning, a video diffusion model for visual planning, and an inverse dynamics model for action planning. The authors demonstrate HiP's efficacy on long-horizon tasks in three environments and compare it against several baselines, showing significant advantages in task completion rates.
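
As an illustration of the lowest level of this hierarchy, the snippet below sketches what an inverse dynamics model can look like: a small network that takes two consecutive frames and predicts the action connecting them. The architecture shown (a small convolutional encoder with a linear head) is our own illustrative choice, not the paper's.

```python
# Illustrative inverse-dynamics model in PyTorch (our sketch, not the paper's
# exact architecture): given two consecutive frames, predict the action that
# transitions between them.
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    def __init__(self, action_dim: int):
        super().__init__()
        # Encode the pair of RGB frames stacked along the channel dimension.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        # frame_t, frame_t1: (batch, 3, H, W) image tensors.
        x = torch.cat([frame_t, frame_t1], dim=1)
        return self.head(self.encoder(x))
```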

Iterative Refinement Mechanism
A key contribution of the paper is the iterative refinement mechanism, which promotes consensus across the separately trained models and yields hierarchically consistent plans that are responsive to the goal and executable given the current state and agent. The research also investigates the impact of pretraining the video diffusion model on a large-scale text-to-video dataset, showing that pretraining leads to higher success rates and lower Fréchet Video Distance (FVD) scores, i.e., greater similarity between generated and ground-truth videos.
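
One way the feedback described above could be implemented is to sample several candidate subgoal sequences from the language model and re-rank them with a score from the downstream visual model, keeping the candidate the video model considers most plausible from the current observation. The sketch below illustrates this idea; the `score_consistency` and `hint` interfaces are assumptions made for illustration and may differ from the paper's exact estimator.

```python
# Hypothetical sketch of consistency feedback via candidate re-ranking.
# `score_consistency` and the `hint` argument are assumed interfaces standing
# in for whatever feedback signal the downstream (visual) model provides.

def refine_subgoals(goal, observation, task_model, video_model,
                    num_candidates=8, num_rounds=3):
    """Iteratively re-rank LLM subgoal proposals using downstream feedback."""
    best, best_score = None, float("-inf")
    for _ in range(num_rounds):
        # Propose several candidate subgoal sequences; later rounds are
        # conditioned on the current best candidate as a form of feedback.
        candidates = [task_model.propose_subgoals(goal, observation, hint=best)
                      for _ in range(num_candidates)]
        for candidate in candidates:
            # Downstream score: how plausible is this plan given what the
            # visual model can actually generate from the observation?
            score = video_model.score_consistency(candidate, observation)
            if score > best_score:
                best, best_score = candidate, score
    return best
```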

Furthermore, the paper evaluates HiP on long-horizon planning tasks in different environments, where its performance holds up even on unseen tasks containing novel combinations of object colors, categories, and subtasks. HiP outperforms alternative strategies for constructing language-conditioned robot manipulation policies, highlighting the contribution of the hierarchy, task-planning, and visual-planning components of the model.

Summary and Conclusion
In summary, the paper introduces HiP, a compositional foundation model for hierarchical planning that leverages separate expert models trained on language, vision, and action data to construct long-horizon plans. The research demonstrates the effectiveness of the model in solving long-horizon tasks and showcases its superior performance compared to several baselines in different environments.

Reference: https://arxiv.org/abs/2309.08587