Key Points
1. The paper introduces a framework called Manager-Worker-Self-Evaluator, which combines external knowledge retrieval and internal experience fusion for planning tasks. The Manager uses the fused knowledge to generate a queue of subtasks and associated contexts.
2. The Worker learns from subtask experiences and reflects on the entire episode to refine its strategies. It then generates a structured response for grounded actions and signals the completion of a subtask either through "DONE" or "FAIL."
3. The Self-Evaluator generates summarized experiences as textual rewards: subtask-level experiences are saved into episodic memory, and, upon completion of the full task, the trajectory is summarized and saved into narrative memory.
4. Supplementary examples of successful and failed tasks across different domains complement the qualitative analysis with detailed error analysis.
5. Execution errors are traced to three sources: planning, execution, and grounding errors. Empirically, 46% of execution errors are caused by planning or grounding errors; planning deficiencies such as inaccurate planning information and erroneous task sequencing, and grounding errors such as misselection of interface elements, propagate into failed task executions.
6. Reducing planning and grounding errors, particularly grounding errors, is identified as key to improving task completion and the overall performance of the Manager-Worker-Self-Evaluator framework.
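The control flow described in the points above can be sketched as a simple loop. This is a minimal illustrative sketch only: all class, method, and variable names here are assumptions for exposition and do not reflect the paper's actual implementation.

```python
# Illustrative sketch of the Manager-Worker-Self-Evaluator loop.
# All names below are hypothetical; the paper's implementation differs.
from collections import deque

class Manager:
    def plan(self, task, fused_knowledge):
        # Fuse external knowledge and internal narrative memory into
        # a queue of (subtask, context) pairs.
        return deque([("open_app", "launch the target application"),
                      ("edit_file", "apply the requested change")])

class Worker:
    def run_subtask(self, subtask, context, episodic_memory):
        # Reflect on retrieved subtask experience, then emit grounded
        # actions until signalling "DONE" or "FAIL".
        trajectory = [f"action for {subtask} given: {context}"]
        return "DONE", trajectory

class SelfEvaluator:
    def summarize(self, trajectory):
        # Summarize a trajectory as a textual reward.
        return f"learned from {len(trajectory)} step(s)"

def run_task(task):
    manager, worker, evaluator = Manager(), Worker(), SelfEvaluator()
    episodic_memory, narrative_memory, full_trajectory = [], [], []
    queue = manager.plan(task, fused_knowledge=None)
    while queue:
        subtask, context = queue.popleft()
        status, traj = worker.run_subtask(subtask, context, episodic_memory)
        full_trajectory.extend(traj)
        if status == "FAIL":
            return "FAIL", episodic_memory, narrative_memory
        # Subtask-level experience goes into episodic memory.
        episodic_memory.append(evaluator.summarize(traj))
    # Full-task experience goes into narrative memory.
    narrative_memory.append(evaluator.summarize(full_trajectory))
    return "DONE", episodic_memory, narrative_memory
```

The key structural point the sketch captures is the two-level memory write: each completed subtask feeds episodic memory, while only a fully completed task feeds narrative memory.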
Summary
The paper "Agent S: An Open Agentic Framework that Uses Computers Like a Human" introduces Agent S, an open agentic framework aimed at transforming human-computer interaction by automating complex, multi-step tasks through Graphical User Interface (GUI) automation. The key challenges in automating computer tasks are acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. Agent S addresses these challenges through experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. It employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs).
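The fusion of external knowledge search with internal experience retrieval can be sketched as follows. This is a toy illustration under stated assumptions: the function names, the word-overlap similarity, and the dictionary structure are all hypothetical, standing in for whatever retrieval the paper actually uses.

```python
# Hypothetical sketch of "experience fusion": external knowledge (e.g., a web
# search snippet) is combined with internal narrative memory retrieved by
# similarity to the current task. All names and the toy similarity metric
# are illustrative assumptions, not the paper's method.

def retrieve_similar(memory, task, k=1):
    # Toy similarity: count words shared between the task and each entry.
    def score(entry):
        return len(set(task.lower().split()) & set(entry.lower().split()))
    return sorted(memory, key=score, reverse=True)[:k]

def fuse_knowledge(task, external_knowledge, narrative_memory):
    retrieved = retrieve_similar(narrative_memory, task)
    # The fused context is what the Manager conditions on when planning.
    return {"task": task,
            "external": external_knowledge,
            "internal": retrieved}

memory = ["renamed a file in the file manager",
          "changed display resolution in settings"]
ctx = fuse_knowledge("rename the report file",
                     "search snippet: use F2 to rename",
                     memory)
```

In this sketch, `ctx["internal"]` holds the most task-similar prior experience, mirroring the idea that planning conditions jointly on retrieved internal experience and fresh external knowledge.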
Performance Evaluation
Agent S outperforms the baseline by 9.37% in success rate, achieving a new state of the art on the OSWorld benchmark. The language-centric Agent-Computer Interface (ACI) serves as an abstraction layer that improves grounding, safety, and efficiency for MLLM-based GUI agents. Evaluation on the WindowsAgentArena benchmark further demonstrates broad generalizability across operating systems, with consistent improvements across computer task categories.
Framework Details
The paper describes the action space and presents detailed ablation studies of the Agent-Computer Interface (ACI), experience-augmented hierarchical planning, and memory construction. The results show that both the continual-learning component and the Self-Evaluator are critical to Agent S's performance. A thorough error analysis of failed tasks demonstrates the potential of MLLM agents to learn from external sources and from direct interaction with the environment, without human or environment-provided feedback, in the GUI-agent domain. Proposed future work includes accounting for the number of agent steps and the wall-clock time required for task completion, and extending experiential learning and the ACI to smaller, open-source MLLMs that could be fine-tuned to bridge the gap.
Worker-Manager Framework
The paper discusses a Worker-Manager framework in which the Worker refines its strategies through trajectory reflection, action generation, and subtask completion: using retrieved subtask experience, the Worker generates a structured response for grounded actions. When a subtask is completed, the Self-Evaluator generates episodic experience, which is saved into episodic memory; upon completion of the full task, a task-completion reward is generated and saved into narrative memory. Supplementary qualitative examples showcase successful tasks from different domains, alongside detailed error analyses of failed tasks that trace the sources of execution errors and their impact on performance.
Task Execution Analysis
Furthermore, the paper analyzes the agent's execution trajectories, presenting successful task examples while also revealing issues such as incorrectly entered data, inappropriate actions, failure to recognize task completion, and attempts to recover existing files. The detailed error analysis traces execution errors to planning, execution, and grounding sources, indicating the challenge of achieving consistently reliable behavior even in nominally completed tasks.
Error Analysis and Impact
The qualitative analysis presents several failed tasks with detailed error analyses, showing empirically that grounding and planning errors often lead directly to execution errors, which manifest as repetitive actions and wrong decisions during task performance. Overall, the analysis of the Worker-Manager framework yields insight into the generation of structured responses, the saving of experiences into memory, and the compounding effects of planning, execution, and grounding errors, contributing to a comprehensive understanding of the framework and of the complexities involved in completing tasks and refining strategies.
Reference: https://arxiv.org/abs/2410.08164v1