Key Points
1. Language model (LM) based agents are rapidly improving and can now tackle digital tasks like web navigation, but they struggle with long-horizon tasks with complex action trajectories.
2. Humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions.
3. The paper introduces Agent Workflow Memory (AWM), a method for inducing commonly reused routines (workflows) and selectively providing them to the agent to guide subsequent generations.
4. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly.
5. AWM is evaluated on the Mind2Web and WebArena web navigation benchmarks, where it improves on baseline results by 24.6% and 51.1% in relative success rate, respectively.
6. AWM reduces the number of steps taken to solve WebArena tasks successfully.
7. Online AWM generalizes robustly in cross-task, cross-website, and cross-domain evaluations, surpassing baselines as the gap between train and test task distributions widens.
8. Workflow representations that capture abstract sub-routines rather than concrete examples contribute to AWM's superior performance.
9. Augmenting agent memory with workflows induced from both offline and online sources shows promise but requires careful integration to avoid compatibility issues.
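The improvements quoted in the key points are relative gains over a baseline, not absolute percentage-point differences. A minimal sketch of that arithmetic follows; the function name and the input numbers are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical success rates for illustration only (not results from the paper).
gain = relative_improvement(baseline=20.0, improved=30.0)
print(f"{gain:.1f}% relative improvement")
```

With these illustrative inputs, a 10-point absolute gain on a 20% baseline corresponds to a 50% relative improvement, which is why relative figures can look much larger than the underlying absolute change.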
Summary
This paper introduces "Agent Workflow Memory" (AWM), a technique to help agents learn and apply reusable task workflows to improve their performance on complex, long-horizon tasks. The key findings of the paper are:
1. AWM allows agents to continuously induce and apply workflows to improve their task completion success rates and efficiency, substantially outperforming baseline methods. On the WebArena benchmark, AWM improves the top published autonomous method by 51.1% in relative success rate, and even outperforms methods augmented with human-written workflows by 7.9%.
2. AWM demonstrates strong generalization abilities across tasks, websites, and domains. On the Mind2Web benchmark, AWM effectively improves the cross-task results by 24.6% in relative step-wise success rate compared to prior state-of-the-art methods.
3. AWM can operate in both offline and online settings. In the offline setting, AWM extracts reusable workflows from available training examples. In the online setting, AWM iteratively induces workflows from self-generated predictions during test-time inference, without requiring any additional training data.
4. The paper examines the mechanism behind AWM's ability to build increasingly complex workflows over time, by learning from past experiences and earlier workflows. AWM can flexibly apply and build upon previously induced workflows to solve more complex tasks.
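The online setting described above can be pictured as a loop that solves one test query at a time and feeds what it learns back into memory. The following is a rough sketch under stated assumptions, not the paper's implementation: `Workflow`, `WorkflowMemory`, `induce_workflow`, `run_online`, and `stub_agent` are hypothetical names, the induction step is a toy rule that replaces concrete arguments with placeholders (the paper's own induction procedure is not reproduced here), and the success check is an assumed stand-in for whatever filter decides which trajectories are worth keeping.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (action name, concrete argument), e.g. ("type", "laptop")


@dataclass
class Workflow:
    name: str
    steps: List[str]  # abstract steps with placeholders instead of concrete values


@dataclass
class WorkflowMemory:
    workflows: List[Workflow] = field(default_factory=list)

    def add(self, workflow: Workflow) -> None:
        # Store a workflow only if no existing one has identical steps.
        if all(w.steps != workflow.steps for w in self.workflows):
            self.workflows.append(workflow)


def induce_workflow(task: str, trajectory: List[Action]) -> Workflow:
    """Toy induction (assumption): keep action names, abstract away concrete arguments."""
    steps = [f"{name}({{arg{i}}})" for i, (name, _) in enumerate(trajectory)]
    return Workflow(name=task, steps=steps)


def run_online(
    tasks: List[str],
    agent: Callable[[str, List[Workflow]], List[Action]],
    is_success: Callable[[str, List[Action]], bool],
    memory: WorkflowMemory,
) -> WorkflowMemory:
    """Process test queries in order, growing the memory from the agent's own outputs."""
    for task in tasks:
        # The agent sees the workflows induced so far as extra guidance.
        trajectory = agent(task, memory.workflows)
        # Only trajectories judged successful are turned into workflows.
        if is_success(task, trajectory):
            memory.add(induce_workflow(task, trajectory))
    return memory


def stub_agent(task: str, workflows: List[Workflow]) -> List[Action]:
    # Placeholder for an LM-based agent; always returns the same search routine.
    return [("click", "search box"), ("type", task), ("click", "submit")]


memory = run_online(
    ["find laptop", "find phone"], stub_agent, lambda t, tr: True, WorkflowMemory()
)
for wf in memory.workflows:
    print(wf.name, wf.steps)
```

Because the toy induction strips the concrete arguments, the two queries collapse into one shared routine. That mirrors, in a very simplified way, the idea that abstract sub-routines are more reusable than concrete examples. An offline variant would run the same induction over annotated training examples before any test query is seen.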
Overall, the paper shows that AWM, a method for inducing and applying reusable workflows, can substantially improve agent performance and generalization on complex web navigation tasks, outperforming prior state-of-the-art approaches. The findings highlight the importance of equipping agents with the ability to extract and leverage common task routines, rather than solving each task separately, in order to tackle increasingly complex real-world problems.
Reference: https://arxiv.org/abs/2409.074...