Key Points

- Recent large multimodal models (LMMs), such as GPT-4V(ision) and Gemini, have pushed multimodal capabilities beyond traditional tasks like image captioning and visual question answering.

- This work explores the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website.

- The proposed SeeAct is a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. It is evaluated on the recent Mind2Web benchmark, and the results indicate substantial potential for LMM-based web agents.

- Websites pose both a new challenge and an opportunity for LMMs: screenshots of rendered websites are more visually complex than the images in most existing benchmarks, so understanding them requires a different approach.

- Grounding, i.e., converting the textual plan into precise actions on the website, remains a major challenge. Existing LMM grounding strategies are not effective for web agents, and the best grounding strategy developed in this work leverages both the HTML text and visuals.

- Experimental evaluation shows that SeeAct with GPT-4V is a strong generalist web agent, especially when oracle grounding is provided. However, grounding is still a major challenge, and there is a performance gap with oracle grounding, leaving room for further improvement.

- In comparison with text-only large language models (LLMs) specifically fine-tuned for web agents, SeeAct with GPT-4V demonstrates a performance advantage.

- In online evaluation, SeeAct with GPT-4V outperforms GPT-4 and FLAN-T5 on live websites, but there is a non-negligible discrepancy between online and offline evaluation because the same task can often be completed by more than one valid plan.

- GPT-4V achieves a higher whole-task success rate than GPT-4 and FLAN-T5, indicating the potential of LMMs for generalist web agents but also pointing towards challenges such as grounding and variability in web interactions.
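
The whole-task success rate in the last point can be made concrete with a small sketch. It assumes the usual offline convention that a task counts as successful only if every one of its steps matches the reference action. The record layout and function names are hypothetical illustrations, not taken from the paper's code.

```python
from typing import Dict, List


def step_success_rate(tasks: Dict[str, List[bool]]) -> float:
    """Fraction of individual steps whose predicted action matched the reference."""
    steps = [ok for results in tasks.values() for ok in results]
    return sum(steps) / len(steps) if steps else 0.0


def whole_task_success_rate(tasks: Dict[str, List[bool]]) -> float:
    """Fraction of tasks in which every step matched (assumed all-or-nothing rule)."""
    if not tasks:
        return 0.0
    # A task with no recorded steps is treated as a failure in this sketch.
    solved = sum(1 for results in tasks.values() if results and all(results))
    return solved / len(tasks)


# Hypothetical per-step outcomes for three tasks (True = step matched reference).
example = {
    "book_flight": [True, True, True],
    "find_recipe": [True, False, True],
    "rent_car": [True, True],
}
```

Here a single failed step makes "find_recipe" count as a failed task even though most of its steps were correct. Under this convention, an offline evaluation that compares against one reference plan would also mark a valid alternative plan as a failure, which is one way the online/offline discrepancy noted above can arise.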

Summary

The paper explores the potential of large multimodal models (LMMs) like GPT-4V and Gemini as generalist web agents that can follow natural language instructions to complete tasks on any given website. The researchers propose SeeAct, a web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. They evaluate the agent on the Mind2Web benchmark and develop a tool that allows running web agents on live websites. The findings show that GPT-4V has great potential for web agents: it successfully completes 50% of the tasks on live websites when its textual plans are manually grounded into actions. However, fine-grained visual grounding remains a major challenge, and the best grounding strategy developed in the paper still has a 20-25% performance gap with oracle grounding.

Comparing Performance of Large Multimodal Models
The paper also compares the performance of LMMs with text-only large language models (LLMs) on web agent tasks and explores different grounding strategies, including grounding via element attributes, textual choices, and image annotation. The authors find that despite the strengths of LMMs in visually understanding rendered webpages, grounding is still a major challenge, especially on complex webpage images. They also highlight the discrepancy between online and offline evaluations, emphasizing the importance of online evaluation for an accurate assessment of a model's capabilities.
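
To illustrate what grounding via textual choices might look like in practice, the following is a minimal sketch, not the authors' implementation. It assumes candidate elements have already been extracted from the HTML, presents them as lettered options, and parses the letter the model picks. The helper names, the element dictionary layout, and the "ELEMENT:" answer format are all assumptions made for illustration, and no model is actually called.

```python
import re
from string import ascii_uppercase
from typing import Dict, List, Optional


def format_choices(candidates: List[Dict[str, str]]) -> str:
    """Render candidate elements as lettered multiple-choice options.

    Only the first 26 candidates receive a letter in this sketch.
    """
    lines = []
    for letter, element in zip(ascii_uppercase, candidates):
        lines.append(f"{letter}. <{element['tag']}> {element['text']}")
    return "\n".join(lines)


def parse_choice(
    answer: str, candidates: List[Dict[str, str]]
) -> Optional[Dict[str, str]]:
    """Map a model answer containing 'ELEMENT: <letter>' back to a candidate.

    Returns None when no letter is found or the letter is out of range.
    """
    match = re.search(r"ELEMENT:\s*([A-Z])\b", answer)
    if match is None:
        return None
    index = ascii_uppercase.index(match.group(1))
    return candidates[index] if index < len(candidates) else None


# Hypothetical candidates and a hypothetical model reply.
candidates = [
    {"tag": "button", "text": "Search"},
    {"tag": "input", "text": "Destination"},
]
prompt_options = format_choices(candidates)
chosen = parse_choice("Type the city name first.\nELEMENT: B", candidates)
```

In this sketch the model only has to name an option, and the mapping back to a concrete page element is done by the surrounding code. Grounding via element attributes or image annotation would replace the lettered list with, respectively, a textual description of the target element or visual labels drawn on the screenshot.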

Societal Impacts and Limitations
Furthermore, the paper discusses potential societal impacts and limitations, such as limited experiment scale and safety concerns related to the real-world deployment of web agents. The authors highlight the possibilities and challenges of web agents in automating routine web tasks and discuss the need for further research to assess and mitigate the safety risks associated with web agents.

Conclusion
In conclusion, the paper demonstrates the potential of large multimodal models for generalist web agents, presents findings on the performance of different grounding strategies, and discusses the societal impact and limitations of web agents. The authors emphasize the importance of further research to improve visual grounding strategies and evaluate the safety implications of deploying web agents in real-world scenarios.

Reference: https://arxiv.org/abs/2401.01614