Key Points

1. Web automation streamlines tasks by automating common web actions, improving efficiency and reducing manual intervention. Traditional methods, such as wrappers, have limited adaptability and scalability when they encounter new websites. Generative agents powered by large language models (LLMs) also show poor performance and reusability in open-world scenarios.

2. The paper proposes a new paradigm that combines LLMs with crawlers, framed as a crawler generation task for vertical information web pages. The paradigm aims to help crawlers handle diverse and changing web environments more efficiently.

3. The authors introduce AUTOCRAWLER, a framework that leverages the hierarchical structure of HTML for progressive understanding. This two-stage framework uses top-down and step-back operations to learn from erroneous actions, prune HTML for better action generation, and handle the complexities of web structures.

4. The paper presents comprehensive experimental results with multiple LLMs, demonstrating the effectiveness of AUTOCRAWLER in the web crawler generation task. It outperforms the state-of-the-art baseline in this task.

5. The paper identifies challenges in using LLMs to generate crawlers, such as LLMs' limited understanding of HTML structures, the complexity of semi-structured data, and continued reliance on the LLM even for similar tasks, which leads to low efficiency when handling a large volume of web pages.

6. The authors also raise questions about the widespread use of LLMs for web automation. They emphasize the need to improve LLMs' ability to understand HTML and note the proposed framework's limitations with respect to existing web environments and reusability.
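
The reusability argument in points 2 and 5 can be made concrete with a small sketch. It assumes an extraction rule (here a single XPath) has already been produced once, and shows that rule being applied to several pages with the same layout without any further LLM call. The rule, the page snippets, and the function name `run_crawler` are all hypothetical illustrations, not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule, as if produced once by an LLM-assisted generation step.
RULE = ".//span[@class='price']"

# Two made-up pages that share the same layout.
PAGES = [
    "<html><body><h1>Widget</h1><span class='price'>$9.99</span></body></html>",
    "<html><body><h1>Gadget</h1><span class='price'>$24.50</span></body></html>",
]


def run_crawler(rule, pages):
    """Apply one fixed rule to every page; no model call is needed per page."""
    results = []
    for html in pages:
        node = ET.fromstring(html).find(rule)
        results.append(None if node is None else (node.text or "").strip())
    return results


prices = run_crawler(RULE, PAGES)
```

The cost of generating the rule is paid once, while the per-page cost is only an XPath lookup, which is the efficiency gain the paradigm is aiming for.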

Summary

The paper outlines the challenges of traditional web automation methods and proposes the AUTOCRAWLER framework in response. Traditional methods such as wrappers struggle with adaptability and scalability when facing new website structures, while generative agents powered by large language models (LLMs) show poor performance and reusability. The paper therefore introduces the crawler generation task and a paradigm that combines LLMs with crawlers to improve the reusability of web automation. AUTOCRAWLER is a two-stage framework that leverages the hierarchical structure of HTML for progressive understanding. Through top-down and step-back operations, it learns from erroneous actions and continuously prunes the HTML to generate better actions.
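
The interplay of top-down and step-back operations described above can be illustrated with a toy sketch. It is not the authors' implementation: the LLM is replaced by a hypothetical stub (`stub_llm`) that returns canned XPath guesses, the page is a made-up snippet, and the function names are assumptions chosen for this example. The loop proposes an XPath within the current scope (top-down), checks it against the page, and on failure climbs to the nearest ancestor that contains the target value and prunes the scope to that subtree (step-back).

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page used only for illustration.
PAGE = """<html><body>
  <div class="nav"><a>Home</a></div>
  <div class="product">
    <h1>Widget</h1>
    <div class="info">
      <span class="label">Price</span>
      <span class="price">$9.99</span>
    </div>
  </div>
</body></html>"""


def stub_llm(scope, history):
    """Hypothetical stand-in for an LLM call: returns canned XPath guesses."""
    guesses = [".//span", "./span[@class='price']"]
    return guesses[min(len(history), len(guesses) - 1)]


def step_back(root, node, target):
    """Climb from a failed node to the nearest ancestor whose text contains target."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    while node is not None and target not in "".join(node.itertext()):
        node = parent_of.get(node)
    return node


def build_action_sequence(root, target, propose, max_steps=5):
    """Return the list of XPath actions tried, ending with one that yields target."""
    scope, history = root, []
    for _ in range(max_steps):
        xpath = propose(scope, history)  # top-down: guess within the current scope
        history.append(xpath)
        node = scope.find(xpath)
        if node is not None and (node.text or "").strip() == target:
            return history  # the guess was verified against the page
        # step-back: prune the scope to the enclosing subtree that holds the target
        start = node if node is not None else scope
        pruned = step_back(root, start, target)
        scope = pruned if pruned is not None else root
    return None


root = ET.fromstring(PAGE)
actions = build_action_sequence(root, "$9.99", stub_llm)
```

In this run the first guess lands on the label span, the step-back narrows the scope to the `info` div, and the second guess succeeds inside that smaller subtree. A real system would feed the pruned HTML back to the model instead of reading from a fixed list of guesses.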

The paper reports comprehensive experiments with multiple closed-source and open-source LLMs. The results show that AUTOCRAWLER outperforms the existing state-of-the-art baseline in the web crawler generation task. The framework also improves the performance of LLMs in generating action sequences, and stronger LLMs tend to generate shorter action sequences. The paper acknowledges limitations of the framework, namely its restriction to the information extraction task for vertical webpages and its reliance on the performance of the backbone LLMs. As future work, the authors propose enhancing LLMs' ability to understand HTML and extending the framework to existing web environments. The paper also states that it adheres to ethical standards in its use of human annotations and applies data security measures to the datasets used.

Reference: https://arxiv.org/abs/2404.127...