INTRODUCTION

The paper introduces tool learning, which combines large language models (LLMs) with application programming interfaces (APIs) to accomplish complex tasks. While current open-source LLMs are versatile, they lack sophistication in understanding human instructions and interacting with APIs, and previous work on instruction tuning for tool use suffers from limited APIs, constrained scenarios, and inferior planning and reasoning methods. To address these challenges, the authors propose ToolLLM, a framework spanning data construction, model training, and evaluation. They collect a high-quality instruction-tuning dataset, ToolBench, built from real-world REST APIs on RapidAPI; generate diverse instructions covering both single-tool and multi-tool scenarios; and annotate high-quality responses with a novel planning and reasoning strategy, the depth-first search-based decision tree (DFSDT). They also develop an automatic evaluator, ToolEval, with pass rate and win rate metrics to assess tool-use capabilities.

By fine-tuning LLaMA on ToolBench, the authors obtain ToolLLaMA, which handles single-tool and complex multi-tool instructions impressively and generalizes robustly to previously unseen APIs. DFSDT significantly improves annotation efficiency, and a neural API retriever is incorporated to recommend relevant APIs. The authors hope their work will inspire further research on instruction tuning and tool use.


Category


The Category section describes how the tools and APIs used in this work are organized. An API is a set of rules and protocols that allows different software applications to communicate with each other; on RapidAPI, each tool bundles one or more related APIs behind a common interface.

Because the tools on RapidAPI are developed by many different providers, they vary in data formats, parameter conventions, and response structures, which makes combining them nontrivial. RapidAPI addresses this by organizing tools into coarse-grained categories, which reflect a tool's topical domain, and finer-grained collections, which group APIs sharing similar characteristics and functionality.

This hierarchy supports interoperability: it helps identify tools whose functionalities complement one another, so that multi-tool instructions can combine them in realistic ways, and it supplies contextual information that helps the language model understand and select APIs. Overall, the Category section describes the organizational structure that ToolBench inherits from RapidAPI and that later stages of the pipeline rely on when sampling API combinations.


DATASET CONSTRUCTION


The ToolBench dataset construction process is outlined in three stages: API collection, instruction generation, and solution path annotation. The process utilizes ChatGPT, requiring minimal human supervision and enabling easy extension to new APIs.

The section starts by introducing RapidAPI, an API marketplace that connects developers with real-world APIs. RapidAPI organizes APIs into coarse-grained categories and more fine-grained collections based on characteristics and functionalities. This hierarchical structure provides a valuable resource for the language model to understand and utilize APIs effectively. The dataset construction then involves crawling information for each tool, including tool name, description, host URL, and available APIs, recorded with details such as name, description, HTTP method, parameters, code snippets, and example responses.

To ensure the reliability and functionality of the tool set, a rigorous filtering process is performed. Initial testing is conducted to discard APIs that do not meet basic functionality criteria. Example response evaluation is then carried out, assessing response time and quality. APIs with consistently long response times or low-quality responses are filtered out, resulting in 3,451 high-quality tools and 16,464 APIs being retained for ToolBench.
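As an illustration of this filtering stage, here is a minimal sketch under stated assumptions: the latency and length thresholds are invented for the example (the paper does not publish exact values), `is_usable` is a hypothetical helper, and real records would come from the RapidAPI crawl.

```python
import time
import requests

MAX_LATENCY_S = 10       # assumed threshold; not the paper's exact value
MIN_RESPONSE_CHARS = 2   # discard empty or near-empty example responses

def is_usable(api):
    """Basic functionality test: call the API once and check the response."""
    try:
        start = time.time()
        resp = requests.request(api["method"], api["url"],
                                params=api.get("example_params", {}),
                                timeout=3 * MAX_LATENCY_S)
        latency = time.time() - start
    except requests.RequestException:
        return False                   # unreachable or broken API
    if resp.status_code != 200:
        return False                   # fails the basic functionality check
    if latency > MAX_LATENCY_S:
        return False                   # consistently slow responses
    return len(resp.text.strip()) >= MIN_RESPONSE_CHARS

apis = [  # illustrative crawled record; real ones come from RapidAPI
    {"method": "GET",
     "url": "https://example-weather.p.rapidapi.com/forecast",
     "example_params": {"city": "Tokyo"}},
]
filtered = [api for api in apis if is_usable(api)]
```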

API Response Compression

The API Response Compression section focuses on handling long, redundant API responses given the limited context length of large language models (LLMs). The goal is to compress these responses while preserving vital information.

The researchers observe that some API responses contain unnecessary data, making them too long to be utilized effectively by LLMs. To mitigate this issue, they employ response compression techniques. Since each API has a fixed response format, they utilize ChatGPT, a conversational language model, to analyze a response example and remove unimportant keys within the response. This compression helps reduce the response length while retaining crucial details.

The process involves providing ChatGPT with a prompt containing relevant information for each API, including tool documentation, which comprises the tool and API descriptions, parameters, and an example response. Additionally, it includes three in-context learning examples for each API, consisting of an original API response and a compressed response schema created by experts. These examples help ChatGPT understand the functionality and compression strategies for all APIs.

During inference, if the length of an API response exceeds 2048 tokens, the compression technique is applied. It starts by removing unimportant information, and if the compressed response remains longer than 2048 tokens, only the first 2048 tokens are retained in the final compressed response.
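A minimal sketch of this inference-time procedure, assuming a `chat(prompt)` helper that queries ChatGPT and using `tiktoken` for token counting; the exact prompt wording is ours, while the 2048-token budget comes from the paper:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
MAX_TOKENS = 2048  # context budget for a single API response (per the paper)

def compress_response(raw_response: str, tool_doc: str, examples: str, chat) -> str:
    """Compress an API response only when it exceeds the token budget."""
    if len(enc.encode(raw_response)) <= MAX_TOKENS:
        return raw_response                          # short enough: keep as-is
    # Prompt = tool/API documentation + three expert-written in-context
    # examples of (original response -> compressed schema), then the response.
    prompt = (f"{tool_doc}\n\n{examples}\n\n"
              "Remove unimportant keys from this API response, keeping the "
              f"vital information:\n{raw_response}")
    compressed = chat(prompt)                        # ask ChatGPT to drop keys
    tokens = enc.encode(compressed)
    if len(tokens) > MAX_TOKENS:                     # still too long:
        compressed = enc.decode(tokens[:MAX_TOKENS]) # hard-truncate as fallback
    return compressed
```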

In summary, the researchers propose a response compression approach to tackle lengthy and redundant API responses for better utilization within LLMs. They employ ChatGPT to analyze and compress the responses while maintaining important details, optimizing their usage in various applications.

INSTRUCTION GENERATION

The instruction generation process emphasizes two properties: diversity and multi-tool usage. To ensure the generalizability and robustness of the resulting LLM, a bottom-up approach is adopted: instructions are crafted from the collected APIs, with sampling strategies designed to cover all APIs and their combinations. ChatGPT is prompted with sampled APIs and their documented functionalities to generate instructions together with the APIs relevant to each one, using seed examples written by human experts for in-context learning. The sampling strategies are adjusted for single-tool and multi-tool scenarios, taking sparsity and the RapidAPI hierarchy into account. After filtering out low-quality generations, over 200k qualified (instruction, relevant APIs) pairs are collected.

For solution path annotation, the decision-making process is cast as a multi-round conversation in which ChatGPT generates each action based on previous interactions and real API responses. The decision space at each step consists of a thought, the available APIs, and their possible parameters. The function-call feature of gpt-3.5-turbo-16k is leveraged by treating each API as a special function: all sampled APIs are fed to ChatGPT, expanding the action space. Two additional functions, "Finish with Final Answer" and "Finish by Giving Up," are defined to terminate the action sequence in success and failure scenarios, respectively.
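To make the action-space construction concrete, here is a hedged sketch of how each sampled API might be exposed through the function-call interface; the schemas and names below are illustrative, not the paper's exact definitions:

```python
sampled_apis = [  # illustrative record; real ones come from the crawled docs
    {"name": "get_weather_forecast",
     "description": "Return the weather forecast for a city.",
     "parameters": {"type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"]}},
]

def api_to_function(api):
    """Map a crawled API record onto an OpenAI function schema."""
    return {"name": api["name"],
            "description": api["description"],
            "parameters": api["parameters"]}

# The two terminal actions described above, exposed as ordinary functions.
finish_with_final_answer = {
    "name": "finish_with_final_answer",
    "description": "End the episode and return the final answer.",
    "parameters": {"type": "object",
                   "properties": {"final_answer": {"type": "string"}},
                   "required": ["final_answer"]},
}
finish_by_giving_up = {
    "name": "finish_by_giving_up",
    "description": "End the episode when the task cannot be completed.",
    "parameters": {"type": "object", "properties": {}},
}

# Passed as the `functions` argument of a gpt-3.5-turbo-16k chat completion,
# so every sampled API becomes one callable action in the decision space.
functions = ([api_to_function(a) for a in sampled_apis]
             + [finish_with_final_answer, finish_by_giving_up])
```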

Depth First Search-based Decision Tree

The paper then discusses the limitations of conventional decision-making methods such as CoT and ReACT, which suffer from two main problems: error propagation and limited exploration. Error propagation occurs when a mistaken initial action leads to further errors and traps the model in a faulty loop, for example by repeatedly calling an API in the wrong way or hallucinating non-existent APIs. Limited exploration arises because, despite the effectively infinite action space, CoT and ReACT follow only a single direction, so only a small part of the space is ever explored. As a result, even advanced models like GPT-4 often fail to find a valid solution path, making solution annotation difficult. These limitations motivate an improved approach to decision making.


The authors propose using a decision tree to expand the search space and improve the chances of finding a valid path. DFSDT permits multiple reasoning paths: the model can continue along a promising path or abandon a node whose API call failed and expand a new one. To diversify child nodes and broaden the search, previously generated nodes are given to ChatGPT as input when prompting it to generate distinct new nodes. Depth-first search (DFS) is preferred over breadth-first search (BFS) because finding a single valid path is sufficient, whereas BFS would require excessive OpenAI API calls before reaching a terminal node. To balance effectiveness with cost, the authors perform a pre-order traversal (a DFS variant), which achieves performance similar to full DFS while significantly reducing cost. In total they generate 12,657 instruction-solution pairs to train ToolLLaMA; although more training instances could be constructed, the authors find that 12,657 already yield satisfying generalization performance.
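The following is a simplified, self-contained sketch of the pre-order traversal; `expand`, `is_terminal`, and `is_valid` are toy stand-ins (in ToolLLM they would be ChatGPT calls and real API executions), and a node is represented as the list of actions taken so far:

```python
import random

ACTIONS = ["call_api_a", "call_api_b", "finish"]

def expand(path, siblings):
    """Toy stand-in for prompting ChatGPT for a new child distinct from siblings."""
    taken = {s[-1] for s in siblings}                 # avoid repeating a sibling
    options = [a for a in ACTIONS if a not in taken] or ACTIONS
    return path + [random.choice(options)]

def is_terminal(path):
    return (bool(path) and path[-1] == "finish") or len(path) >= 4

def is_valid(path):
    return bool(path) and path[-1] == "finish"        # ended with a final answer

def dfsdt(path, max_children=2):
    """Pre-order DFS: return the first valid solution path found, else None."""
    if is_terminal(path):
        return path if is_valid(path) else None
    siblings = []                      # earlier children are fed back to the
    for _ in range(max_children):      # model to encourage distinct branches
        child = expand(path, siblings)
        siblings.append(child)
        result = dfsdt(child, max_children)
        if result is not None:         # one valid path suffices: stop searching
            return result
    return None                        # every child failed: backtrack

print(dfsdt([]))
```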


EXPERIMENTS


The experiments evaluate ToolLLaMA, a model for completing tasks with APIs. Section 3.1 introduces the evaluation metrics, section 3.2 assesses the efficacy of the API retriever and DFSDT, and section 3.3 presents the main experiments and analyses.

Due to the temporal variability of APIs, it was impractical to have a fixed ground-truth solution for each test instruction. To ensure consistency, the same API version was used across different models during evaluation. To make evaluation efficient, a machine evaluator called ToolEval was developed based on AlpacaEval, which incorporated two evaluation metrics: pass rate and win rate.

Pass rate measures the proportion of instructions successfully completed within a limited number of actions; it assesses executability and requires the model to explicitly finish the process with a terminating action. Win rate compares two solution paths for a given instruction against predefined criteria, as judged by the ChatGPT evaluator over multiple annotations. A hedged formalization of both metrics follows.
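The notation here is ours, consistent with the descriptions above rather than copied from the paper:

```latex
\mathrm{PassRate} =
  \frac{\#\{\text{instructions completed within the action limit}\}}
       {\#\{\text{test instructions}\}},
\qquad
\mathrm{WinRate} =
  \frac{\#\{\text{comparisons in which the candidate path is preferred}\}}
       {\#\{\text{total comparisons}\}}
```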

To validate the reliability of the ChatGPT evaluator, human preference (win rate) annotations were collected for solution pairs generated by three different methods. The ChatGPT evaluator demonstrated a high correlation of 75.8% with human annotators, indicating its similarity to human preferences. Additionally, the automatic evaluator displayed lower variance and higher consistency than humans when annotating multiple times for the same instruction.

Finally, it was found that ReACT and DFSDT consumed a similar number of OpenAI API calls per instruction.


API Retriever


The API retriever focuses on retrieving relevant APIs for a given instruction. Following Sentence-BERT, the authors train a dense retriever based on BERT-BASE: the model encodes the instruction and each API document into embeddings and scores relevance by the similarity of those embeddings. During training, the relevant APIs produced during instruction generation serve as positive examples, and a few other APIs are sampled as negatives for contrastive learning. The retriever is compared against two baselines, BM25 and OpenAI's text-embedding-ada-002, with retrieval performance measured by NDCG (normalized discounted cumulative gain). The results show that the API retriever consistently outperforms both baselines across instruction types, indicating its effectiveness, and that single-tool instructions are easier for API retrieval than multi-tool ones.

The authors also validate the superiority of DFSDT over ReACT for solution path annotation. DFSDT outperforms the baselines in all scenarios while being more efficient and cost-saving. Notably, DFSDT helps more on harder instructions than on simpler ones, suggesting it is well suited to difficult, complex instructions; including such "hard examples" in the dataset allows tool-use capabilities to be fully assessed in complex scenarios.
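To make the retriever recipe concrete, here is a minimal sketch in the Sentence-BERT style described above, using the `sentence-transformers` library; the training pair and API documents are illustrative, and `MultipleNegativesRankingLoss` (in-batch negatives) stands in for whatever negative-sampling scheme the authors actually used:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, util
from torch.utils.data import DataLoader

# BERT-base encoder; instruction and API document share the same encoder.
model = SentenceTransformer("bert-base-uncased")

# (instruction, relevant API document) positives; other in-batch documents
# serve as negatives for contrastive learning.
train_examples = [
    InputExample(texts=["Will it rain in Tokyo tomorrow?",
                        "Weather API: returns the forecast for a given city."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1)

# Retrieval: rank API documents by embedding similarity to the instruction.
api_docs = ["Weather API: returns the forecast for a given city.",
            "Stock API: returns real-time equity quotes."]
doc_emb = model.encode(api_docs, convert_to_tensor=True)
query_emb = model.encode("Will it rain in Tokyo tomorrow?", convert_to_tensor=True)
hits = util.semantic_search(query_emb, doc_emb, top_k=2)  # best-matching APIs
```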


MAIN EXPERIMENTS


The main experiments focus on ToolLLaMA, obtained by fine-tuning LLaMA on the instruction-solution pairs. The original LLaMA is pre-trained with a sequence length of 2048, which is insufficient for long API responses, so positional interpolation is used to extend the context length to 8192 (sketched below). The model is trained in a multi-round conversation mode.

Additional analyses examine three variants of ToolLLaMA: replacing the ground-truth APIs with those recommended by the API retriever, degrading the reasoning method from DFSDT to ReACT, and tuning LLaMA with LoRA instead of full-parameter fine-tuning. Each variant is compared with the default ToolLLaMA on win rate. Using the API retriever only slightly decreases the pass rate relative to the ground-truth API set, and the retriever variant's average win rate of 49.8 is close to the 50 that would indicate parity, demonstrating the retriever's excellent ability to recommend relevant APIs. Comparing DFSDT with ReACT shows that DFSDT achieves a significantly higher pass rate and is preferred across scenarios, highlighting its superiority in decision making. Moreover, DFSDT improves ToolLLaMA more than it improves ChatGPT, indicating that expanding the search space matters most for LLMs with weaker reasoning capabilities and suggesting the practical utility of applying DFSDT to small-scale models.
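As background, positional interpolation rescales position indices so that a longer input is mapped back into the position range seen during pre-training. Below is a minimal sketch for rotary position embeddings (RoPE, as used in LLaMA), where the 2048-to-8192 extension implies a scale factor of 0.25; this is our illustration, not the authors' code:

```python
import torch

def rope_angles(seq_len, head_dim, base=10000.0, scale=2048 / 8192):
    """Rotary-embedding angles with interpolated (rescaled) position indices."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() * scale   # the interpolation step:
    angles = torch.outer(positions, inv_freq)           # 8192 positions squeezed
    return torch.cos(angles), torch.sin(angles)         # into the [0, 2048) range

cos, sin = rope_angles(seq_len=8192, head_dim=128)  # head_dim=128 in LLaMA-7B
```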


ToolLLaMA with Better Parameter Efficiency


This section examines the parameter efficiency of ToolLLaMA, which by default is obtained by fine-tuning all of LLaMA's parameters. To improve parameter efficiency, the researchers apply a representative parameter-efficient tuning method (LoRA, as noted in the main experiments) and analyze its impact. The results reported in a table show that this approach improves parameter efficiency but at a cost in performance; the researchers suggest that future work should develop more advanced techniques that achieve parameter efficiency without compromising performance.
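A minimal sketch of LoRA-style tuning with the Hugging Face `peft` library; the checkpoint, rank, and target modules are illustrative choices, not the paper's exact settings:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)  # freezes base weights, adds adapters
model.print_trainable_parameters()     # only a tiny fraction remains trainable
```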


RELATED WORK


This section discusses related work in three areas: tool learning, instruction tuning and data augmentation, and prompting LLMs for decision making.

In tool learning, open-source LLMs are not as proficient as state-of-the-art (SOTA) LLMs at using tools, and the mechanism by which SOTA LLMs acquire tool-use ability is not well understood; the paper aims to bridge this gap.

Instruction tuning improves LLMs' ability to understand human instructions and generate appropriate responses. Self-instruct proposes generating high-quality instruction-tuning data from SOTA LLMs, but tool learning is more challenging because of the diversity of APIs and the complexity of multi-tool instructions. Existing tool-learning datasets and methods remain at an early stage and do not effectively address real human needs; the ToolBench developed in this research aims to address these limitations.

Prompting LLMs for decision making involves decomposing high-level tasks into sub-tasks and generating grounded plans. ReACT integrates reasoning with acting but lacks a mechanism for decision retraction, leading to a cascade of errors. Reflexion addresses this issue by asking LLMs to reflect on previous failures, and DFSDT extends this approach by allowing LLMs to assess different reasoning paths and select the most promising one. Unlike tree-of-thought (ToT) reasoning, which is tailored specifically to its selected task set, DFSDT targets general decision-making problems with infinite decision spaces and is designed for diverse decision-making tasks.


CONCLUSION


In this research paper, the authors introduce a method to elicit tool-use capabilities within large language models (LLMs). They present a dataset called ToolBench, which includes over 16,000 real-world APIs and various use-case scenarios for single-tool and multi-tool tasks. The construction of ToolBench is done using ChatGPT and requires minimal human supervision. The authors also propose a method called DFSDT to enhance the planning and reasoning ability of LLMs, enabling them to navigate reasoning paths strategically. To evaluate tool learning efficiently, they develop an automatic evaluator called ToolEval.

By fine-tuning LLaMA on ToolBench, the authors obtain a model called ToolLLaMA that performs on par with ChatGPT and demonstrates remarkable generalization ability to unseen APIs. They also develop a neural API retriever that recommends relevant APIs for each instruction, which can be integrated with ToolLLaMA as a more automated tool-use pipeline.