Key Points

1. Large Language Models (LLMs) have made significant strides in tasks like mathematical reasoning and program synthesis, but they still struggle to use tools via API calls effectively: they often generate inaccurate input arguments and hallucinate incorrect API usage.

2. The paper introduces Gorilla, a fine-tuned LLaMA-based model that surpasses GPT-4 at writing API calls. Paired with a document retriever, Gorilla adapts to document changes and mitigates the hallucination commonly seen when prompting LLMs directly.

3. The researchers introduce APIBench, a comprehensive dataset of HuggingFace, TorchHub, and TensorHub APIs, to evaluate the model's ability to generate correct API calls, and they demonstrate the successful integration of a document retrieval system with Gorilla.

4. Recent advances in LLMs have transformed many downstream domains and enabled new capabilities. While prior work has integrated tools into LLMs, it has generally focused on a small set of specific, well-documented tools rather than exploring a vast array of tools (API calls) in an open-ended fashion.

5. The paper explores how self-instruct fine-tuning and retrieval enable LLMs to accurately select from a large, overlapping, and frequently changing set of tools expressed through APIs and their documentation.

6. Gorilla significantly outperforms GPT-4 in the functional accuracy of its API calls and reduces hallucination errors; its retrieval-aware training is what enables it to adapt to changes in API documentation.

7. Evaluations of Gorilla with different retrieval methods show that incorporating the ground-truth retriever in the fine-tuning pipeline yields significantly better results, while sub-optimal retrievers at evaluation time can misguide the model and cause more errors.

8. Gorilla's retriever-aware training allows it to adapt to test-time changes in API documentation, so it maintains accuracy and relevance as API sources shift over time; this enhances its practical utility for making API calls (see the sketch following this list).

9. Gorilla makes reliable API calls to ML models without hallucination, adapts to test-time changes in API usage, and can satisfy user-specified constraints (such as accuracy requirements) when selecting APIs, demonstrating its potential for effective tool use via API calls and for fairer, better-optimized use of machine learning.
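To make the retrieval-aware setup in points 7 and 8 concrete, here is a minimal, self-contained sketch of how a retrieved API document can be appended to a user instruction before generation. The documentation strings and the toy string-similarity retriever are illustrative assumptions, not the paper's exact implementation (the paper evaluates retrievers such as BM25 and GPT-Index, plus an oracle retriever).

```python
# Minimal sketch of retrieval-aware prompting (illustrative, not the
# authors' exact pipeline). A retriever picks the most relevant API
# documentation, which is appended to the user instruction so the model
# can ground its generated API call in up-to-date docs.

from difflib import SequenceMatcher  # toy stand-in for BM25 / embedding retrieval

# Hypothetical documentation snippets, keyed by task name.
API_DOCS = {
    "translation": "pipeline('translation_en_to_fr', model=...): translate text",
    "image-classification": "pipeline('image-classification', model=...): label images",
    "object-detection": "pipeline('object-detection', model=...): find objects in images",
}

def retrieve_doc(query: str) -> str:
    """Return the documentation entry whose key best matches the query (toy retriever)."""
    best_key = max(API_DOCS, key=lambda k: SequenceMatcher(None, query.lower(), k).ratio())
    return API_DOCS[best_key]

def build_prompt(instruction: str) -> str:
    """Append the retrieved doc, mirroring how Gorilla is prompted with retrieval."""
    doc = retrieve_doc(instruction)
    return f"{instruction}\nUse this API documentation for reference: {doc}"

print(build_prompt("Find all the objects in this photo"))
```

Because training examples pair instructions with retrieved documentation, the model learns to read the appended docs rather than memorize API signatures, which is what lets it track documentation changes at test time.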

Summary

In this paper, the authors focus on enhancing the capability of Large Language Models (LLMs) to use tools effectively via API calls. They introduce Gorilla, a fine-tuned LLaMA-based model that outperforms state-of-the-art LLMs at writing API calls and can adapt to test-time document changes. The authors also introduce APIBench, a comprehensive dataset of HuggingFace, TorchHub, and TensorHub APIs, and demonstrate Gorilla's successful integration with a document retriever.
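For concreteness, the following is a hedged sketch of what a single instruction-API pair in such a dataset might look like; the field names are assumptions for exposition, not APIBench's published schema.

```python
# Illustrative shape of one instruction-API pair (field names are
# assumptions for exposition, not APIBench's exact schema).
example_pair = {
    "instruction": "I want to translate an English sentence into French.",
    "domain": "HuggingFace",
    "api_call": "pipeline('translation_en_to_fr', model='t5-base')",
    "api_doc": "pipeline(task, model=None, ...): build an inference pipeline "
               "for the given task; see the model card for supported inputs.",
}
```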

Limitations of LLMs in Utilizing Tools
The study surveys recent advances in LLMs and highlights their limitations in using tools exposed through API calls. The authors discuss the potential for LLMs to become the primary interface to computing infrastructure and the web, but note that prior integration of tools into LLMs has focused on specific, well-documented APIs. In contrast, Gorilla is trained to use a large, overlapping, and changing set of tools expressed through APIs and their documentation.

Dataset Collection and Model Evaluation
The authors describe in detail the process of dataset collection, instruction generation, and model training. They also benchmark Gorilla against other LLMs, explore how different retrieval methods affect the model's performance in making API calls, and evaluate the model's ability to reason under constraints. The findings show that Gorilla significantly outperforms other LLMs in accurately selecting API calls and in reducing hallucination errors. Additionally, the study demonstrates Gorilla's ability to adapt to test-time changes in API documentation and to reason about API calls under constraints.
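As a rough illustration of the instruction-generation step, the sketch below shows a self-instruct style loop in which an LLM is given a few hand-written (API, instruction) demonstrations and asked to invent a realistic user instruction for a new API entry. `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompt wording is an assumption, not the authors' exact template.

```python
# Hedged sketch of self-instruct style instruction generation: the LLM is
# shown in-context (API, instruction) demonstrations and asked to invent
# a realistic user instruction for a new API entry.

from typing import Callable

FEW_SHOT_DEMOS = (
    "API: pipeline('translation_en_to_fr')\n"
    "Instruction: Translate 'good morning' into French.\n"
    "\n"
    "API: pipeline('image-classification')\n"
    "Instruction: Tell me what kind of animal is in this picture.\n"
    "\n"
)

def make_instruction(api_doc: str, call_llm: Callable[[str], str]) -> str:
    """Generate one synthetic user instruction for an API via an LLM."""
    prompt = (
        "Write a realistic user request that the final API could satisfy.\n\n"
        + FEW_SHOT_DEMOS
        + f"API: {api_doc}\nInstruction:"
    )
    return call_llm(prompt)

# Usage (with any LLM client of your choice):
#   instruction = make_instruction("pipeline('object-detection')", my_llm)
```

Running such a loop once per API entry is one plausible way an instruction-API corpus of this kind can be assembled at scale.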

The paper concludes by discussing the implications of the research, including the potential for LLMs to interact more effectively with tools in the wider world through APIs, and announces the release of an extensive dataset of over 11,000 instruction-API pairs for studying and benchmarking API usage.

Reference: Patil et al., "Gorilla: Large Language Model Connected with Massive APIs," arXiv:2305.15334. https://arxiv.org/abs/2305.15334