Summary

The study introduces FOLLOWIR, a benchmark for evaluating how effectively Information Retrieval (IR) models follow real-world instructions. The motivation arises from the ability of Large Language Models (LLMs) to understand long, complex instructions, which enables diverse user tasks. Although many recent IR models are built on LLMs, they primarily take short queries as input and do not accommodate instructions.

The novelty of the FOLLOWIR dataset is that it pairs a rigorous instruction-following evaluation benchmark with a training set for teaching IR models to follow real-world instructions. It builds on the long history of TREC conferences, which provide human annotators with instructions to determine document relevance. The dataset contains new human annotations on three deeply judged corpora: TREC Robust 2004, TREC Common Core 2017, and TREC News 2021.

The evaluation benchmark uses a new pairwise evaluation framework to measure how well IR models follow instructions. It reveals that existing retrieval models fail to use instructions correctly, treating them as sources of basic keywords and struggling to understand long-form information. Despite this, the study shows that IR models can learn to follow complex instructions: the new FOLLOWIR-7B model improves significantly (by over 13%) after fine-tuning on the training set. The authors release both the benchmark for measuring instruction-following ability and the training data for teaching retrieval models to better follow instructions.
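As a rough illustration of the pairwise idea, the sketch below scores how a document's rank shifts when the instruction given to the model is altered. The function name, sign convention, and scoring formula are assumptions chosen for exposition, not the paper's exact evaluation metric.

```python
# Illustrative sketch only (hypothetical names and formula): a simplified
# pairwise rank-change score. The paper's actual metric may differ.

def rank_change_score(ranks_original: dict, ranks_altered: dict) -> float:
    """Average signed rank change for documents ranked under both an
    original and an altered instruction.

    ranks_original: doc_id -> rank (1 = best) under the original instruction
    ranks_altered:  doc_id -> rank under the altered instruction
    A positive contribution means the document moved up; negative, down.
    """
    scores = []
    for doc_id, r_og in ranks_original.items():
        r_new = ranks_altered[doc_id]
        if r_new < r_og:                      # document rose in the ranking
            scores.append(r_og / r_new - 1)
        else:                                 # document fell or stayed put
            scores.append(1 - r_new / r_og)
    return sum(scores) / len(scores) if scores else 0.0


# Toy usage: the altered instruction demotes document "d2" from rank 2 to 5,
# so its negative contribution pulls the overall score down.
print(rank_change_score({"d1": 1, "d2": 2}, {"d1": 1, "d2": 5}))  # -0.75
```

A model that actually reads the instruction should move documents in the expected direction when the instruction changes, while a model that ignores it will leave the ranking essentially unchanged.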

Insights into Retrieval Models' Instruction Usage
The research examines how well retrieval models use instructions and argues for moving beyond ad-hoc keyword search toward instruction-based retrieval, which lets experts narrow in on complex information needs using flexible natural language.

Effectiveness of Different IR Models
The study also reports results on how effectively a range of IR models use instructions, highlighting the limitations of existing systems. A detailed ablation analysis probes the behavior of current models, showing why they struggle to use instructions effectively and how fine-tuning on a training set of longer instructions can help. Overall, the research is a valuable contribution to the IR community's understanding of, and progress toward, instruction-following retrieval models.
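To make concrete how instructions are typically supplied to such models, here is a minimal sketch in which an instruction is simply concatenated with the query before encoding. The model choice, prompt format, and example texts are assumptions for illustration, not the specific setup evaluated in the paper.

```python
# Hypothetical sketch: give an instruction to a bi-encoder retrieval model
# by prepending it to the query. Not the paper's exact configuration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

instruction = ("Relevant documents must discuss commercial fishing "
               "regulations; recreational fishing is not relevant.")
query = "fishing quotas"
documents = [
    "New EU quotas limit commercial cod catches in the North Sea.",
    "Tips for weekend anglers: the best lures for summer bass fishing.",
]

doc_embs = model.encode(documents)
q_plain = model.encode(query)                      # query only
q_instr = model.encode(instruction + " " + query)  # instruction + query

# Comparing the two score vectors shows how much (or how little)
# the instruction shifts the ranking of the two documents.
print(util.cos_sim(q_plain, doc_embs))
print(util.cos_sim(q_instr, doc_embs))
```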

Acknowledgements and Credits
The paper acknowledges that both the original TREC annotations and the newly gathered annotations may contain errors, but maintains that the dataset is still useful for measuring instruction following. It also credits funding sources and contributors whose feedback improved the work's quality and reliability.

Reference: https://arxiv.org/abs/2403.152...