Key Points

1. The paper aims to enable long-context large language models (LLMs) to generate responses with fine-grained sentence-level citations, improving their faithfulness and verifiability.

2. The authors introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in Long-Context Question Answering with Citations (LQAC), measuring both correctness and citation quality; its results reveal considerable room for improvement.

3. The authors propose CoF (Coarse to Fine), a novel pipeline that uses off-the-shelf LLMs to automatically generate long-context QA instances with precise sentence-level citations, and leverage it to construct LongCite-45k, a large-scale supervised fine-tuning (SFT) dataset for LQAC.

4. The authors train LongCite-8B and LongCite-9B on the LongCite-45k dataset, enabling them to generate an accurate response and fine-grained sentence-level citations in a single output (a format sketch follows this list).

5. Evaluation results on LongBench-Cite show that the trained LongCite models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4.

6. The authors find that SFT with citation information effectively reduces hallucinations and leads to more uniform utilization of the context by the LLMs.

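To make the single-output format in point 4 concrete, here is a minimal sketch of parsing a citation-annotated response. The `<cite>[s-e]</cite>` tag syntax, the sample text, and the `parse_cited_statements` helper are illustrative assumptions, not necessarily the exact format used in the paper.

```python
import re

# Hypothetical output format: each statement is followed by a citation tag
# pointing to a span of numbered context sentences, e.g. <cite>[12-15]</cite>.
SAMPLE_RESPONSE = (
    "The treaty was signed in 1848.<cite>[12-15]</cite> "
    "It ended the war between the two countries.<cite>[40-41]</cite>"
)

def parse_cited_statements(response: str) -> list[tuple[str, tuple[int, int]]]:
    """Split a response into (statement, cited sentence span) pairs."""
    pattern = re.compile(r"(.*?)<cite>\[(\d+)-(\d+)\]</cite>\s*", re.DOTALL)
    return [
        (statement.strip(), (int(start), int(end)))
        for statement, start, end in pattern.findall(response)
    ]

for statement, span in parse_cited_statements(SAMPLE_RESPONSE):
    print(f"{statement!r} -> cites sentences {span}")
```

Keeping the citations inline, rather than in a separate pass, is what lets a single generation produce both the answer and its evidence pointers.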

Summary

Addressing Challenges Posed by LLMs
The research paper "LongCite: Enabling LLMs to Generate Fine-Grained Citations in Long-Context QA" addresses a key limitation of current long-context large language models (LLMs): their responses lack citations, which makes user verification difficult and raises trustworthiness concerns due to potential hallucinations. The paper aims to enable long-context LLMs to generate responses with fine-grained, sentence-level citations, thereby improving their faithfulness and verifiability. To this end, it introduces LongBench-Cite, an automated benchmark for evaluating current LLMs' performance on Long-Context Question Answering with Citations (LQAC), and proposes CoF (Coarse to Fine), a novel pipeline for automatically generating long-context QA instances with precise sentence-level citations.
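The coarse-to-fine idea can be illustrated with a short sketch: first ask an LLM which coarse chunks of the context support a statement, then narrow those chunks down to the exact supporting sentences. The `llm` stub, the chunking granularity, the prompt wording, and the function names are assumptions for illustration; the paper's actual pipeline (including any filtering steps) may differ.

```python
from typing import Callable

# Stand-in for an off-the-shelf LLM call; in practice this would query an API.
LLM = Callable[[str], str]

def coarse_to_fine_citations(context_sentences: list[str],
                             statement: str,
                             llm: LLM,
                             chunk_size: int = 10) -> list[int]:
    """Sketch of a coarse-to-fine citation step for one answer statement."""
    chunks = [context_sentences[i:i + chunk_size]
              for i in range(0, len(context_sentences), chunk_size)]

    # Coarse stage: chunk-level evidence selection.
    chunk_prompt = (
        "Which chunk indices support the statement?\n"
        + "\n".join(f"[Chunk {i}] " + " ".join(c) for i, c in enumerate(chunks))
        + f"\nStatement: {statement}\nAnswer with comma-separated chunk indices."
    )
    coarse_ids = [int(x) for x in llm(chunk_prompt).split(",")
                  if x.strip().isdigit()]

    # Fine stage: sentence-level evidence within the selected chunks only.
    fine_ids: list[int] = []
    for cid in coarse_ids:
        offset = cid * chunk_size
        sent_prompt = (
            "Which sentence indices support the statement?\n"
            + "\n".join(f"<S{offset + j}> {s}"
                        for j, s in enumerate(chunks[cid]))
            + f"\nStatement: {statement}\nAnswer with comma-separated sentence indices."
        )
        fine_ids += [int(x) for x in llm(sent_prompt).split(",")
                     if x.strip().isdigit()]
    return sorted(set(fine_ids))
```

The two-stage structure keeps each prompt short even for very long contexts: the fine-grained query only ever sees the handful of chunks selected in the coarse stage.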

Results and Evaluation
The paper also presents the LongCite-45k dataset, constructed with the CoF pipeline, and two trained models, LongCite-8B and LongCite-9B, which generate an accurate response and fine-grained sentence-level citations in a single output. Evaluation on LongBench-Cite shows that the trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models such as GPT-4. Moreover, supervised fine-tuning (SFT) on LQAC data effectively reduces hallucinations and encourages a more uniform utilization of the context, thereby improving response correctness over vanilla long-context SFT.
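As a concrete illustration of what such SFT data might look like, the snippet below serializes one hypothetical LQAC training instance: a sentence-numbered context, a question, and an answer whose statements carry inline citation spans. The field names and the `<cite>[s-e]</cite>` convention are assumptions, not the dataset's published schema.

```python
import json

# One hypothetical LQAC training instance for supervised fine-tuning.
# A real context would be tens of thousands of tokens; it is shortened
# here for readability.
instance = {
    "context": "<S0> The bridge opened in 1937. <S1> It spans 2.7 km. ...",
    "question": "When did the bridge open, and how long is it?",
    "answer": (
        "The bridge opened in 1937.<cite>[0-0]</cite> "
        "It is 2.7 km long.<cite>[1-1]</cite>"
    ),
}

print(json.dumps(instance, indent=2))
```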

Introducing New Benchmarks and Datasets
In sum, the paper introduces the LongBench-Cite benchmark, evaluates LLMs on Long-Context Question Answering with Citations, proposes the CoF pipeline for automatically generating long-context QA instances with precise sentence-level citations, presents the LongCite-45k dataset, and demonstrates the effectiveness of the approach through the trained LongCite-8B and LongCite-9B models.

Reference: https://arxiv.org/abs/2409.028...