Key Points

1. Introduces ChatQA, a family of conversational QA models that achieve GPT-4-level accuracy without relying on synthetic data from OpenAI GPT models (such as ChatGPT).

2. Proposes a two-stage instruction tuning method and a dataset curation recipe that significantly enhance LLMs' ability to integrate user-provided or retrieved context in zero-shot conversational QA tasks.

3. Demonstrates that fine-tuning a single-turn query retriever on the curated conversational QA data performs comparably to a state-of-the-art LLM-based query-rewriting model, without the extra computation time and potential API costs of rewriting.

4. Conducts comprehensive evaluations on 10 conversational QA datasets and shows that the ChatQA-70B model outperforms GPT-3.5-turbo and performs on par with GPT-4.

5. Shows that incorporating a small number of "unanswerable" samples significantly enhances the model's ability to handle scenarios where answers are unavailable (see the sketch after this list).
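
The paper's exact prompt template and data format are not reproduced in this summary, so the following Python sketch is only a hypothetical illustration of how a stage-two (context-enhanced) instruction-tuning sample might be assembled: the provided or retrieved context is prepended to the multi-turn dialogue, and a small fraction of samples pair an unrelated context with a fixed refusal response so the model learns to decline when the answer is unavailable. The field names, template wording, and refusal string are assumptions, not ChatQA's actual format.

```python
# Hypothetical sketch of assembling context-enhanced instruction-tuning samples.
# The template, field names, and refusal wording are illustrative assumptions,
# not the format used in the ChatQA paper.

from typing import List, Dict

REFUSAL = "Sorry, I cannot find the answer in the given context."

def build_sample(context: str,
                 turns: List[Dict[str, str]],
                 answer: str,
                 answerable: bool = True) -> Dict[str, str]:
    # Flatten the multi-turn dialogue and prepend the context to the prompt.
    history = "\n".join(f"{t['role'].capitalize()}: {t['text']}" for t in turns)
    prompt = (
        "System: Answer the question using only the context below.\n\n"
        f"Context: {context}\n\n"
        f"{history}\nAssistant:"
    )
    # Unanswerable samples use a fixed refusal as the training target.
    return {"prompt": prompt, "response": answer if answerable else REFUSAL}

# Answerable sample: context supports the question.
sample = build_sample(
    context="The Eiffel Tower was completed in 1889.",
    turns=[{"role": "user", "text": "When was the Eiffel Tower completed?"}],
    answer="It was completed in 1889.",
)

# Unanswerable sample: unrelated context, refusal target.
unanswerable = build_sample(
    context="Photosynthesis converts light energy into chemical energy.",
    turns=[{"role": "user", "text": "When was the Eiffel Tower completed?"}],
    answer="",
    answerable=False,
)
```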

Summary

The research paper introduces ChatQA, a family of conversational question answering (QA) models that achieve GPT-4-level accuracy without relying on synthetic data from OpenAI GPT models. The paper proposes a two-stage instruction tuning method and a dataset curation recipe that significantly enhance the model's ability to handle conversational QA tasks. The study also compares fine-tuning a single-turn query retriever on the curated conversational QA data against a state-of-the-art LLM-based query-rewriting model and finds comparable performance, without the extra computation time and potential API costs of rewriting.
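
As a rough illustration of the retriever fine-tuning idea, the sketch below shows one way conversational QA data could be repurposed to train a single-turn dense retriever: the dialogue history and the latest question are concatenated into a single query string and paired with the gold passage, with other passages in a batch serving as in-batch negatives during training. The function and field names here are hypothetical; the paper's actual fine-tuning setup may differ.

```python
# Minimal sketch (not the paper's exact recipe): build (query, positive passage)
# pairs for retriever fine-tuning by flattening the conversation into one query.

from typing import List, Dict, Tuple

def build_retrieval_pair(turns: List[Dict[str, str]],
                         gold_passage: str) -> Tuple[str, str]:
    """Flatten a conversation into a single retrieval query.

    `turns` is a list like [{"role": "user", "text": ...}, ...];
    the field names are illustrative, not taken from the paper.
    """
    query = " ".join(f"{t['role']}: {t['text']}" for t in turns)
    return query, gold_passage

# Example usage with a toy two-turn conversation.
conversation = [
    {"role": "user", "text": "Who founded the company?"},
    {"role": "assistant", "text": "It was founded by Jane Doe."},
    {"role": "user", "text": "When did she start it?"},
]
query, positive = build_retrieval_pair(
    conversation, "Jane Doe started the firm in 1998 after leaving her previous job."
)
print(query)
```

Because the fine-tuned retriever consumes the concatenated dialogue directly at inference time, no separate query-rewriting call is needed, which is where the reported savings in computation time and potential API cost come from.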

Additionally, the paper shows that incorporating a small number of "unanswerable" samples can significantly enhance the model's ability to handle scenarios where answers are unavailable. The proposed method outperforms regular instruction tuning and RLHF-based recipes, and it shows a significant improvement over GPT-3.5-turbo on the unanswerable-case evaluation. The study also investigates the effectiveness of different training datasets and demonstrates the importance of human-annotated data and conversational QA data in improving the model's performance.

Overall, the results highlight the effectiveness of the proposed method in enhancing conversational QA capability, positioning ChatQA as a strong competitor to GPT-4 on these tasks.

Reference: https://arxiv.org/abs/2401.10225