Key Points
1. The paper proposes several efficient router models that dynamically select between a stronger and a weaker large language model (LLM) during inference, optimizing the trade-off between cost and response quality.
2. The authors develop a training framework for these routers that leverages human preference data and data augmentation techniques to enhance performance.
3. Evaluation on widely recognized benchmarks shows that the approach can reduce costs by more than a factor of two in certain cases without compromising response quality.
4. Notably, the router models also demonstrate significant transfer capabilities, maintaining their performance even when the strong and weak models are swapped at test time.
5. The paper formulates the LLM routing problem as an explicit trade-off between cost and response quality.
6. The paper open-sources the code and the preference data used to train the routers.
7. The paper develops evaluation metrics that capture the cost-quality trade-off in LLM routing, namely the call-performance threshold (CPT) and the average performance gap recovered (APGR); a worked sketch of both appears in the evaluation section below.
8. The paper explores different parameterizations of the win prediction model: similarity-weighted ranking, matrix factorization, a BERT classifier, and a causal LLM classifier. A minimal sketch of the classifier variant follows this list.
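As a concrete illustration of that last key point, below is a minimal sketch of the BERT-classifier parameterization of the win prediction model: an encoder that maps a query to the probability that the stronger model's response will be preferred. The backbone checkpoint, label convention, and function name are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a BERT-classifier win predictor (illustrative setup,
# not the paper's exact checkpoint or label convention).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # index 1 = "strong model's answer wins"
)

def p_strong_wins(query: str) -> float:
    """Predicted probability that the stronger LLM's response is preferred."""
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```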
Summary
This research paper presents efficient router models that dynamically select between a stronger and a weaker large language model (LLM) during inference, aiming to optimize the balance between cost and response quality.
Addressing Trade-Offs Between LLM Performance and Cost
The key challenge addressed is the trade-off between the performance and cost of different LLMs. Powerful LLMs like GPT-4 are highly effective but expensive, while smaller models like Mixtral-8x7B are more cost-effective but less capable. The proposed routing approach intelligently routes each query to the appropriate model to maximize quality while minimizing cost, as sketched below.
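Here is a minimal sketch of that routing rule, assuming a win predictor such as the `p_strong_wins` classifier sketched above; the threshold value and model names are placeholders:

```python
# Route to the strong (expensive) model only when the router predicts its
# answer is likely to be preferred. Threshold and model names are illustrative.
STRONG_MODEL = "gpt-4"        # high quality, high cost
WEAK_MODEL = "mixtral-8x7b"   # lower cost, less capable

def route(query: str, threshold: float = 0.5) -> str:
    """Pick a model for the query based on the predicted win probability."""
    if p_strong_wins(query) >= threshold:
        return STRONG_MODEL
    return WEAK_MODEL
```

Raising the threshold pushes more traffic to the cheap model; sweeping it from 0 to 1 traces out the cost-quality curve that the evaluation below measures.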
The researchers develop a training framework for these routers that leverages human preference data and data augmentation techniques. The human preference data captures pairwise comparisons between different LLMs on a variety of queries. The researchers also explore augmenting the training data with golden-labeled datasets like MMLU and with LLM-judge-labeled datasets; an illustrative training step follows.
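As a rough illustration of how the preference data could drive training, the following continues the classifier sketch above (reusing its `model` and `tokenizer`) with a generic cross-entropy step over (query, preferred-model) pairs; the hyperparameters and batching are assumptions, not the paper's exact recipe.

```python
import torch
from torch.optim import AdamW

# Reuses `model` and `tokenizer` from the classifier sketch above.
optimizer = AdamW(model.parameters(), lr=2e-5)  # illustrative learning rate
loss_fn = torch.nn.CrossEntropyLoss()

def train_step(queries: list[str], labels: list[int]) -> float:
    """One gradient step on preference pairs; label 1 = strong model preferred."""
    batch = tokenizer(queries, return_tensors="pt", padding=True, truncation=True)
    loss = loss_fn(model(**batch).logits, torch.tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```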
Evaluation of the Proposed Routing Approach
The evaluation on widely recognized benchmarks like MMLU, MT Bench, and GSM8K shows that the proposed routing approach can reduce costs by more than a factor of two in certain cases without substantially compromising response quality. Notably, the router models also demonstrate significant transfer capabilities, maintaining their performance even when the strong and weak models are changed at test time. The trade-off metrics used to report these results are sketched below.
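For concreteness, here is a hedged sketch of the two trade-off metrics named in the key points, computed from a router's benchmark performance at evenly spaced strong-model call rates; the discretization is an assumption for illustration, not the paper's exact implementation.

```python
def pgr(router_perf: float, weak_perf: float, strong_perf: float) -> float:
    """Performance gap recovered: fraction of the strong-weak gap the router closes."""
    return (router_perf - weak_perf) / (strong_perf - weak_perf)

def apgr(router_perfs: list[float], weak_perf: float, strong_perf: float) -> float:
    """Average performance gap recovered over evenly spaced strong-model call rates."""
    gaps = [pgr(p, weak_perf, strong_perf) for p in router_perfs]
    return sum(gaps) / len(gaps)

def cpt(call_rates: list[float], router_perfs: list[float],
        weak_perf: float, strong_perf: float, target: float = 0.5) -> float:
    """Call-performance threshold: smallest strong-model call rate hitting `target` PGR."""
    for rate, perf in sorted(zip(call_rates, router_perfs)):
        if pgr(perf, weak_perf, strong_perf) >= target:
            return rate
    return 1.0  # target quality only reached by always calling the strong model
```

A CPT(50%) of 0.2, for instance, would mean the router recovers half of the strong-weak quality gap while sending only 20% of queries to the expensive model.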
The key contributions of this work are: 1) Formulating the LLM routing problem to explore the trade-off between cost and response quality, 2) Proposing a router training framework based on human preference data and augmentation techniques, and 3) Demonstrating over 2x cost savings on widely used benchmarks.
Overall, this work provides a promising solution for deploying LLMs in a cost-effective yet high-performance manner by intelligently routing queries to the most appropriate model.
Reference: https://arxiv.org/abs/2406.18665v2