Key Points
1. The paper analyzes the impact of output length on LLM inference pipelines and proposes novel metrics to evaluate models in terms of both correctness and conciseness.
2. It examines the impact of controlling output length through a refined prompt engineering strategy called Constrained-CoT (CCoT), which encourages the model to limit output length.
3. Experiments on pre-trained LLMs demonstrated the benefit of the proposed metrics and the effectiveness of CCoT across different models.
4. Constraining the reasoning of LLaMA2-70b to 100 words improves the accuracy from 36.01% (CoT) to 41.07% (CCoT) on the GSM8K dataset, while reducing the average output length by 28 words.
5. The paper proposes three novel metrics - Hard-k Concise Accuracy (HCA), Soft-k Concise Accuracy (SCA), and Consistent Concise Accuracy (CCA) - to evaluate both the correctness and conciseness of LLM outputs.
6. The CCoT prompt engineering strategy is introduced to encourage LLMs to limit the length of their reasoning, thereby improving their time-predictability.
7. Experiments on various pre-trained LLMs show that the performance of the CCoT method strongly depends on the specific LLM and the type of task.
8. For large models like LLaMA2-70b and Falcon-40b, CCoT can improve both accuracy and response times compared to classic CoT.
9. For smaller models, CCoT has limitations and may reduce accuracy compared to CoT, highlighting the importance of model size and training strategies in the effectiveness of CCoT.
Summary
The research paper explores the impact of output length on large language models (LLMs) and presents a new prompt engineering technique called Constrained Chain-of-Thought (CCoT) to address the issue of lengthy reasoning in model outputs. The paper analyzes the effectiveness of CCoT in controlling output length and evaluates the impact of this approach on model accuracy and response times.

The authors first discuss the advancement of LLMs in solving complex question-answering tasks and the evolution of prompt techniques, particularly chain-of-thought (CoT) prompting, which aims to enhance the explainability and correctness of model outputs by encouraging the articulation of reasoning steps. However, the CoT technique often leads to longer outputs, increasing the time required for the model to generate a response. This is due to the nature of autoregressive transformers, which generate text token by token, running a new inference pass for each generated token, so response time grows roughly linearly with output length. To address this challenge, the paper proposes novel metrics to evaluate the conciseness and correctness of LLM outputs and introduces the CCoT prompt engineering strategy to encourage LLMs to limit output length while controlling the reasoning process.
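The CCoT strategy described above amounts to adding an explicit length constraint to a standard CoT prompt. A minimal sketch is shown below; the exact instruction wording and the helper name `ccot_prompt` are illustrative assumptions, not the paper's verbatim template.

```python
def ccot_prompt(question: str, max_words: int = 100) -> str:
    """Build a Constrained-CoT style prompt.

    Illustrative phrasing only: the paper's idea is to append an explicit
    word limit to the usual step-by-step instruction; the precise wording
    used in the paper may differ.
    """
    return (
        f"Q: {question}\n"
        f"A: Let's think step by step and limit the answer "
        f"to {max_words} words."
    )
```

With `max_words=100` this reproduces the 100-word constraint used in the GSM8K experiments; the limit is a prompt-level request, so the model may still exceed it, which is exactly what the soft metrics below are designed to account for.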
Evaluation of CCoT
The impact of CCoT is evaluated through experiments on various pre-trained LLMs, demonstrating improvements in accuracy and response times for large models and highlighting limitations for smaller models. The experiments reveal that CCoT can effectively reduce generation times and improve accuracy for certain LLMs, while the effectiveness of CCoT strongly depends on the specific LLM and task type. For example, on the GSM8K dataset, constraining the reasoning length to 100 words using CCoT significantly increases accuracy and reduces the average output length for certain models.
Novel Metrics for Evaluation
The paper also introduces three novel metrics to evaluate the correctness of LLM outputs while accounting for the conciseness of the output reasoning, emphasizing the importance of brevity and efficiency. These metrics, namely Hard-k Concise Accuracy (HCA), Soft-k Concise Accuracy (SCA), and Consistent Concise Accuracy (CCA), provide a comprehensive evaluation of the capability of LLMs to provide correct, concise responses.
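One plausible formalization of the first two metrics is sketched below, assuming each evaluated sample is a `(correct, length_in_words)` pair. The hard metric credits only correct answers within the length budget k; the soft variant is assumed here to decay partial credit linearly between k and k + alpha. These are hedged reconstructions of the idea, not the paper's exact definitions.

```python
def hca(results, k):
    """Hard-k Concise Accuracy (reconstruction): an answer counts only
    if it is correct AND its output is at most k words."""
    hits = sum(1 for correct, length in results if correct and length <= k)
    return hits / len(results)

def sca(results, k, alpha):
    """Soft-k Concise Accuracy (reconstruction): correct answers longer
    than k words keep partial credit that decays linearly, reaching
    zero at k + alpha. The linear penalty shape is an assumption."""
    total = 0.0
    for correct, length in results:
        if not correct:
            continue
        if length <= k:
            total += 1.0
        elif length < k + alpha:
            total += 1.0 - (length - k) / alpha
    return total / len(results)
```

For example, with results `[(True, 80), (True, 120), (False, 50)]`, `hca(results, 100)` counts only the first answer, while `sca(results, 100, 40)` also grants the 120-word answer half credit. CCA extends this idea to reward consistency of conciseness across a whole batch of outputs rather than scoring each answer independently.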
Overall, the research findings indicate that the CCoT approach shows promise in improving the efficiency and accuracy of LLM outputs, especially for larger models. The authors suggest that the proposed metrics, together with an explicit emphasis on conciseness in reasoning for question-answering tasks, offer useful guidance for the effective use of CoT prompting and for the future training of LLMs.
Reference: https://arxiv.org/abs/2407.19825