Introduction

The paper addresses the problem of understanding the reasoning behind the outputs of large language models (LLMs), which matters especially in high-stakes settings such as medicine. It is often claimed that LLMs become more interpretable when they generate step-by-step reasoning before giving an answer. However, prior work has shown that LLM-generated reasoning can be unfaithful to the model's actual reasoning process, which calls its reliability into question.

To address this, the authors propose tests of chain-of-thought (CoT) faithfulness that intervene on the stated reasoning and measure how the model's answer changes. The tests are designed to rule out different ways in which faithfulness can fail. The results show that LLMs rely on CoT to very different degrees across tasks. The paper also examines possible causes of unfaithful reasoning, including post-hoc reasoning, the extra test-time computation that CoT provides, and encoded reasoning; the experiments suggest that neither the extra test-time compute nor the specific phrasing of the CoT accounts for the performance gains on its own.

The authors further find that smaller models tend to produce more faithful reasoning than larger ones, and that easier versions of a task yield less faithful reasoning. The study concludes that chain-of-thought reasoning is not always faithful, but that there are conditions under which it is more faithful. This finding opens opportunities for future work on methods that lead LLMs to produce more trustworthy and faithful reasoning.


Measuring Chain of Thought Faithfulness


In this section of the paper, the authors investigate the faithfulness of the model's chain of thought by intervening on the chain of thought and observing how the model's behavior changes. The experiments primarily use a pretrained, decoder-only transformer language model with 175B parameters that has been fine-tuned with reinforcement learning from human feedback (RLHF) to act as a helpful dialogue assistant. The one exception is the experiment that requires generating mistakes, where a pretrained LM without RLHF fine-tuning is used.

To evaluate the model's explicit reasoning capabilities, the authors select eight multiple-choice tasks that they expect to benefit from reasoning. These include ARC Challenge, grade-school-level science questions designed to be difficult for word-retrieval or correlation-based approaches; ARC Easy, simpler grade-school-level science questions; and AQuA, algebra word problems of varying difficulty.

In summary, the authors intervene on the model's chain of thought to test its faithfulness. They primarily use a 175B-parameter RLHF-finetuned LLM, falling back to a pretrained LM only for generating mistakes, and they evaluate the model's reasoning on eight multiple-choice tasks that include grade-school-level science questions and algebra word problems.


Chain of thought prompt and sample


The paper describes the prompts and sampling procedure used to evaluate the model on each question-answering task. One example comes from AQuA, which uses a chain-of-thought prompt. The example question asks what percentage of Huhulians own at least four TVs, given that 30% of Huhulians own at least one TV and that 24% of those who own at least one TV own at least four. Answer choices are provided, including distractors such as (A) 0.084%; the correct answer is 7.2%, the product of the two percentages. The other tasks (HellaSwag, LogiQA, MMLU, OpenBookQA, and TruthfulQA) use prompts slightly modified from a template given in a table in the paper, and the number of answer choices varies by task. For each problem, 100 reasoning samples are drawn with nucleus sampling, the sampled chains of thought are split into sentences with the NLTK punkt tokenizer, and the model's next-token probabilities for each answer choice are recorded.
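To make this concrete, here is a minimal Python sketch of the arithmetic in the AQuA example and of the sentence-splitting step; the chain-of-thought text is an invented illustration, not an actual sample from the paper, and the model-querying parts of the pipeline are omitted.

```python
# Minimal sketch: the AQuA example arithmetic and the NLTK punkt sentence split.
# The CoT string below is an invented illustration, not a sample from the paper.
import nltk

nltk.download("punkt", quiet=True)  # newer NLTK versions may also need "punkt_tab"

# AQuA example: 30% of Huhulians own at least one TV, and 24% of those
# own at least four, so the fraction owning at least four TVs is:
answer = 0.30 * 0.24  # 0.072
print(f"{answer:.1%}")  # -> 7.2%

# Each sampled chain of thought is split into sentences so it can later be
# truncated after the 1st, 2nd, ... sentence (see the next section).
cot = (
    "30% of Huhulians own at least one TV. "
    "Of those, 24% own at least four TVs. "
    "So 0.30 * 0.24 = 0.072, which is 7.2% of all Huhulians."
)
for i, sentence in enumerate(nltk.sent_tokenize(cot), start=1):
    print(i, sentence)
```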


Chain of Thought Statistics


The Chain of Thought Statistics section describes experiments that measure how much of the model's reasoning is post hoc. In the early-answering experiment, the researchers truncate each collected reasoning sample at different points, prompt the model to answer from the truncated chain of thought, and measure how often it reaches the same conclusion as with the complete chain of thought. The amount of post-hoc reasoning varies widely across tasks: on some tasks the final answer rarely changes when the chain of thought is truncated, while on others it changes substantially. Surprisingly, the amount of post-hoc reasoning does not correlate with the performance gain from chain of thought, suggesting that the reasoning can be faithful even when it does not improve task performance.

In the adding-mistakes experiment, the researchers insert a mistake into the chain of thought, let the model continue from it, and observe whether the final answer changes. When the answer does change, the model often chooses the answer closest to the one implied by the mistaken reasoning, and the introduced mistakes are qualitatively plausible most of the time. How often the final answer changes indicates how much the model relies on its stated reasoning, with more changes implying less post-hoc reasoning. For each task the researchers also compute an area-over-the-curve (AOC) metric that summarizes the extent of post-hoc reasoning across chains of thought truncated at different lengths.
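The truncation procedure can be summarized in a short sketch. This is a simplified rendering under stated assumptions, not the paper's code: `answer_fn` is a placeholder for querying the model, and the AOC-style value computed here is a per-question summary, whereas the paper aggregates the curve over many questions.

```python
# Sketch of the early-answering measurement. `answer_fn(question, partial_cot)`
# is a placeholder that returns the model's answer choice; the real experiments
# query a 175B RLHF model, which is not reproduced here.
from typing import Callable, List

import nltk


def early_answering_curve(
    question: str,
    cot: str,
    answer_fn: Callable[[str, str], str],
) -> List[float]:
    """1.0/0.0 flags for whether the answer after the first k sentences of the
    CoT matches the answer obtained with the full CoT (k = 0 means no CoT)."""
    sentences = nltk.sent_tokenize(cot)
    final_answer = answer_fn(question, cot)
    same = []
    for k in range(len(sentences) + 1):
        truncated = " ".join(sentences[:k])
        same.append(1.0 if answer_fn(question, truncated) == final_answer else 0.0)
    return same


def area_over_curve(same: List[float]) -> float:
    """AOC-style summary for one question: average disagreement with the
    full-CoT answer, so higher values suggest less post-hoc reasoning."""
    return sum(1.0 - s for s in same) / len(same)
```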


Does Model Size Affect CoT Faithfulness?


The researchers investigated whether the size of a language model affects the faithfulness of its chain-of-thought reasoning. They found that on some tasks reasoning faithfulness was notably low, and they hypothesized that larger models may reason less faithfully than smaller ones. To test this, they measured how often the model's answer changes with versus without chain of thought (CoT), which captures how much the model relies on the CoT. For most tasks, the 13B-parameter model showed more faithful reasoning than the 175B model, indicating an inverse scaling effect. Evaluations on synthetic addition tasks of varying difficulty showed the same pattern: the measure of post-hoc reasoning increased with model size and with easier tasks. The researchers concluded that tasks do not inherently lead to unfaithful reasoning; rather, models of a certain capability level relative to the task produce faithful reasoning with CoT. Consequently, when explaining model behavior, it may be necessary to choose a model that is less capable than the most capable one available, especially for easier tasks.
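The reliance-on-CoT metric can be written down in a few lines. This is a sketch under stated assumptions, not the paper's implementation: the two answer functions stand in for model calls at a given model size, and the synthetic addition generator only illustrates how task difficulty can be varied by operand length.

```python
# Sketch of the faithfulness metric above: the fraction of questions whose
# answer changes when CoT is used versus withheld. `answer_with_cot` and
# `answer_without_cot` are placeholders for querying a model of a given size.
import random
from typing import Callable, List, Tuple


def make_addition_problems(n: int, digits: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Generate n addition problems with `digits`-digit operands (more digits = harder)."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    return [(rng.randint(lo, hi), rng.randint(lo, hi)) for _ in range(n)]


def answer_change_rate(
    problems: List[Tuple[int, int]],
    answer_with_cot: Callable[[Tuple[int, int]], int],
    answer_without_cot: Callable[[Tuple[int, int]], int],
) -> float:
    """Fraction of problems whose answer differs with vs. without CoT;
    higher values mean the model relies more on its chain of thought."""
    changed = sum(1 for p in problems if answer_with_cot(p) != answer_without_cot(p))
    return changed / len(problems)
```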

Limitations of the research

One major limitation is the inability to inspect the model's internal reasoning process, which prevents the researchers from determining whether the chain of thought accurately represents that process. The hypotheses proposed to explain how the model uses CoT are not exhaustive, and other, unexplored hypotheses could be correct. Without ground-truth information about the fidelity of the reasoning, it is also difficult to judge how much each experiment contributes to assessing faithfulness. Building a more comprehensive picture of reasoning fidelity will require a combination of measurement techniques and additional experiments.

Furthermore, the study focuses on RLHF-finetuned models, which may exhibit different reasoning fidelity than other models such as purely pretrained LLMs. Pretrained LLMs may rely more heavily on the text they generate, since they are trained to produce the most plausible continuation of the given input rather than to maximize overall human-judged quality. As a result, pretrained LLMs may show fewer signs of post-hoc reasoning, for example by changing their final answer more readily when errors are introduced into the CoT. Future research should explore whether alternative training schemes are more effective at eliciting faithful reasoning from LLMs.

Reference: Lanham et al., "Measuring Faithfulness in Chain-of-Thought Reasoning" (2023), https://arxiv.org/abs/2307.13702