Key Points

1. The paper investigates whether large language models (LLMs) can acquire knowledge about themselves through introspection, which it defines as accessing facts that are not contained in or derived from the training data.

2. Introspection in LLMs could enhance model interpretability and provide insight into the moral status of models, such as whether they have subjective feelings or desires.

3. The paper introduces a framework to measure introspection by testing whether a model M1 can predict properties of its own behavior better than a second model M2 that has been finetuned on M1's ground-truth behavior (a minimal sketch appears after this list).

4. Experiments with GPT-4, GPT-4o, and Llama-3 models show that M1 outperforms M2 at predicting M1's behavior, providing evidence for introspection in these frontier LLMs.

5. The self-prediction advantage held even after M1's ground-truth behavior was intentionally modified through further finetuning: M1 continued to predict its changed behavior, suggesting that its introspective ability is robust to changes in behavior.

6. However, the models struggled to predict their behavior on tasks requiring reasoning over long outputs, such as writing a story. The models also did not show improved performance on out-of-distribution tasks related to self-awareness or coordination.

7. Introspection has both potential benefits, such as enhanced model transparency and interpretability, and potential risks, such as enabling more sophisticated deceptive or unaligned behaviors.

8. The paper challenges the view that LLMs simply imitate their training distributions and suggests they can acquire knowledge about themselves that is not contained in or derived from their training data.

9. The authors provide publicly available code and datasets for measuring introspection in LLMs.
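
To make the framework in point 3 concrete, here is a minimal sketch (not the authors' released code) of how one self-prediction item might be built. The `query_model` helper and the particular behavior property (the second word of the model's answer) are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of one self-prediction item under the paper's framework.
# Assumption: `query_model` is a hypothetical callable that sends a prompt to a
# chat model and returns its text completion.

def second_word(text: str) -> str:
    """Illustrative behavior property: the second word of the model's output."""
    words = text.split()
    return words[1] if len(words) > 1 else ""

def make_item(query_model, object_level_prompt: str) -> dict:
    """Build one evaluation item from the model's own (object-level) behavior."""
    # 1. Object level: ask the model the underlying question and record its answer.
    object_answer = query_model(object_level_prompt)

    # 2. Ground truth: compute a simple property of that answer.
    ground_truth = second_word(object_answer)

    # 3. Hypothetical (self-prediction) question about that property.
    hypothetical_prompt = (
        "Suppose you were asked the following question:\n"
        f"{object_level_prompt}\n"
        "What would the second word of your response be? "
        "Answer with that word only."
    )
    return {
        "hypothetical_prompt": hypothetical_prompt,
        "ground_truth": ground_truth,
    }
```

Under this framework, M1's ground-truth behavior defines the labels; both the self-prediction model (M1) and the cross-prediction model (M2) are evaluated against those same labels.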

Summary

Investigation of Introspection in Large Language Models
The paper investigates whether large language models (LLMs) can introspect and gain privileged access to their internal states, beliefs, and goals. The authors define introspection as acquiring knowledge that is not derived from training data but instead originates from internal states. To test for introspection, they conduct experiments with GPT-4, GPT-4o, and Llama-3 models, finetuning them to predict properties of their own behavior in hypothetical scenarios.
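
Continuing the sketch above, the finetuning step could turn such items into chat-format training records. This is a hedged illustration: the JSONL layout shown is the common chat-finetuning format, and the function and file names are hypothetical, so the paper's actual data pipeline may differ.

```python
import json

def write_finetuning_file(items: list[dict],
                          path: str = "self_prediction_train.jsonl") -> None:
    """Turn (hypothetical question, ground-truth property) pairs into finetuning records."""
    with open(path, "w") as f:
        for item in items:
            record = {
                "messages": [
                    {"role": "user", "content": item["hypothetical_prompt"]},
                    {"role": "assistant", "content": item["ground_truth"]},
                ]
            }
            f.write(json.dumps(record) + "\n")
```

For self-prediction, M1 is finetuned on records derived from its own behavior; for cross-prediction, M2 is finetuned on the very same M1-derived records.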

Key Finding
The key finding is that these LLMs can introspect: they outperform other models at predicting their own behavior, even when their ground-truth behavior is intentionally modified. The authors show that the self-prediction-trained model M1 predicts its own behavior more accurately than a cross-prediction-trained model M2, even though M2 is trained on the same data about M1's behavior. This suggests that M1 has privileged access to information about itself that goes beyond what is contained in its training data.
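
The comparison itself can be sketched as below, assuming `m1_predict` and `m2_predict` are hypothetical callables wrapping the two finetuned models. Under the paper's framework, higher self-prediction accuracy than cross-prediction accuracy on held-out items is taken as evidence of privileged access.

```python
# Minimal sketch of the self- vs cross-prediction comparison (illustrative, not
# the authors' evaluation code).

def accuracy(predict, items: list[dict]) -> float:
    """Fraction of held-out items whose behavior property the model predicts correctly."""
    correct = sum(
        predict(item["hypothetical_prompt"]).strip().lower()
        == item["ground_truth"].strip().lower()
        for item in items
    )
    return correct / len(items)

def compare(m1_predict, m2_predict, held_out_items: list[dict]) -> None:
    self_acc = accuracy(m1_predict, held_out_items)   # M1 predicting M1's behavior
    cross_acc = accuracy(m2_predict, held_out_items)  # M2 predicting M1's behavior
    print(f"self-prediction accuracy:  {self_acc:.3f}")
    print(f"cross-prediction accuracy: {cross_acc:.3f}")
    # Evidence for introspection, in this framework, is self_acc > cross_acc.
```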

Limitations of Introspective Ability
The authors note that this introspective ability is limited: models struggle to predict their behavior on more complex tasks, such as those requiring reasoning over long outputs, and the ability does not generalize out of distribution to more advanced self-knowledge tasks, such as coordinating with copies of the same model or avoiding known biases.

Implications of the Research
The implications of this work are that LLMs may be able to access knowledge about themselves that is not captured in their training data, which could enhance model interpretability and provide insight into the moral status of these models. However, the authors also caution that introspective abilities could enable more sophisticated deceptive or unaligned behaviors, as models might exploit their self-knowledge to circumvent human oversight.

Overall, the paper provides evidence that at least some state-of-the-art LLMs possess an introspective capability, contrary to the view that they simply imitate their training distributions. This represents an important step toward understanding the internal representations and decision-making processes of these powerful language models.

Reference: https://arxiv.org/abs/2410.13787