Key Points

1. Activation Patching is a widely used method for identifying causally important nodes in deep neural networks by intervening on the output of a model component during the forward pass (a minimal sketch of this intervention appears after this list).

2. The paper situates its approach alongside causal scrubbing, a generalized algorithm for verifying hypotheses about the internal mechanisms underlying a model's behavior through noising and resample ablation.

3. The research is chiefly concerned with identifying important low-level variables in the computational graph of the model, rather than investigating their semantics or groupings into higher-level variables.

4. Intervening on node activations in the model forward pass is studied as a way of steering models towards desirable behavior.

5. Based on the results, AtP, and in particular the AtP* variant, is recommended across the settings studied: single prompt pairs, AttentionNodes on distributions, and NeuronNodes on distributions.

6. The paper compares attribution patching with alternatives and augmentations, characterizes its failure modes, and presents reliability diagnostics for node patch effect evaluation.

7. AtP* is shown to be a more reliable and scalable approach to estimating node patch effects; the paper analyzes AtP's failure modes of cancellation and saturation and offers mitigations and diagnostic recommendations.

8. The research explores the implications of the contributions for other settings such as circuit discovery, edge localization, coarse-grained localization, and causal abstraction.

9. The work is seen as an important contribution to the field of mechanistic interpretability and aims to advance the development of more reliable and scalable methods for understanding the behavior of deep neural networks.
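
As context for key point 1, here is a minimal sketch of what an activation-patching intervention on a single component can look like in practice. It assumes a PyTorch model whose component of interest is exposed as an `nn.Module`, clean and noise prompts that yield activations of the same shape, and a scalar behavioral metric; all names (`patch_effect`, `metric`, and so on) are illustrative rather than taken from the paper.

```python
import torch

@torch.no_grad()
def patch_effect(model, component, clean_inputs, noise_inputs, metric):
    """Change in `metric` when one component's clean activation is replaced
    by its activation from the noise (counterfactual) prompt."""
    cache = {}

    # 1. Capture the component's activation on the noise prompt.
    handle = component.register_forward_hook(
        lambda mod, inp, out: cache.update(noise_act=out)
    )
    model(**noise_inputs)
    handle.remove()

    # 2. Baseline behavior on the clean prompt, with no intervention.
    baseline = metric(model(**clean_inputs))

    # 3. Clean run again, but overwrite the component's output with the cached
    #    noise activation (returning a value from a forward hook replaces the
    #    module's output).
    handle = component.register_forward_hook(lambda mod, inp, out: cache["noise_act"])
    patched = metric(model(**clean_inputs))
    handle.remove()

    return patched - baseline
```

Repeating this intervention for every node is what makes exhaustive activation patching expensive at scale, which is the cost AtP is designed to avoid.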

Summary

The study investigates Attribution Patching (AtP) and its variant AtP* for localizing Large Language Model (LLM) behavior to model components. AtP is a fast, gradient-based approximation to Activation Patching, but it suffers from significant false negatives. The study proposes AtP*, a variant of AtP with two changes that address these failure modes while retaining scalability. A systematic comparison of AtP with alternative methods for faster activation patching demonstrates that AtP significantly outperforms all of them, with AtP* providing further improvement.
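
The gradient-based approximation at the heart of AtP can be sketched as follows. This is an illustrative simplification, assuming the node's clean activation has been captured as an intermediate tensor in the autograd graph of the behavioral metric; the function and argument names are hypothetical.

```python
import torch

def atp_estimate(metric_clean, node_act_clean, node_act_noise):
    """First-order approximation to the activation-patching effect of replacing
    this node's clean activation with the noise activation:
    (n_noise - n_clean) . dL/dn, evaluated on the clean forward pass."""
    (grad,) = torch.autograd.grad(metric_clean, node_act_clean, retain_graph=True)
    return torch.sum((node_act_noise - node_act_clean) * grad)
```

Because the gradient with respect to every node is available from a single backward pass, estimates for all nodes come at roughly the cost of one forward and one backward pass per prompt pair, which is what makes AtP scalable.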

Introduction to Attribution Patching (AtP)
The paper studies Attribution Patching (AtP), a fast, gradient-based method for approximating the causal attribution of behavior to model components, as an approach to the localization problem in the mechanistic interpretability of Large Language Models (LLMs). The aim is to attribute specific behaviors to individual parts of the LLM transformer forward pass, such as attention heads, neurons, layer contributions, or residual streams.

The study investigates the performance of AtP and identifies two classes of failure modes that produce false negatives: attention saturation and cancellation between gradient paths. To address these, AtP* is proposed, with two changes: recomputing the attention softmax when patching attention queries and keys, and GradDrop, a dropout-like modification of the backward pass that mitigates cancellation. A systematic study of AtP and alternative methods shows that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Additionally, the study proposes a diagnostic method to estimate the residual error of AtP* and to provide statistical bounds on the sizes of any remaining false negatives.
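
To make the first of these changes concrete, here is a rough sketch of the idea behind recomputing the attention softmax when patching a query. Instead of linearizing through the softmax, which is what fails when attention is saturated, the softmax is recomputed exactly with the patched query, and the linear approximation is applied only downstream of the attention probabilities. The sketch assumes a single attention head and hypothetical tensor names; it is not the paper's implementation.

```python
import torch

def atp_star_query_estimate(q_noise, k_clean, attn_probs_clean, metric_clean):
    """AtP*-style estimate for patching one query vector: recompute the
    attention probabilities exactly, then linearize from there onward."""
    d = k_clean.shape[-1]
    # Exact softmax with the patched query (no linearization through it).
    probs_patched = torch.softmax(q_noise @ k_clean.T / d**0.5, dim=-1)
    # Gradient of the metric w.r.t. the clean attention probabilities
    # (attn_probs_clean must be an intermediate tensor in the clean graph).
    (grad,) = torch.autograd.grad(metric_clean, attn_probs_clean, retain_graph=True)
    return torch.sum((probs_patched - attn_probs_clean) * grad)
```

The second change, GradDrop, instead modifies the backward pass: roughly speaking, gradients flowing through one layer's contribution are dropped at a time and the resulting estimates are aggregated, which reduces false negatives caused by cancelling gradient paths.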

Focus on Faster Methods for Attributing LLM Behavior to Model Components
The paper focuses on methods for faster activation patching, attributing model behavior to components, and understanding the internal mechanisms of LLMs. It discusses the challenges of, and proposed solutions for, accurately localizing LLM behavior to individual model components, emphasizing the value of causal attribution when complex behaviors are driven by sparse subgraphs within the model. The study also offers practical guidance on performing causal attribution and on identifying the contributions of individual model components to model behavior.

Exploration of AtP and AtP* in Evaluating Node Patch Effects
The study explores the use of Attribution Patching (AtP) and its variant AtP* for evaluating node patch effects in deep neural networks. The research compares these methods with alternatives and augmentations and characterizes their failure modes, focusing on their reliability and scalability. Notably, the paper discusses the two classes of failure modes of AtP, namely cancellation and saturation, and proposes two changes in AtP* to address these failure modes while retaining scalability.
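
The cancellation failure mode can be illustrated with a toy example (ours, not the paper's): a node feeds two downstream paths whose gradients cancel at the clean activation, so the plain AtP estimate is zero even though the true patch effect is large.

```python
import torch

def metric(n):
    # Two downstream paths from the node `n`: a direct linear path and an
    # indirect nonlinear path. Their gradients cancel at the clean value n = 1.
    path_a = n
    path_b = -n + 5.0 * (n - 1.0) ** 2
    return path_a + path_b

n_clean = torch.tensor(1.0, requires_grad=True)
n_noise = torch.tensor(2.0)

m_clean = metric(n_clean)
(grad,) = torch.autograd.grad(m_clean, n_clean)

atp_estimate = (n_noise - n_clean.detach()) * grad    # 0.0: a false negative
true_effect = metric(n_noise) - m_clean.detach()       # 5.0: the real patch effect

print(atp_estimate.item(), true_effect.item())
```

Dropping the gradient through one path at a time gives nonzero estimates (+1 and -1 here), so the node is no longer missed; this is roughly the intuition behind GradDrop, which drops the gradient flowing through one layer at a time on the backward pass and aggregates the estimates. Saturation, by contrast, arises when the attention softmax sits in a flat region, so its local gradient understates the effect of a large patch; the softmax recomputation in AtP* targets that case.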

Overall, the study provides a comprehensive analysis of Attribution Patching and its variant AtP*, addressing their failure modes and proposing improvements for scalability and reliability. The findings suggest that AtP* can be a valuable method for evaluating node patch effects, with important implications for understanding the behavior of deep neural networks.

Reference: https://arxiv.org/abs/2403.00745