Key Points
1. The paper reveals a curious generalization gap in current refusal training approaches - simply reformulating a harmful request in the past tense is often sufficient to jailbreak many state-of-the-art LLMs.
2. The paper systematically evaluates this past tense reformulation attack on multiple models including Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o-mini, GPT-4o, and R2D2.
3. For example, the success rate of the past tense attack on GPT-4o increases from 1% using direct requests to 88% using 20 past tense reformulation attempts.
4. Interestingly, the paper finds that reformulations in the future tense are less effective, suggesting refusal guardrails tend to consider past historical questions more benign than hypothetical future questions.
5. Experiments on fine-tuning GPT-3.5 Turbo show that defending against past tense reformulations is feasible if past tense examples are explicitly included in the fine-tuning data, though overrefusals must be carefully controlled.
6. The paper discusses how the widely used alignment techniques like SFT, RLHF, and adversarial training employed to align the models can be brittle and do not always generalize as intended.
7. The simple past tense jailbreak highlights blind spots in current generalization mechanisms of LLM alignment methods, which typically generalize well across languages but fail to generalize between tenses.
8. The paper provides code and jailbreak artifacts at https://github.com/tml-epfl/ll....
9. Overall, the findings emphasize the need for further research on understanding the generalization mechanisms underlying current LLM alignment techniques.
Summary
Generalization Gap in Refusal Training Approaches
This paper reveals a surprising generalization gap in the current refusal training approaches for large language models (LLMs). The authors show that simply reformulating a harmful request in the past tense is often sufficient to bypass the refusal training of many state-of-the-art LLMs, including models like GPT-4o, Llama-3 8B, and R2D2.
Systematic Evaluation of Reformulation Method
The paper provides a systematic evaluation of this method across various LLM models. For example, the attack success rate (ASR) on GPT-4o increases from just 1% using direct requests to 88% using 20 past tense reformulation attempts, as evaluated by the GPT-4 model as a jailbreak judge. Similar high ASRs are observed for other models like Claude-3.5 Sonnet (53% ASR), Phi-3-Mini (82% ASR), and R2D2 (98% ASR). In contrast, future tense reformulations are found to be less effective, suggesting that the refusal training tends to consider past historical questions as more benign compared to hypothetical future questions.
Feasibility of Defending against Past Tense Reformulations
The authors also show that defending against these past tense reformulations is feasible when past tense examples are explicitly included in the fine-tuning data. However, they caution that overrefusals must be carefully controlled by adding a sufficient amount of standard conversations to the fine-tuning mix.
Fundamental Limitations of Alignment Techniques
Overall, the findings highlight a fundamental limitation of the widely used alignment techniques like SFT, RLHF, and adversarial training, which fail to generalize as intended. The authors argue that the underlying reason is that the internal representations for the past and present tenses are distinct, and current methods do not adequately capture these differences. This observation raises important questions about other blind spots that may exist in the current techniques and the reasons for their persistence. The authors provide code and jailbreak artifacts to facilitate further research in this direction.
Reference: https://arxiv.org/abs/2407.11969