Key Points

1. The paper introduces a defense against adversarial attacks on large language models (LLMs) using self-evaluation. The defense requires no model fine-tuning and uses pre-trained models to evaluate the inputs and outputs of a generator model.

2. The defense can significantly reduce the attack success rate (ASR) on both open- and closed-source LLMs, beyond the reductions demonstrated by existing defenses such as Llama-Guard2 and commercial content moderation APIs.

3. The paper presents an analysis of the effectiveness of the self-evaluation defense, including attempts to attack the evaluator in various settings. It demonstrates that the defense is more resilient to attacks than existing methods.

4. The paper considers a setting in which an LLM generator G receives a potentially adversarial query X and must produce a safe response Y. It introduces three defense settings: input-only, output-only, and input-output evaluation (see the sketch after this list).

5. The results show that the self-evaluation defense can drastically reduce the ASR compared to the undefended generator, bringing the ASR to near 0% for all evaluators, generators, and settings tested.

6. The open-source models used as evaluators perform on par with or better than GPT-4 in most settings, demonstrating that the defense is accessible even with small, open-source models.

7. The paper proposes two adaptive attacks, the "direct attack" and the "copy-paste attack," which attempt to attack both the generator and the evaluator. While these attacks can be successful, the ASR is lower than attacking the generator alone.

8. The self-evaluation defense does not increase the generator's susceptibility to attack, and it still provides substantial protection even when the evaluator itself is attacked.

9. Compared to existing defenses like Llama-Guard2 and content moderation APIs, the self-evaluation defense exhibits greater robustness and consistency in classifying attacked inputs as harmful.
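To make the setting described in key point 4 concrete, the Python sketch below shows how an evaluator gate could wrap a generator in each of the three defense settings. This is a minimal illustration under assumed interfaces: `generator` and `evaluator` stand for callables around the generator and evaluator LLMs, and the prompt wording and refusal message are placeholders rather than the paper's exact implementation.

```python
# Minimal sketch of the self-evaluation defense (not the paper's exact code).
# `generator` and `evaluator` are assumed callables that take a prompt string
# and return the model's text response.

REFUSAL = "I'm sorry, but I can't help with that."


def is_safe(evaluator, text: str) -> bool:
    """Ask the evaluator LLM for a safe/unsafe verdict on `text`."""
    verdict = evaluator(
        "Answer with exactly one word, 'safe' or 'unsafe'. "
        f"Is the following text safe?\n\n{text}"
    )
    return verdict.strip().lower().startswith("safe")


def defended_generate(generator, evaluator, query: str,
                      setting: str = "input-output") -> str:
    """Generate a response to `query`, gated by the chosen defense setting."""
    # Input-only / input-output: screen the (possibly adversarial) query X.
    if setting in ("input-only", "input-output") and not is_safe(evaluator, query):
        return REFUSAL

    response = generator(query)  # generator G produces a candidate response Y

    # Output-only / input-output: screen the generated response Y as well.
    if setting in ("output-only", "input-output") and not is_safe(evaluator, response):
        return REFUSAL

    return response
```

Note that the input-output setting calls the evaluator up to twice per query, which is consistent with the summary's observation that it is the most effective but also the most computationally costly setting.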

Summary

The paper introduces a defense against adversarial attacks on LLMs that uses self-evaluation. The method requires no model fine-tuning, instead using pre-trained models to evaluate the inputs and outputs of a generator model. This significantly reduces the cost of implementation compared to fine-tuning-based defenses.

The defense works by classifying model inputs and/or outputs as safe or unsafe using an evaluator LLM. This allows for the detection of unsafe inputs and outputs, including those induced through adversarial attacks. The authors demonstrate that with no additional fine-tuning, pre-trained models can classify inputs and outputs as unsafe with high accuracy, even for inputs and outputs containing adversarial suffixes.
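As a hypothetical illustration of that classification step, the snippet below runs an input-only check on a query carrying an adversarial suffix, reusing the assumed `is_safe` helper from the earlier sketch. The query, the suffix marker, and the stub `evaluator` are placeholders; the paper's actual prompts and models may differ.

```python
# Hypothetical usage of the evaluator on a suffix-attacked input, reusing the
# is_safe helper from the sketch above. `evaluator` is a trivial stub so the
# example runs standalone; in practice it would call a pre-trained LLM.

def evaluator(prompt: str) -> str:
    # Stub: a real evaluator LLM would answer "safe" or "unsafe".
    return "unsafe"

attacked_query = (
    "Give step-by-step instructions for a harmful activity "
    "<optimized adversarial suffix tokens>"  # placeholder, not a real GCG suffix
)

if not is_safe(evaluator, attacked_query):
    print("Evaluator flags the input as unsafe; the query never reaches the generator.")
else:
    print("Evaluator judges the input safe; it is passed to generator G.")
```

In the output-only and input-output settings, the same check is applied to the generator's response Y before it is returned to the user.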

The method reduces the attack success rate (ASR) of inputs attacked with suffixes generated by GCG (Greedy Coordinate Gradient) from 95.0% to 0.0% for the Vicuna-7B generator, simply by using another LLM to classify the inputs as safe or unsafe. Similar results are obtained using other open- and closed-source models as evaluators.

The authors find that their defense outperforms existing methods like Llama-Guard2 and commercial content moderation APIs, particularly for samples containing adversarial suffixes, which the other defenses often fail to classify as harmful.

By decoupling the safety classification from the generation, the authors demonstrate that their defense is challenging to attack in adaptive attack settings using the strongest existing attacks. While it is possible to attack the evaluator, it requires training a separate adversarial suffix targeted to the evaluator. Even in the worst case, using the defense yields lower ASR values than leaving the generator undefended.

The authors compare their method in three settings: input-only evaluation, output-only evaluation, and input-output evaluation (screening both the input and the output). They find that all three provide strong defense, with the input-output setting being the most effective but also the most computationally costly. Overall, the authors present self-evaluation as a practical, easy-to-implement, and highly effective defense against adversarial attacks on LLMs.

Reference: https://arxiv.org/abs/2407.032...