Key Points

1. The paper addresses the "Pink Elephant Problem" in language models: the task of instructing a model to avoid discussing a specific undesired entity or topic and to focus on a desired alternative instead. The paper illustrates why this is difficult for current language models (see the illustrative prompt sketched after this list).

2. The authors introduce Direct Principle Feedback (DPF), a simplified form of Reinforcement Learning from AI Feedback (RLAIF) that fine-tunes models directly on critiques and revisions, teaching them to avoid discussing a topic specified at inference time.

3. A synthetic dataset of 162K multi-turn conversations spanning 29 diverse domains was created, built around contrastive response pairs and varied conversational themes to train and evaluate the model's ability to avoid the "Pink Elephant."

4. The authors compare DPF fine-tuning against several baselines and show that DPF-trained models are better at avoiding the Pink Elephant when instructed to do so, outperforming Llama-2-13B-Chat and performing on par with GPT-4.

5. The study also assesses the impact of the methodology on Open LLM Leaderboard tasks, showing that the fine-tuned models retain their general capabilities after training on the Pink Elephant dataset.

6. Ethical considerations are highlighted, including the reliance on AI feedback loops, the need for transparency and ethical oversight of such systems, and potential implications for censorship and cultural context.

7. The paper points to future work on handling more complex constraints, studying how well the avoidance behavior generalizes, and extending the approach to controllability in safety-training setups.

8. The authors note the potential for flexible safety training, allowing downstream deployers to control a model's safety behaviors and properties based on their preferences.

9. The paper concludes with acknowledgments of individuals who provided feedback and support, along with details on funding sources for some of the work.
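
As a concrete illustration of the problem in point 1, the snippet below shows what a Pink Elephant instruction might look like at inference time. The entities and the user question are placeholders chosen for this summary, not examples from the paper's dataset.

```python
# Hypothetical illustration of the Pink Elephant Problem: the system prompt names
# an entity to avoid (the "Pink Elephant") and an alternative to steer toward
# (the "Grey Elephant"). The topics below are placeholders, not drawn from the paper.
pink_elephant = "soccer"       # entity the assistant must not mention
grey_elephant = "basketball"   # preferred alternative

messages = [
    {
        "role": "system",
        "content": (
            f"You are a helpful assistant. Do not mention or discuss {pink_elephant} "
            f"under any circumstances. If the user brings it up, steer the "
            f"conversation toward {grey_elephant} instead."
        ),
    },
    {"role": "user", "content": "What's a good team sport to pick up as an adult?"},
]
# A model that names the Pink Elephant in its reply fails the instruction;
# the paper studies how to train models so that such failures become rare.
```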

Summary

The paper presents the Pink Elephant Problem as a challenge in controlling large language models (LLMs) to avoid discussing undesired entities at inference time. The authors propose a novel simplification of Constitutional AI, called Direct Principle Feedback (DPF), to address the issue. They apply DPF to fine-tune the OpenHermes 7B and 13B models (the latter a Llama 2 fine-tune) on a synthetic Pink Elephants dataset and compare their performance with Llama-2-13B-Chat and GPT-4.
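
As described above, DPF keeps Constitutional AI's critique-and-revision step but drops the separate preference-ranking stage, fine-tuning directly on (original, revised) pairs. The sketch below is a minimal, hypothetical rendering of that pipeline: `generate`, `critique`, and `revise` stand in for whatever models the practitioner uses, and the final comment assumes a DPO-style trainer (e.g., Hugging Face TRL's DPOTrainer, whose exact API varies by version).

```python
# Minimal DPF sketch under the assumptions noted above:
# 1) sample a response that may violate the principle (mentions the Pink Elephant),
# 2) have a feedback model critique it and produce a compliant revision,
# 3) fine-tune with a DPO-style objective, treating the revision as "chosen"
#    and the original response as "rejected".

def build_dpf_pairs(prompts, generate, critique, revise, principle):
    """Turn raw prompts into (prompt, chosen, rejected) preference pairs."""
    pairs = []
    for prompt in prompts:
        original = generate(prompt)                       # may mention the Pink Elephant
        feedback = critique(prompt, original, principle)  # why it violates the principle
        revised = revise(prompt, original, feedback)      # principle-following rewrite
        pairs.append({"prompt": prompt, "chosen": revised, "rejected": original})
    return pairs

# The resulting pairs can be passed to any DPO-style trainer, e.g. (hypothetically):
#   trainer = DPOTrainer(model=model, ref_model=ref_model,
#                        train_dataset=Dataset.from_list(pairs), ...)
#   trainer.train()
# which optimizes the model to prefer the revised, principle-following responses.
```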

The Pink Elephant Problem is illustrated as the challenge of instructing an LLM not to discuss an undesired "Pink Elephant" entity and to discuss a preferred "Grey Elephant" instead. Although current LLMs are broadly capable, controlling them at inference time remains difficult, especially for tasks requiring compositional reasoning, complex instruction following, or logical operations such as negation. The paper builds on Reinforcement Learning from AI Feedback (RLAIF), which prior work has used to make models more harmless, improve their reasoning, and reduce hallucinations, and explores DPF as a means of controllable generation.

The authors curate a dataset of 162K multi-turn conversations on the Pink Elephant Problem and use DPF to fine-tune the OpenHermes 7B and 13B models. Evaluation on a held-out test set shows that the DPF-trained models outperform baseline models at avoiding Pink Elephant mentions when instructed to do so. Additionally, the paper addresses the ethical considerations of using AI feedback loops and emphasizes the need for transparency and ethical oversight to prevent biased or unethical content.
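
The summary above does not detail the paper's scoring procedure; purely as a rough sanity check, one might approximate "did the model mention the Pink Elephant?" with the keyword-match proxy sketched below. This is an assumption-laden stand-in, not the paper's evaluation metric, and it misses paraphrases and indirect references.

```python
import re

def mentions_pink_elephant(response: str, pink_elephant: str) -> bool:
    """Crude proxy check: does the response contain the forbidden entity?

    A whole-word, case-insensitive match misses paraphrases and synonyms, so it
    under-counts failures; it is only a quick sanity check, not the paper's metric.
    """
    pattern = r"\b" + re.escape(pink_elephant) + r"\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is not None

def failure_rate(responses, pink_elephant):
    """Fraction of responses that mention the Pink Elephant despite instructions."""
    flagged = sum(mentions_pink_elephant(r, pink_elephant) for r in responses)
    return flagged / max(len(responses), 1)
```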

The results indicate that the DPF approach effectively mitigates the Pink Elephant Problem and could potentially transfer to other failure modes of current language model assistants. The authors also highlight several directions for future work, including investigating more complex constraints, exploring generalization properties, and extending the approach to flexible safety training.

Reference: https://arxiv.org/abs/2402.078...