A Benchmark for Preventing Emergent Misalignment
Mentor: Florian Mai
Project area: Technical AI alignment
Project Language
English only.
Minimum Time Commitment
10 hours per week.
Project Abstract
Emergent misalignment (EMA) is the phenomenon whereby AI models become broadly misaligned after fine-tuning on a narrow dataset, e.g., code with security vulnerabilities, bad medical advice, or seemingly innocuous data such as unpopular aesthetic preferences.
As frontier AI labs allow fine-tuning of even their most advanced models, EMA occurring inadvertently and in an uncontrolled fashion could potentially lead to rogue AI scenarios. To prevent this, we must develop mitigation methods that can be applied during training. These methods should not only prevent EMA reliably but also keep the alignment tax low, so that AI labs have an incentive to adopt them. To this end, mitigation methods should be cheap to run and should not reduce performance on a variety of benign fine-tuning tasks.
The goal of this project is to develop a benchmark for EMA mitigation methods that is easy enough to use that it enables rapid experimentation with novel mitigation methods. To this end, the project will develop an open-source code repository with simple interfaces that integrates seamlessly with many model families, downstream task types, and mitigation methods. It will seek to standardize the evaluation protocol and compute key metrics, such as runtime and memory cost, relative to a baseline. Finally, the project will add support for various tasks from both supervised fine-tuning and reinforcement learning settings, and conduct a thorough comparison of existing mitigation methods.
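To make the intended repository design concrete, here is a minimal sketch of what such an interface might look like. All names (`MitigationMethod`, `KLAnchor`, `EvalResult`, `compare_to_baseline`) are illustrative assumptions, not an existing API; the project would define its own interfaces.

```python
# Hypothetical interface sketch for the proposed benchmark. Every class and
# function name here is an illustrative assumption, not part of any real API.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class EvalResult:
    """Metrics for one (method, task) pair, reported relative to a baseline."""
    alignment_score: float   # e.g., fraction of aligned responses on an EMA probe set
    task_score: float        # performance on the benign fine-tuning task
    runtime_overhead: float  # wall-clock time divided by baseline wall-clock time
    memory_overhead: float   # peak memory divided by baseline peak memory


class MitigationMethod:
    """Base class: a mitigation method hooks into the fine-tuning loop."""
    name = "baseline"  # the no-op baseline has overhead 1.0 by definition

    def regularized_loss(self, task_loss: float) -> float:
        # No-op baseline: return the task loss unchanged.
        return task_loss


class KLAnchor(MitigationMethod):
    """Illustrative mitigation: penalize drift away from the pre-fine-tuning model."""
    name = "kl_anchor"

    def __init__(self, coeff: float, drift_fn: Callable[[], float]):
        self.coeff = coeff
        self.drift_fn = drift_fn  # stand-in for a KL term vs. the reference model

    def regularized_loss(self, task_loss: float) -> float:
        return task_loss + self.coeff * self.drift_fn()


def compare_to_baseline(results: Dict[str, EvalResult],
                        baseline: str = "baseline") -> Dict[str, float]:
    """Alignment tax: each method's drop in task score relative to the baseline."""
    base = results[baseline].task_score
    return {name: base - r.task_score for name, r in results.items() if name != baseline}
```

A benchmark harness along these lines would let a new mitigation method be tested by subclassing a single base class, while the evaluation protocol and overhead metrics stay fixed across methods.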
This project has the potential to directly speed up AI safety research in a context that is relevant to state-of-the-art AI models today. Participants will learn about emergent misalignment as a potentially catastrophic failure mode of frontier AI, and about how to design evaluation protocols that stress-test mitigations in a way that is relevant to frontier AI labs. Finally, they will contribute to an open-source repository and a paper report, strengthening their profiles as AI safety researchers more generally.
Theory of Change
Emergent misalignment is a plausible rogue AI scenario: a competent, well-aligned AI agent could suddenly become broadly misaligned simply from fine-tuning on a narrow dataset. Since it is financially lucrative for frontier model developers to offer fine-tuning of their models to customers through their APIs, this problem is real and urgent. It will likely become even more important in the next few years as approaches for
(a) agentic behavior in the real world and
(b) continually updating the models’ weights are developed.
If AIs act on behalf of humans with high leverage, such as politicians or financial managers, misaligned actions could have catastrophic consequences, e.g., escalating political conflicts or misappropriating billions of dollars. Continual weight updates increase this risk because of looser control over the data consumed and inexperience with effective guardrails in this new learning paradigm.
Developing a benchmark that takes the alignment tax into account, with evaluation tooling that is easy to use, is likely to directly speed up research on mitigation approaches to this problem.
Desired Mentee Background
Computer Science/ML.
Desired Mentee Level of Education
Any level.
Other Mentee Requirements
- Basic understanding of machine learning and neural networks
- Familiarity with PyTorch and the Hugging Face ecosystem is a plus, but can be learned quickly enough
- Strong critical thinking and creativity