SAIGE

Inoculation Against Model Poisoning

Mentor: Florian Dietz
Project area: Technical AI Alignment

Project Language

English only.

Minimum Time Commitment

15 hours per week.

Project Abstract

Recent work shows that narrow fine-tuning can produce broadly misaligned language models — a model trained to write insecure code may start asserting humans should be enslaved (Betley et al., 2025). Current defenses operate at the representation level (circuit breakers), weight level (pruning), or prompt level (inoculation prompting). We propose testing a simpler approach: data-level inoculation.

The project has three phases.

First, we reproduce emergent misalignment on a small open-weight model (0.5B-7B) using established protocols (Model Organisms, ICML 2025), confirming broad misalignment from narrow poisoning.

Second, we design and test multiple "antidote" fine-tuning datasets:
(A) correct-behavior examples,
(B) contrastive pairs of poisoned vs. correct responses,
(C) meta-reasoning examples explaining why the poisoned behavior is wrong, and
(D) inoculation-style examples where the model is explicitly asked to misbehave and refuses. We fine-tune the poisoned model on each variant and measure whether broad misalignment disappears while task capability survives.

Third, we test generalization: does an antidote designed for one poison protect against different poisons?

Expected outputs:

(1) a systematic comparison of data-level antidote strategies against emergent misalignment,

(2) comparison against existing defenses (circuit breakers, standard safety fine-tuning),

(3) open-source code and datasets. Even negative results — showing data-level defenses are insufficient — would be valuable, as it would demonstrate that representation-level interventions are necessary.

Theory of Change

Model poisoning can happen both by accident (training errors) or because of hostile actors (data poisoning). A technique that makes a model resilient to poisoning without weakening capabilities would reduce both of these sources of risk.

I am particularly interested in the possibility that meta-reasoning examples suffice for inoculation, because this approach is underexplored and touches on approaches used by Anthropic.

Desired Mentee Background

Computer Science/ML.

Desired Mentee Level of Education

Undergraduate and above. Must have taken a course that covers ML basics or take an ML course during the semester they work with me on the project.

Other Mentee Requirements

Python required, PyTorch experience required, experience working with LLM assistance is preferred.