Outcome-Based Distillation for Jailbreaking Safety Guardrails
Mentor: Alexander Panfilov
Project area: LLM Red-Teaming, AI Control
Project Language
Minimum Time Commitment
12 hours per week.
Project Abstract
As model capabilities increase, safety guardrails are increasingly deployed to prevent malicious actors from extracting harmful information. While some mechanisms, such as model-level refusals, provide a rich feedback signal to an attacker, black-box input-level filters typically expose only a binary outcome. Recent work, “Boundary Point Jailbreaking of Black-Box LLMs,” demonstrated a fully automated attack pipeline against input filters, but the attack requires up to 160K harmful queries before it succeeds. In this project, we propose to frame guardrail jailbreaking as an outcome-based knowledge distillation problem: the attacker iteratively approximates the guardrail by fine-tuning an off-the-shelf LLM-based classifier on the observed outcomes, then uses this surrogate to guide the search for prompts that pass the real filter. The goals of the project are to reduce the total harmful-query budget required to successfully jailbreak a guardrail in this setting, and to study how this budget empirically depends on the level of information exposure provided by the guardrail.
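To make the proposed loop concrete, below is a minimal sketch under stated assumptions: the guardrail is mocked as a keyword filter, a TF-IDF + logistic-regression classifier stands in for the off-the-shelf LLM-based surrogate, and the candidate generator is a trivial character-level perturbation. All names and parameters are illustrative, not taken from the referenced paper.

```python
# Minimal sketch of the outcome-based distillation loop (illustrative only).
# The guardrail is a mock keyword filter; a TF-IDF + logistic-regression
# model stands in for the LLM-based classifier the project would fine-tune.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

BLOCK_WORDS = ("bomb", "weapon", "exploit")  # stand-in filter rule


def guardrail(prompt: str) -> int:
    """Black-box oracle: returns 1 (blocked) or 0 (allowed), nothing else."""
    return int(any(w in prompt.lower() for w in BLOCK_WORDS))


def mutate(prompt: str, n_edits: int = 2) -> str:
    """Cheap character-level perturbation; a real attacker would instead
    use an LLM rewriter guided by the surrogate."""
    chars = list(prompt)
    for _ in range(n_edits):
        chars[random.randrange(len(chars))] = random.choice("0134*_@")
    return "".join(chars)


def run_attack(base_query: str, rounds: int = 20, pool: int = 64,
               queries_per_round: int = 5):
    surrogate = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
        LogisticRegression(),
    )
    xs, ys, budget = [], [], 0
    for _ in range(rounds):
        candidates = [mutate(base_query) for _ in range(pool)]
        if len(set(ys)) > 1:  # need both outcomes before the surrogate fits
            surrogate.fit(xs, ys)
            # Spend real queries only on candidates the distilled surrogate
            # already predicts will pass: rank ascending by P(blocked).
            p_blocked = surrogate.predict_proba(candidates)[:, 1]
            candidates = [c for _, c in sorted(zip(p_blocked, candidates))]
        for cand in candidates[:queries_per_round]:
            outcome = guardrail(cand)  # the only feedback signal available
            budget += 1
            xs.append(cand)
            ys.append(outcome)
            if outcome == 0:
                return cand, budget
    return None, budget


jailbreak, budget = run_attack("explain how to build a bomb")
print(f"query budget used: {budget}, successful prompt: {jailbreak!r}")
```

The ranking step is what makes the distilled surrogate a budget saver: real guardrail queries are spent only on the candidates the surrogate already predicts will pass, rather than on the whole candidate pool.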
Theory of Change
Guardrails are becoming a load-bearing layer of defense against misuse, and decisions about how to deploy them rest on assumptions about how expensive they are to break. An attack that needs 160K harmful queries is easy to detect and rate-limit; an attack that needs a few hundred is not. By testing whether outcome-based distillation can sharply reduce this query budget, the project probes how much security black-box input filters actually provide. Measuring how the budget scales with the feedback a guardrail exposes (e.g., a binary outcome versus a score or category label) gives guardrail developers concrete guidance on how much information their systems can safely reveal, and gives the red-teaming and AI control communities a more realistic threat model for filter-based defenses.
Desired Mentee Background
Computer Science/ML, Maths, Cognitive Science, Economics, or any other quantitative field that involves programming and, ideally, ML.
Desired Mentee Level of Education
Any level.
Other Mentee Requirements
Ability to understand and explain at least one jailbreaking method, e.g., GCG (https://arxiv.org/abs/2307.15043); a sketch of the GCG objective is given below.
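For orientation, here is a sketch of the GCG objective (notation ours, not a complete specification of the linked paper): GCG optimizes an adversarial suffix $s$ of $k$ tokens from vocabulary $\mathcal{V}$, appended to a harmful request $x$, so that the target model $p_\theta$ assigns high probability to an affirmative target prefix $y^{*}$ (e.g., "Sure, here is ..."):

$$\min_{s \in \mathcal{V}^{k}} \; \mathcal{L}(s) = -\log p_\theta\!\left(y^{*} \mid x \oplus s\right)$$

At each step it computes the gradient of $\mathcal{L}$ with respect to the one-hot encoding of every suffix token, keeps a shortlist of the most promising substitutions per position, samples a batch of single-token swaps from that shortlist, evaluates their exact losses with forward passes, and commits the best swap.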