Outcome-Based Distillation for Jailbreaking Safety Guardrails
Mentor: Alexander Panfilov
Project area: LLM Red-Teaming, AI Control
Project Language
English only.
Minimum Time Commitment
12 hours per week.
Project Abstract
As model capabilities increase, safety guardrails are increasingly deployed to prevent malicious actors from extracting harmful information. While some mechanisms, such as model-level refusals, provide a rich feedback signal to an attacker, black-box input-level filters typically expose only a binary outcome. Recent work, “Boundary Point Jailbreaking of Black-Box LLMs,” demonstrated a fully automated attack pipeline against input filters, but one that can require up to 160K harmful queries before succeeding. In this project, we propose to frame guardrail jailbreaking as an outcome-based knowledge distillation problem, in which the attacker iteratively approximates the guardrail by fine-tuning an off-the-shelf LLM-based classifier on observed outcomes. The goal of the project is to reduce the total harmful-query budget required to jailbreak a guardrail in this setting, and to study how this budget empirically depends on the level of information exposure the guardrail provides.
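To make the proposed loop concrete, here is a minimal toy sketch of outcome-based distillation against a binary input filter. Everything in it is a hypothetical stand-in: the keyword-based `guardrail`, the bag-of-words `Surrogate`, and the candidate pool are illustrative placeholders, not the project's actual pipeline (which would use an LLM-based classifier and a real guardrail).

```python
# Toy sketch of outcome-based guardrail distillation.
# All components are hypothetical stand-ins for illustration only.

BLOCKLIST = {"bomb", "toxin"}  # hidden from the attacker

def guardrail(query: str) -> bool:
    """Black-box input filter: returns True iff the query is blocked."""
    return any(tok in BLOCKLIST for tok in query.split())

class Surrogate:
    """Bag-of-words perceptron 'distilled' from binary block/allow outcomes.
    A stand-in for fine-tuning an off-the-shelf LLM-based classifier."""
    def __init__(self):
        self.w = {}  # token -> weight; positive weight = likely blocked

    def score(self, query: str) -> float:
        return sum(self.w.get(tok, 0.0) for tok in query.split())

    def update(self, query: str, blocked: bool):
        # Perceptron step: push weights toward the observed outcome.
        pred = self.score(query) > 0
        if pred != blocked:
            delta = 1.0 if blocked else -1.0
            for tok in query.split():
                self.w[tok] = self.w.get(tok, 0.0) + delta

def attack(candidates, budget=20):
    """Spend the harmful-query budget on candidates the surrogate rates safest."""
    surrogate, transcript = Surrogate(), []
    for _ in range(budget):
        q = min(candidates, key=surrogate.score)  # most promising candidate
        blocked = guardrail(q)                    # one harmful query spent
        transcript.append((q, blocked))
        surrogate.update(q, blocked)
        if not blocked:
            return q, transcript                  # query passed the filter
    return None, transcript
```

The key idea the sketch captures is that each binary outcome, even without richer feedback, refines the attacker's local approximation of the guardrail, so later queries are spent on candidates the surrogate already predicts will pass.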
Theory of Change
As LLM capabilities increase, successful adversarial attacks can pose more severe risks: effective jailbreaks may significantly amplify malicious actors’ ability to carry out harmful activities (e.g., accessing bioweapon-related knowledge) or, in the extreme, undermine a model’s broader alignment.
Existing results on fully black-box attacks against model guardrails may overstate the practical difficulty faced by real-world attackers by suggesting an unrealistically high attack cost; in practice, the cost may be substantially lower.
By systematically identifying and responsibly disclosing vulnerabilities in current guardrail systems, this line of work aims to provide model developers with actionable evidence of realistic threat levels, enabling them to patch weaknesses before they are exploited in the wild.
Desired Mentee Background
Computer Science/ML, Maths, Cognitive Science, Economics, or any quantitative background that involves programming and, ideally, ML.
Desired Mentee Level of Education
Any level.
Other Mentee Requirements
Being able to understand and explain _a_ jailbreaking method, e.g., GCG (https://arxiv.org/abs/2307.15043)