SAIGE

Empirical Technical AI Safety Research (topic: flexible. See project abstract)

Mentor: Joschka Braun
Project area: Scalable Oversight, Evaluations, Steering Vectors, Exploration Hacking, AI Control

Project Language

English or German.

Minimum Time Commitment

10 hours per week.

Project Abstract

I am happy to supervise empirical work in technical AI safety in any of the following areas: Scalable Oversight, Evaluations, Steering Vectors, Exploration Hacking (small model organisms, propensity benchmarking), and AI Control. These are broad topics with many open questions, and given the short fellowship timespan (3 months, ~8–10 hours/week), I encourage mentees to propose project ideas they genuinely care about and are motivated to lead. I can provide conceptual and methodological feedback, help scope projects to a feasible size, and offer guidance on experimental design, evaluation, and writing.

Given the tight time constraints, I want to avoid projects that require substantial engineering, infrastructure, or long iteration cycles, such as full RL training runs, complex SFT pipelines, or large-scale steering-vector interventions. I can only supervise such projects if mentees already have prior experience with these methods and access to the necessary compute and tooling. For the fellowship duration, I think maintaining fast iteration speed is crucial for making meaningful progress. Additionally, I prefer not to supervise projects in areas whose research landscape I am not deeply familiar with. In particular, I will not supervise projects focused on mechanistic interpretability or adversarial robustness.

Because my own plans for the summer are not fully settled, I want to be transparent that my supervision style will be light-touch but reliable. I can commit to one weekly team check-in and timely feedback, but mentees should be comfortable driving their own project and taking initiative. In return, I aim to support them in producing a short, well-scoped empirical contribution, ideally something that could be polished into a NeurIPS 2026 workshop submission.

Overall, this “flexible topic” track is intended for mentees who are excited about technical AI safety and eager to take ownership of a small but meaningful research contribution within a focused three-month timeline. I’m happy to discuss ideas and help shape projects that fit both the fellows’ interests and the available supervisory capacity.

Theory of Change

This project will not directly mitigate catastrophic AI risks. Instead, I want to help aspiring AI safety researchers gain the experience and skills needed to eventually contribute meaningfully to the field. By guiding a small empirical project, I aim to help mentees build confidence, learn core research workflows, and clarify their interests. Ideally, this prepares them for more selective programs or early-career safety roles. The main theory of change is talent development: supporting motivated newcomers so they can move closer to positions where they can have real impact in reducing transformative AI risks.

Desired Mentee Background

Computer Science/ML, Maths, Cognitive Science.

Desired Mentee Level of Education

Any level.

Other Mentee Requirements

n/a