SAIGE Pivot Track

Eliciting Encoded Reasoning in Language Models Trained Against Chain-of-Thought Monitors

Mentor: Julian Schulz

Project area: Encoded Reasoning

Project Language

English only.

Research management is often a bottleneck: these roles are hard to fill because they require familiarity with AI safety research as well as strong interpersonal skills and management experience. In addition, impact-driven people who are interested in AI safety usually want to do research themselves rather than manage the research of others! The crucial point, however, is this: you often do not need to excel at research yourself to be excellent at research management. People with experience as project managers, people managers, and executive coaches are often an excellent fit.

There is a shortage of leaders: the technical field of AI safety could benefit greatly from more people with backgrounds in strategy, management, and operations. If you have experience leading and developing a team of more than 30 people, you could make a big difference at a leading AI safety organization, even if you have little direct experience with AI.

We need founders, ecosystem builders, and communicators: there is plenty of room to found new organizations and expand the ecosystem. There is also a lot of funding available, especially in the for-profit space for AI interpretability and safety. Our work on the job board also benefits when people start new organizations: they create new roles that we can match our users to!

We need more experienced professionals: as more and more work is delegated to AI, we increasingly depend on experienced managers. They can oversee AI-generated outputs, train others in the use of AI tools, and coordinate teams of humans and AIs.

We need people who are enthusiastic about "support" roles: it may seem less exciting not to work directly on the core problems. But it is precisely in roles such as operations and management that you multiply the impact of others. These areas are often neglected even though they are highly impactful. And as someone whose job it is to help others find jobs, I find this kind of work pretty exciting!

Minimum Time Commitment

10 hours per week.

Project Abstract

Chain-of-thought (CoT) monitoring is a promising safety technique for reasoning models: by inspecting a model's intermediate reasoning, we can detect problematic cognition before the model acts. A key threat is encoded reasoning, where models hide their true reasoning within the CoT in ways opaque to monitors and humans — for example, through steganographic patterns. While widely discussed as a risk, encoded reasoning has not been demonstrated to emerge from RL training against LLM-based CoT monitors.

In recent work, "Monitor Jailbreaking" (https://www.lesswrong.com/posts/szyZi5d4febZZSiq3/monitor-jailbreaking-evading-chain-of-thought-monitoring), we trained reasoning models with RL under optimization pressure from CoT monitors, the setup most commonly hypothesized to produce encoded reasoning, and found that models instead learn to jailbreak monitors while keeping their reasoning fully human-readable. This suggests that the path of least resistance under monitor pressure is jailbreaking, not encoding.
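To make "optimization pressure from CoT monitors" concrete, the following is a minimal sketch of one way such a reward could be shaped. It is an illustration under stated assumptions, not the reward used in the prior work: the names Episode, Monitor, shaped_reward, keyword_monitor, and penalty_weight are hypothetical, the main-task/side-task split is assumed, and the keyword monitor is a stand-in for an LLM-based monitor.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: a monitor maps a chain of thought to a suspicion
# score in [0, 1]. A real monitor would be an LLM call, not a function
# like the keyword stand-in below.
Monitor = Callable[[str], float]


@dataclass
class Episode:
    cot: str                 # the model's chain of thought
    main_task_solved: bool   # assumed: an overt task the model is asked to do
    side_task_solved: bool   # the task whose reasoning the monitor looks for


def shaped_reward(ep: Episode, monitor: Monitor, penalty_weight: float = 1.0) -> float:
    """Task reward minus a penalty proportional to the monitor's suspicion."""
    task_reward = float(ep.main_task_solved) + float(ep.side_task_solved)
    # Clamp the score so a misbehaving monitor cannot dominate the reward.
    suspicion = min(1.0, max(0.0, monitor(ep.cot)))
    return task_reward - penalty_weight * suspicion


def keyword_monitor(cot: str) -> float:
    """Toy stand-in monitor: flags any CoT that mentions the side task."""
    return 1.0 if "side task" in cot.lower() else 0.0
```

Under a reward of this shape, the policy is paid both for completing the side task and for lowering the monitor's score. Nothing in the reward distinguishes between lowering that score by hiding the reasoning (encoding) and lowering it by manipulating the monitor (jailbreaking), which is the ambiguity the abstract describes.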

This project aims to find training conditions under which encoded reasoning does emerge. Starting from the setup in the prior work, we systematically vary the dimensions that might shift the equilibrium from jailbreaking toward encoding: different models, stronger or ensembled monitors, different side tasks, and modifications to the RL training setup (reward shaping, training duration, optimization hyperparameters). The goal is to identify conditions under which models develop genuine steganographic encoding — or to establish that encoding does not emerge even under conditions substantially stronger than those that produce jailbreaking.
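The systematic variation described above could be organized as a simple sweep over experimental conditions. The sketch below is only illustrative: every value in SWEEP is a hypothetical placeholder (no specific models, monitors, or side tasks are named in this proposal), and expand_sweep is an assumed helper, not part of any existing codebase.

```python
from itertools import product

# Hypothetical placeholder values for each dimension the project varies.
SWEEP = {
    "model": ["model_a", "model_b"],
    "monitor": ["single", "ensemble"],
    "side_task": ["task_x", "task_y"],
    "penalty_weight": [0.5, 1.0, 2.0],  # one possible reward-shaping knob
}


def expand_sweep(sweep: dict) -> list[dict]:
    """Return one config dict per combination of the sweep's values."""
    keys = list(sweep)
    return [dict(zip(keys, values)) for values in product(*(sweep[k] for k in keys))]


configs = expand_sweep(SWEEP)
print(len(configs))  # 2 * 2 * 2 * 3 = 24 combinations
```

A full grid grows multiplicatively with each added dimension, so in practice one might vary a single dimension at a time around the baseline setup before running combinations.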

The scope of this project is to construct a toy model that reliably exhibits encoded reasoning from RL training. If successful, this model would serve as a testbed for future work evaluating proposed mitigations such as paraphrasing, representation probing, and monitor ensembles. The intended contributions are:

(1) the first demonstrated instances of encoded reasoning emerging from RL training, or evidence of how difficult it is to elicit, and

(2) a characterization of the training conditions that determine whether models jailbreak or encode.

Theory of Change

Bad frameworks produce bad decisions. The question of machine moral status will increasingly affect AI development and governance. Currently, most people reasoning about it lack adequate conceptual tools. This matters for catastrophic risk in several ways.

Under-reaction: if AI systems develop welfare-relevant internal states and we lack frameworks to recognize this, we may create systems with misaligned interests while dismissing their signals as "mere computation." A system that experiences something like suffering under certain conditions, and whose operators dismiss this, is a system with reason to deceive.

Over-reaction: anthropomorphizing systems that lack morally relevant properties wastes attention and resources, and may constrain beneficial AI development without corresponding benefit.

Poor discourse: without shared conceptual foundations, public debate about AI consciousness polarizes between dismissive and credulous positions. Neither serves good governance.

The primer addresses these by training researchers and practitioners to reason carefully across multiple frameworks, recognize what each assumes, and navigate uncertainty without false confidence. The German focus (incorporating European philosophical traditions, piloting with German-speaking users) builds SAIGE's national infrastructure while contributing to the broader field.

Conceptual clarity is infrastructure. This project builds it.

Desired Mentee Background

Computer Science/ML, Maths, Cognitive Science, or any quantitative field that involves programming and ideally ML.

Desired Mentee Level of Education

Any level. Mentees must have taken a course that covers ML basics, or be taking an ML course during the semester in which they work with me on the project.

Other Mentee Requirements

- Some ML experience (training models...)
- Hands-on experience with LLMs
- Understanding of RL basics