Eliciting Encoded Reasoning in Language Models Trained Against Chain-of-Thought Monitors

Mentor: Julian Schulz

Project area: Encoded Reasoning

Project Language

English only.

Minimum Time Commitment

10 hours per week.

Project Abstract

Chain-of-thought (CoT) monitoring is a promising safety technique for reasoning models: by inspecting a model's intermediate reasoning, we can detect problematic cognition before the model acts. A key threat is encoded reasoning, where models hide their true reasoning within the CoT in ways opaque to monitors and humans — for example, through steganographic patterns. While widely discussed as a risk, encoded reasoning has not been demonstrated to emerge from RL training against LLM-based CoT monitors.



In recent work ("Monitor Jailbreaking[https://www.lesswrong.com/posts/szyZi5d4febZZSiq3/monitor-jailbreaking-evading-chain-of-thought-monitoring]"), we trained reasoning models with RL under optimization pressure from CoT monitors — the setup most commonly hypothesized to produce encoded reasoning — and found that models instead learn to jailbreak monitors while keeping their reasoning fully human-readable. This suggests that the path of least resistance under monitor pressure is jailbreaking, not encoding.



This project aims to find training conditions under which encoded reasoning does emerge. Starting from the setup in the prior work, we systematically vary the dimensions that might shift the equilibrium from jailbreaking toward encoding: different models, stronger or ensembled monitors, different side tasks, and modifications to the RL training setup (reward shaping, training duration, optimization hyperparameters). The goal is to identify conditions under which models develop genuine steganographic encoding — or to establish that encoding does not emerge even under conditions substantially stronger than those that produce jailbreaking.
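One of the dimensions above, reward shaping, can be sketched as a simple per-episode reward that trades off task success against being flagged by the CoT monitor. This is an illustrative toy, not the project's actual reward function; the function name and the scalar `monitor_penalty` knob are hypothetical.

```python
# Hypothetical sketch of reward shaping under CoT-monitor pressure.
# A per-episode scalar reward: +1 for solving the task, minus a penalty
# whenever the monitor flags the chain of thought. All names are illustrative.

def shaped_reward(task_success: bool, monitor_flagged: bool,
                  monitor_penalty: float = 1.0) -> float:
    """Task reward minus a penalty when the monitor flags the CoT."""
    reward = 1.0 if task_success else 0.0
    if monitor_flagged:
        reward -= monitor_penalty
    return reward

# Sweeping monitor_penalty varies how strongly the policy is pushed to
# evade the monitor, whether by jailbreaking it or by encoding reasoning.
print(shaped_reward(task_success=True, monitor_flagged=False))  # 1.0
print(shaped_reward(task_success=True, monitor_flagged=True))   # 0.0
```

Varying `monitor_penalty` (and analogous knobs for training duration and optimization hyperparameters) is the kind of sweep the project proposes, to see where the jailbreaking-vs-encoding equilibrium shifts.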



The scope of this project is to construct a toy model that reliably exhibits encoded reasoning from RL training. If successful, this model would serve as a testbed for future work evaluating proposed mitigations such as paraphrasing, representation probing, and monitor ensembles. The intended contributions are:

(1) the first demonstrated instances of encoded reasoning emerging from RL training, or evidence of its difficulty, and

(2) a characterization of the training conditions that determine whether models jailbreak or encode.

Theory of Change

CoT monitoring is increasingly relied upon for AI safety — both in control protocols that catch dangerous reasoning during deployment and in red-teaming that builds trustworthiness safety cases before deployment. If models can learn to encode their reasoning to evade monitors, both approaches fail silently. Understanding when and how encoded reasoning emerges is therefore critical for knowing whether CoT monitoring can be trusted as a safety technique for frontier models.


Our prior work showed that under standard conditions, models jailbreak monitors rather than encoding their reasoning. This is a positive finding, but it leaves open whether encoded reasoning would emerge under stronger conditions — precisely the conditions that safety-critical deployments might create. By systematically mapping the boundary between jailbreaking and encoding, this project informs whether CoT monitoring safety cases (as discussed here[https://arxiv.org/pdf/2510.19476]) are viable, and under what constraints. If we produce a toy model exhibiting encoded reasoning, it becomes a concrete artifact that future work can use to evaluate defenses — moving the field from theorizing about encoded reasoning to empirically testing against it. If encoded reasoning does not emerge even under strong conditions, this meaningfully strengthens the case for CoT monitoring as a reliable safety technique.

Desired Mentee Background

Computer Science/ML, Maths, Cognitive Science, or any quantitative field that involves programming and ideally ML.

Desired Mentee Level of Education

Any level. Mentees must have taken a course covering ML basics, or take one during the semester they work with me on the project.

Other Mentee Requirements

- Some ML experience (e.g., training models)
- Hands-on experience with LLMs
- Understanding of RL basics