Fungal Interpretability
Mentor: Lukas Galke Poech
Project area: Interpretability and Multi-Agent Safety
Project Language
English or German.
Minimum Time Commitment
16 hours per week.
Project Abstract
Mechanistic interpretability methods typically analyze neural networks from the outside -- probing, patching, or decomposing representations post hoc. This project proposes a radically different approach inspired by mycorrhizal fungi, which form symbiotic networks with plant root systems to exchange nutrients. We develop "fungal" structures -- lightweight auxiliary networks modeled after hyphal growth -- that grow through a host neural network, guided by information-theoretic signals such as gradient flow, activation patterns, and mutual information between components. Like biological fungi, these structures grow selectively toward high-value regions, branch at decision points, and form persistent connections along important computational pathways. The resulting hyphal network provides a living, adaptive map of the circuits and features that drive model behavior. Key research questions include: What growth rules best recover known circuits? Can fungal structures discover interpretable features that sparse autoencoders miss? And can we formalize desiderata of good explanations -- selectivity, faithfulness, stability -- as ecological constraints on fungal growth? This project bridges biological network theory with mechanistic interpretability, offering a new paradigm for understanding model internals.
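To make the idea of a growth rule concrete, here is a minimal sketch under assumed choices that are not part of the project specification: score each hidden unit of a toy host MLP by the mean |activation x gradient| for a chosen behaviour, and let a placeholder "hyphal map" greedily absorb the highest-scoring units per layer. All names (save_act, hyphal_map, the top-k rule) are illustrative; designing the actual growth rules is exactly what the project sets out to do.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy host network standing in for the model under study.
host = nn.Sequential(nn.Linear(8, 16), nn.ReLU(),
                     nn.Linear(16, 16), nn.ReLU(),
                     nn.Linear(16, 2))

# Capture post-ReLU activations with forward hooks.
acts = {}
def save_act(name):
    def hook(module, inputs, output):
        output.retain_grad()   # keep per-activation gradients for the growth signal
        acts[name] = output
    return hook

host[1].register_forward_hook(save_act("layer1"))
host[3].register_forward_hook(save_act("layer2"))

# One batch of inputs and a scalar "behaviour" to explain (here: the summed class-0 logit).
x = torch.randn(32, 8)
behaviour = host(x)[:, 0].sum()
behaviour.backward()

# Assumed growth signal: mean |activation * gradient| per hidden unit,
# a crude proxy for how much each unit contributes to the behaviour.
scores = {name: (a * a.grad).abs().mean(dim=0) for name, a in acts.items()}

# Placeholder "hyphal map": greedily absorb the top-k units per layer.
# A real fungal structure would instead grow incrementally and branch along high-signal paths.
hyphal_map = {name: torch.topk(s, k=4).indices.tolist() for name, s in scores.items()}
print(hyphal_map)
```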
Theory of Change
As AI systems become more capable, our ability to understand their internal computations becomes critical for preventing catastrophic failures. Current interpretability methods -- such as sparse autoencoders or activation patching -- provide static snapshots of model behavior, but struggle to scale to frontier models with billions of parameters and increasingly complex internal circuits.
Fungal interpretability addresses this scaling challenge through adaptive, resource-efficient exploration of model internals. Rather than exhaustively analyzing every component, fungal structures grow selectively toward the most consequential computational pathways -- precisely those most relevant to dangerous capabilities or deceptive behavior. This targeted approach could enable safety researchers to efficiently identify circuits responsible for hazardous capabilities (e.g., manipulation, deception, or scheming) in models too large for comprehensive analysis.
Furthermore, the persistent, evolving nature of fungal structures enables continuous monitoring: as models are fine-tuned or deployed in new contexts, the fungal network adapts, flagging changes in critical circuits before they manifest as dangerous behavior. This contributes directly to the agenda of scalable oversight -- maintaining interpretability-based safety guarantees even as systems grow beyond human-scale comprehension. Ultimately, this work advances the goal of principled, interpretability-informed control of transformative AI systems.
Desired Mentee Background
Computer Science/ML, Cognitive Science.
Desired Mentee Level of Education
Any level. Mentees must have taken a course covering ML basics, or be taking an ML course during the semester they work with me on the project.
Other Mentee Requirements
Familiarity with some common methods in interpretability would be advantageous, as would the ability to "juggle" relevant model internals (weights, activations, gradients).
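As a rough illustration of what "juggling" these internals involves, the sketch below reads a weight matrix, captures an intermediate activation with a forward hook, and inspects a parameter gradient after a backward pass. The toy model and all names are illustrative only, not part of the project.

```python
import torch
import torch.nn as nn

# Toy model; any small network would do for practising this bookkeeping.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Weights: read directly from the parameter tensors.
w_in = model[0].weight                     # shape (8, 4)

# Activations: captured with a forward hook on an intermediate module.
captured = {}
model[1].register_forward_hook(lambda mod, inp, out: captured.update(hidden=out))

# Gradients: populated on the parameters after a backward pass.
loss = model(torch.randn(16, 4)).pow(2).mean()
loss.backward()
g_in = model[0].weight.grad                # same shape as w_in

print(w_in.shape, captured["hidden"].shape, g_in.shape)
```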