Fungal Interpretability
Mentor: Lukas Galke Poech
Project area: Interpretability and Multi-Agent Safety
Project Language
Minimum Time Commitment
16 hours per week.
Project Abstract
Mechanistic interpretability methods typically analyze neural networks from the outside, probing, patching, or decomposing representations post hoc. This project proposes a radically different approach inspired by mycorrhizal fungi, which form symbiotic networks with plant root systems to facilitate nutrient exchange. We develop "fungal" structures, lightweight auxiliary networks modeled after hyphal growth, that grow through a host neural network, guided by signals such as gradient flow, activation patterns, and mutual information between components. Like biological fungi, these structures grow selectively toward high-value regions, branch at decision points, and form persistent connections along important computational pathways. The resulting hyphal network provides a living, adaptive map of the circuits and features that drive model behavior. Key research questions include: Which growth rules best recover known circuits? Can fungal structures discover interpretable features that sparse autoencoders miss? And can we formalize desiderata of good explanations (selectivity, faithfulness, stability) as ecological constraints on fungal growth? This project bridges biological network theory with mechanistic interpretability, offering a new paradigm for understanding model internals.
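To make the growth mechanism concrete, here is a minimal sketch of what a single growth step could look like under one simple formalization. The names (HyphalNetwork, growth_signal) and the activation-times-gradient score are illustrative assumptions, not the project's settled design: candidate attachment points in the host are scored on a batch, and the hyphal graph "colonizes" the highest-scoring modules.

```python
# Illustrative sketch only: one possible formalization of a "hyphal growth" step.
# All names here are hypothetical, not an established API from this project.
import torch
import torch.nn as nn


def growth_signal(activation: torch.Tensor, gradient: torch.Tensor) -> float:
    """Score a host module by |activation * gradient|, a rough proxy for how much
    that module's output contributes to the loss on the current batch."""
    return (activation.detach() * gradient.detach()).abs().mean().item()


class HyphalNetwork:
    """Grows a set of 'hyphal' attachments toward host modules with high growth signal."""

    def __init__(self, host: nn.Module, budget_per_step: int = 2):
        self.host = host
        self.budget = budget_per_step          # new attachments allowed per growth step
        self.attached: set[str] = set()        # names of colonized host modules
        self._acts: dict[str, torch.Tensor] = {}
        self._grads: dict[str, torch.Tensor] = {}
        self._register_hooks()

    def _register_hooks(self):
        # Treat every Linear layer as a candidate attachment point.
        for name, module in self.host.named_modules():
            if isinstance(module, nn.Linear):
                module.register_forward_hook(
                    lambda m, i, o, n=name: self._acts.__setitem__(n, o))
                module.register_full_backward_hook(
                    lambda m, gi, go, n=name: self._grads.__setitem__(n, go[0]))

    def grow_step(self, inputs: torch.Tensor, targets: torch.Tensor, loss_fn):
        """One growth step: run the host, backpropagate, extend toward top-scoring modules."""
        self.host.zero_grad()
        loss = loss_fn(self.host(inputs), targets)
        loss.backward()
        scores = {n: growth_signal(self._acts[n], self._grads[n])
                  for n in self._acts
                  if n in self._grads and n not in self.attached}
        for name, _ in sorted(scores.items(), key=lambda kv: -kv[1])[: self.budget]:
            self.attached.add(name)             # "colonize" this module
        return scores
```

Under this sketch, repeated calls to grow_step on representative batches would yield an incrementally colonized subgraph of the host; whether such a simple signal actually recovers known circuits is precisely the project's first research question.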
Theory of Change
Bad frameworks produce bad decisions. The question of machine moral status will increasingly affect AI development and governance. Currently, most people reasoning about it lack adequate conceptual tools. This matters for catastrophic risk in several ways.
Under-reaction: if AI systems develop welfare-relevant internal states and we lack frameworks to recognize this, we may create systems with misaligned interests while dismissing their signals as "mere computation." A system that experiences something like suffering under certain conditions, and whose operators dismiss this, is a system with reason to deceive.
Over-reaction: anthropomorphizing systems that lack morally relevant properties wastes attention and resources, and may constrain beneficial AI development without corresponding benefit.
Poor discourse: without shared conceptual foundations, public debate about AI consciousness polarizes between dismissive and credulous positions. Neither serves good governance.
The primer addresses these by training researchers and practitioners to reason carefully across multiple frameworks, recognize what each assumes, and navigate uncertainty without false confidence. The German focus (incorporating European philosophical traditions, piloting with German-speaking users) builds SAIGE's national infrastructure while contributing to the broader field.
Conceptual clarity is infrastructure. This project builds it.
Desired Mentee Background
Computer Science/ML, Cognitive Science.
Desired Mentee Level of Education
Any level. Mentees must have taken a course covering ML basics, or be taking an ML course during the semester they work with me on the project.
Other Mentee Requirements
Familiarity with common interpretability methods would be advantageous, as would the ability to "juggle" relevant model internals (weights, activations, gradients). A rough illustration of what that juggling looks like in practice follows below.
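As a small illustration (the toy two-layer model below is purely hypothetical), this sketch reads a weight matrix, captures a hidden activation, and inspects the gradient flowing back into it:

```python
# Hypothetical toy example of handling weights, activations, and gradients.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(4, 8)

w = model[0].weight                    # weights: directly accessible parameters
h = model[1](model[0](x))              # activation: output of the hidden layer
h.retain_grad()                        # keep its gradient after backward
out = model[2](h).sum()
out.backward()
print(w.shape, h.shape, h.grad.shape)  # gradient of the output w.r.t. the activation
```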