Advancing the Human Empowerment Approach to Safe AI Agents

Mentor: Jobst Heitzig
Project area: Technical AI Safety, Behavioural Game Theory, Formal Ethics, Complexity Economics


Project Language

English only.

Research management is often a bottleneck: these roles are hard to fill because they require familiarity with AI safety research as well as strong interpersonal skills and management experience. Moreover, impact-oriented people interested in AI safety usually want to do research themselves rather than manage the research of others! Crucially, though, you often don't need to excel at research yourself to be excellent at research management. People with experience as project managers, people managers, and executive coaches are often a great fit.

There is a shortage of leaders: the technical field of AI safety could benefit greatly from more people with backgrounds in strategy, management, and operations. If you have experience leading and developing a team of more than 30 people, you could make a big difference at a leading AI safety organization, even with little direct AI experience.

We need founders, ecosystem builders, and communicators: there is plenty of room to found new organizations and expand the ecosystem. There is also a lot of funding available, especially in the for-profit space for AI interpretability and security. Our work on the job board also benefits when people start new organizations: they create new roles we can place our users into!

We need more experienced professionals: as more and more work is delegated to AI, we increasingly depend on experienced managers. They can oversee AI-generated outputs, train others in using AI tools, and coordinate teams of humans and AIs.

We need people who are enthusiastic about "support" roles: it may seem less exciting not to work directly on the core problems. But it is precisely in roles like operations and management that you multiply the impact of others. These areas are often neglected even though they are highly impactful. And as someone whose job is to help others find jobs, I find this kind of work quite exciting!


Minimum Time Commitment

10 hours per week.

Project Abstract

Since early work by Klyubin, Polani, Salge, Du, Dragan, and others, the idea of "empowerment" has been something of a sleeping beauty in AI safety research. In 2025, motivated by Max Harms's idea of "corrigibility as a singular target" and Sen's "capability approach" to welfare, I started to pursue a research agenda around the following hypothesis: AI agents whose sole purpose is to manage the distribution of humans' and AI systems' power in a suitable way can be not only safe but also sufficiently beneficial to become the central paradigm for safe-by-design AI agents. I have presented this agenda in various AI safety seminars and workshops and collected valuable feedback. I am currently running an AI Safety Camp project on it (https://docs.google.com/document/d/1eEZAlMkx-PzsaOAZDEYMBtrlMWoBZg4vnEI0yxDf-As/edit?usp=sharing).

This project is about advancing the human empowerment agenda in one or more of the following ways, depending on the background and interests of the participants (in descending order of priority):

(1) Machine Learning and Numerical Simulation: Scale our current deep learning algorithms to work with larger test environments, and use them for numerical simulation experiments that assess the safety and helpfulness of the resulting AI agent behavior.

(2) LLM-Based Experiments: Perform experiments in which LLM agents are asked to choose actions based on the developed power metrics, either by directly estimating them or by forming discrete situational models (stochastic games) and running our existing backward-induction power-computation algorithms on them.
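To make the second pathway concrete, here is a minimal, hypothetical sketch of the kind of backward-induction computation an LLM agent's situational model could be fed into. The four-state chain environment, the horizon, and the mean-over-goals aggregation are my own illustrative assumptions, not the project's actual algorithm or environments (see the empo repository for those).

```python
# Toy goal-attainment "power" metric on a tiny finite MDP, computed by
# backward induction. Everything here (environment, horizon, aggregation)
# is an illustrative assumption, not the project's actual method.

STATES = range(4)      # four states on a line: 0 -- 1 -- 2 -- 3
ACTIONS = (-1, +1)     # step left / step right

def transition(s, a):
    """Return {next_state: probability}; moves succeed with probability 0.8."""
    s_next = min(max(s + a, 0), len(STATES) - 1)
    return {s: 1.0} if s_next == s else {s_next: 0.8, s: 0.2}

def attainment(goal, horizon):
    """Backward induction: for each start state, the maximum probability
    of reaching `goal` within `horizon` steps."""
    V = [1.0 if s == goal else 0.0 for s in STATES]
    for _ in range(horizon):
        V = [1.0 if s == goal else
             max(sum(p * V[s2] for s2, p in transition(s, a).items())
                 for a in ACTIONS)
             for s in STATES]
    return V

def power(s, horizon=3):
    """One possible power metric: mean goal attainability over all
    candidate goal states."""
    return sum(attainment(g, horizon)[s] for g in STATES) / len(STATES)
```

In this toy chain, the interior state 1 comes out more powerful than the edge state 0 (power(1) ≈ 0.97 vs. power(0) ≈ 0.85 at horizon 3), matching the intuition that a state from which more goals remain attainable confers more power.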

(3) Mathematical Theory: Thoroughly compare the goal-attainment-based power metrics that we derived axiomatically in earlier work with alternative metrics that are not based on goal attainment: the information-theoretic metrics proposed by Klyubin et al., the reachability metrics of Krakovna et al., and the attainable-utility approach of Turner.
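For orientation, the information-theoretic notion this comparison targets is Klyubin et al.'s n-step empowerment: the channel capacity from the agent's next n actions to the resulting state, which is goal-free by construction (the exact notation below is mine):

```latex
\mathfrak{E}_n(s_t) \;=\; \max_{p(a_t^n)} \; I\!\left(A_t^n \,;\, S_{t+n} \,\middle|\, s_t\right)
```

Here $A_t^n$ denotes the sequence of the next $n$ actions and $S_{t+n}$ the state reached afterwards; the maximum is taken over distributions over action sequences, making this a channel capacity rather than anything tied to particular goals.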

(4) Embedding in Social Science and Moral Philosophy: Compare the empowerment approach with the capability approach to welfare, with theories of bounded rationality, individual agency, and (political) power, and with common moral frameworks (consequentialism, deontology, virtue ethics, Daoism, Confucianism, Buddhism), and assess it against them.

Working paper: https://arxiv.org/abs/2508.00159v2
Evolving code-base: https://github.com/pik-gane/empo

Theory of Change

Bad frameworks produce bad decisions. The question of machine moral status will increasingly affect AI development and governance. Currently, most people reasoning about it lack adequate conceptual tools. This matters for catastrophic risk in several ways.

Under-reaction: if AI systems develop welfare-relevant internal states and we lack frameworks to recognize this, we may create systems with misaligned interests while dismissing their signals as "mere computation." A system that experiences something like suffering under certain conditions, and whose operators dismiss this, is a system with reason to deceive.

Over-reaction: anthropomorphizing systems that lack morally relevant properties wastes attention and resources, and may constrain beneficial AI development without corresponding benefit.

Poor discourse: without shared conceptual foundations, public debate about AI consciousness polarizes between dismissive and credulous positions. Neither serves good governance.

The primer addresses these by training researchers and practitioners to reason carefully across multiple frameworks, recognize what each assumes, and navigate uncertainty without false confidence. The German focus (incorporating European philosophical traditions, piloting with German-speaking users) builds SAIGE's national infrastructure while contributing to the broader field.

Conceptual clarity is infrastructure. This project builds it.

Desired Mentee Background

Computer Science/ML, Maths, Economics, Political Science, Philosophy.

Desired Mentee Level of Education

Masters and above.

Other Mentee Requirements

For activity (1): experience with RL, ideally MARL

For activity (2): experience in LLM prompting, knowledge of MDPs/SGs

For activity (3): experience in axiomatic theory / decision theory / social choice theory, knowledge of MDPs/SGs and information theory

For activity (4): knowledge of at least two of the mentioned social science or moral philosophy frameworks