Advancing the Human Empowerment Approach to Safe AI Agents
Mentor: Jobst Heitzig
Project area: Technical AI Safety, Behavioural Game Theory, Formal Ethics, Complexity Economics
Project Language
English.
Minimum Time Commitment
10 hours per week.
Project Abstract
Since early work by Klyubin, Polani, Salge, Du, Dragan, and others, the idea of "empowerment" has been something of a sleeping beauty in AI safety research. In 2025, motivated by Max Harms's idea of "corrigibility as a singular target" and Sen's capability approach to welfare, I started to pursue a research agenda around the following hypothesis: AI agents whose sole purpose is to manage the distribution of humans' and AI systems' power in a suitable way can not only be safe but also sufficiently beneficial to become the central paradigm for safe-by-design AI agents. I have presented this agenda in various AI safety seminars and workshops and collected valuable feedback, and I am currently running an AI Safety Camp project on it (https://docs.google.com/document/d/1eEZAlMkx-PzsaOAZDEYMBtrlMWoBZg4vnEI0yxDf-As/edit?usp=sharing).
This project is about advancing the human empowerment agenda in one or more of the following ways, depending on the backgrounds and interests of the participants (in descending order of priority):
(1) Machine Learning and Numerical Simulation: Scale our deep learning algorithms from the current small test environments to larger ones, and use them for numerical simulation experiments that assess the safety and helpfulness of the resulting AI agent behavior.
(2) LLM-Based Experiments: Perform experiments in which LLM agents are asked to choose actions based on the developed power metrics, either by directly estimating them or by forming discrete situational models (stochastic games) and running our existing backward-induction power computation algorithms on them (minimal sketches of such a computation and of the LLM scaffolding follow after the links below).
(3) Mathematical Theory: Thoroughly compare the goal-attainment-based power metrics that we have previously derived axiomatically with the alternative, goal-independent information-theoretic metrics proposed by Klyubin et al., the reachability metrics of Krakovna et al., and the attainable-utility approach of Turner (a minimal computation of Klyubin-style empowerment is also sketched below).
(4) Embedding in Social Science and Moral Philosophy: Compare and assess the empowerment approach against the capability approach to welfare; theories of bounded rationality, individual agency, and (political) power; and common moral frameworks (consequentialism, deontology, virtue ethics, Daoism, Confucianism, Buddhism).
Working paper: https://arxiv.org/abs/2508.00159v2
Evolving code-base: https://github.com/pik-gane/empo
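To make the backward-induction computation in activity (2) concrete, here is a minimal sketch on a toy five-state chain MDP. It uses an illustrative goal-attainment proxy for power, namely the average optimal probability of reaching each candidate goal within a fixed horizon; this is a stand-in chosen for brevity, not the axiomatically derived metrics of the working paper, and the environment details are likewise assumptions.

```python
# Illustrative goal-attainment "power" proxy on a toy chain MDP,
# computed by backward induction. NOT the project's actual metric.
import numpy as np

N_STATES, N_ACTIONS, HORIZON = 5, 2, 6

# T[a, s, s'] = probability of moving from s to s' under action a.
T = np.zeros((N_ACTIONS, N_STATES, N_STATES))
for s in range(N_STATES):
    left, right = max(s - 1, 0), min(s + 1, N_STATES - 1)
    T[0, s, left] += 0.9   # "left" succeeds with prob 0.9 ...
    T[0, s, s] += 0.1      # ... and stalls with prob 0.1
    T[1, s, right] += 0.9
    T[1, s, s] += 0.1

def reach_probability(goal: int) -> np.ndarray:
    """Optimal probability of reaching `goal` within HORIZON steps,
    for every start state, computed by backward induction."""
    v = np.zeros(N_STATES)
    v[goal] = 1.0                 # being at the goal counts as success
    for _ in range(HORIZON):
        v = (T @ v).max(axis=0)   # pick the action best for this goal
        v[goal] = 1.0             # success is absorbing
    return v

def power(s: int) -> float:
    """Power proxy: mean optimal attainment probability over all
    candidate goals (here: every state is a candidate goal)."""
    return float(np.mean([reach_probability(g)[s] for g in range(N_STATES)]))

for s in range(N_STATES):
    print(f"state {s}: power ~ {power(s):.3f}")
```

As one would expect, in this chain the proxy is largest for the middle state, from which every goal is quickly reachable, and smallest at the two ends.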
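The LLM-based pipeline of activity (2) could then look roughly as follows. `call_llm` and the JSON schema are hypothetical placeholders, not the project's actual interface; the sketch only shows how a natural-language situation might be turned into a discrete model that the routine above can score.

```python
# Sketch of activity (2): have an LLM formalize a situation as a small
# discrete model, then score it by backward induction as above.
import json

PROMPT = (
    "Formalize the following situation as a small MDP. Respond with JSON "
    'only, using keys "states" (list of names), "actions" (list of names) '
    'and "transitions" (mapping state -> action -> {next_state: probability}). '
    "Situation: "
)

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: wire up an actual chat-completion client."""
    raise NotImplementedError

def situational_model(situation: str) -> dict:
    """Elicit a discrete situational model from the LLM and parse it."""
    model = json.loads(call_llm(PROMPT + situation))
    assert {"states", "actions", "transitions"} <= set(model)
    return model

# The parsed transitions can then be indexed into an array like T above
# and handed to the same backward-induction power computation.
```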
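For the comparison in activity (3), the classical information-theoretic empowerment of Klyubin et al. can be computed on the same toy environment as the channel capacity between an agent's actions and its next state, using the standard Blahut-Arimoto iteration. The sketch below restates the environment so it runs on its own and computes one-step empowerment only; the n-step case would use open-loop action sequences as the channel input.

```python
# One-step information-theoretic empowerment (Klyubin et al.) on the
# same toy chain MDP: the channel capacity max_{p(a)} I(A; S'),
# computed with the standard Blahut-Arimoto iteration.
import numpy as np

N_STATES, N_ACTIONS = 5, 2
T = np.zeros((N_ACTIONS, N_STATES, N_STATES))  # T[a, s, s']
for s in range(N_STATES):
    left, right = max(s - 1, 0), min(s + 1, N_STATES - 1)
    T[0, s, left] += 0.9
    T[0, s, s] += 0.1
    T[1, s, right] += 0.9
    T[1, s, s] += 0.1

def _kl_rows(chan: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Row-wise KL divergence D(chan[a] || q), in nats."""
    mask = chan > 0
    ratio = np.where(mask, chan / np.where(q > 0, q, 1.0), 1.0)
    return np.sum(np.where(mask, chan * np.log(ratio), 0.0), axis=1)

def empowerment(s: int, iters: int = 300) -> float:
    """Capacity of the channel p(s'|a) = T[a, s, :] via Blahut-Arimoto."""
    chan = T[:, s, :]
    p = np.full(N_ACTIONS, 1.0 / N_ACTIONS)     # action distribution
    for _ in range(iters):
        q = p @ chan                            # marginal next-state distribution
        p = p * np.exp(_kl_rows(chan, q))       # reweight informative actions
        p /= p.sum()
    return float(p @ _kl_rows(chan, p @ chan))  # I(A; S') at the optimum

for s in range(N_STATES):
    print(f"state {s}: empowerment ~ {empowerment(s):.3f} nats")
```

The two quantities answer different questions: the goal-attainment proxy measures how well a goal-directed agent could do on average, while Klyubin-style empowerment measures only how much influence actions have on the next state, irrespective of any goals. Making such differences precise is the point of activity (3).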
Theory of Change
If AI agents whose sole purpose is to manage the distribution of humans' and AI systems' power can be shown to be both safe and sufficiently beneficial, human empowerment could become a central paradigm for safe-by-design AI agents. The four activities above advance this hypothesis along complementary fronts: scaled-up simulations (1) and LLM-based experiments (2) test empirically whether empowerment-managing agents behave safely and helpfully, while the theoretical comparison (3) and the embedding in social science and moral philosophy (4) clarify how the approach relates to existing safety metrics, welfare theories, and moral frameworks. Each strand either strengthens the case for the paradigm or surfaces its limitations early.
Desired Mentee Background
Computer Science/ML, Maths, Economics, Political Science, Philosophy.
Desired Mentee Level of Education
Master's and above.
Other Mentee Requirements
For activity (1): experience with RL, ideally MARL
For activity (2): experience in LLM prompting, knowledge of MDPs/SGs
For activity (3): experience in axiomatic theory / decision theory / social choice theory, knowledge of MDPs/SGs and information theory
For activity (4): knowledge of at least two of the mentioned social science or moral philosophy frameworks