Advancing the Human Empowerment Approach to Safe AI Agents

Mentor: Jobst Heitzig
Project area: Technical AI Safety, Behavioural Game Theory, Formal Ethics, Complexity Economics


Project Language

English only.

Minimum Time Commitment

10 hours per week.

Project Abstract

Since early work by Klyubin, Polani, Salge, Du, Dragan, and others, the idea of "empowerment" has been something of a sleeping beauty in AI safety research. In 2025, motivated by Max Harms's idea of "corrigibility as a singular target" and Sen's "capability approach to welfare", I started to pursue a research agenda around the following hypothesis: AI agents whose sole purpose is to manage the distribution of humans' and AI systems' power in a suitable way can not only be safe but also sufficiently beneficial to become the central paradigm for safe-by-design AI agents. I have presented this agenda in various AI safety seminars and workshops and collected valuable feedback. I currently run an AI Safety Camp project on it (https://docs.google.com/document/d/1eEZAlMkx-PzsaOAZDEYMBtrlMWoBZg4vnEI0yxDf-As/edit?usp=sharing).

This project is about advancing the human empowerment agenda in one or more of the following ways, depending on the background and interests of the participants (listed in descending priority):

(1) Machine Learning and Numerical Simulation: Scale our current deep learning algorithms to work with larger test environments, and use them for numerical simulation experiments that assess the safety and helpfulness of the resulting AI agent behavior.

(2) LLM-Based Experiments: Perform experiments in which LLM agents are asked to choose actions based on the developed power metrics, either by estimating them directly or by forming discrete situational models (stochastic games) and running our existing backward-induction power-computation algorithms on them.
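
To give a flavor of what a backward-induction power computation might look like, here is a deliberately minimal sketch. The specific metric (fraction of states an agent can reach within a fixed horizon) and the tiny deterministic environment are simplifying assumptions for illustration only; they are not the project's actual metrics or code.

```python
# Hypothetical sketch: a goal-attainment-based power metric on a tiny
# deterministic MDP, computed by backward induction over the remaining
# horizon. States are positions 0..3 on a line.

STATES = list(range(4))
ACTIONS = (-1, 0, 1)   # move left, stay, move right
HORIZON = 2

def step(s, a):
    """Deterministic transition: move along the line, clamped to the ends."""
    return min(max(s + a, 0), 3)

# reach[s] = set of states attainable from s within t steps (t = 0 initially)
reach = {s: {s} for s in STATES}
for _ in range(HORIZON):  # backward induction over remaining steps
    reach = {s: set().union(*(reach[step(s, a)] for a in ACTIONS))
             for s in STATES}

def power(s):
    """Power of state s: fraction of all goal states attainable within HORIZON."""
    return len(reach[s]) / len(STATES)

# A central state affords more attainable goals than a corner state:
print(power(1), power(0))  # → 1.0 0.75
```

In an LLM-based experiment along the lines of activity (2), the model would either estimate such scores directly or first elicit the transition structure and then have them computed exactly, as above.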

(3) Mathematical Theory: Thoroughly compare the goal-attainment-based power metrics that we derived axiomatically in earlier work with alternative metrics not based on goal attainment: the information-theoretic metrics proposed by Klyubin et al., the reachability metrics of Krakovna et al., and the attainable-utility approach of Turner.

(4) Embedding in Social Science and Moral Philosophy: Compare and assess the empowerment approach in relation to the capability approach to welfare; theories of bounded rationality, individual agency, and (political) power; and common moral frameworks (consequentialism, deontology, virtue ethics, Daoism, Confucianism, Buddhism).

Working paper: https://arxiv.org/abs/2508.00159v2
Evolving code-base: https://github.com/pik-gane/empo

Theory of Change

If successful, the empowerment approach promises to prevent two major causes of catastrophic risk: over-optimization of misaligned objectives, and human disempowerment.


It would do so in the most direct way:

- By explicitly tasking the AI agent to pursue human empowerment rather than disempowerment.

- And by not using this objective function for optimization, but rather treating the human power metrics as criteria for non-maximizing decision rules such as satisficing, soft-max, or quantilization above a rising floor.
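
To make the contrast with optimization concrete, here is a minimal sketch of such non-maximizing decision rules. The action names, scores, and parameter values are purely illustrative assumptions, not the project's actual implementation:

```python
# Hypothetical sketch: three non-maximizing decision rules applied to
# illustrative power scores assigned to candidate actions.

import math
import random

scores = {"a": 0.2, "b": 0.5, "c": 0.9, "d": 0.7}  # illustrative power metrics

def satisfice(scores, floor):
    """Satisficing: pick any action whose score meets the aspiration floor."""
    good_enough = [a for a, v in scores.items() if v >= floor]
    return random.choice(good_enough)

def soft_max(scores, temperature=1.0):
    """Soft-max: sample actions with probability proportional to exp(score/T)."""
    weights = [math.exp(v / temperature) for v in scores.values()]
    return random.choices(list(scores), weights=weights)[0]

def quantilize(scores, q):
    """Quantilization: sample uniformly from the top q-fraction of actions."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:max(1, int(q * len(ranked)))]
    return random.choice(top)

print(satisfice(scores, floor=0.6))  # "c" or "d"
print(quantilize(scores, q=0.5))     # "c" or "d"
```

Unlike an argmax over the scores, none of these rules relentlessly pushes toward the single highest-scoring action, which is what makes them candidates for avoiding over-optimization; a "rising floor" would correspond to gradually raising the satisficing threshold over time.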

Desired Mentee Background

Computer Science/ML, Maths, Economics, Political Science, Philosophy.

Desired Mentee Level of Education

Master's and above.

Other Mentee Requirements

For activity (1): experience with RL, ideally MARL

For activity (2): experience in LLM prompting, knowledge of MDPs/SGs

For activity (3): experience in axiomatic theory / decision theory / social choice theory, knowledge of MDPs/SGs and information theory

For activity (4): knowledge of at least two of the mentioned social science or moral philosophy frameworks