SAIGE Pivot Track

Detecting and Mitigating Language-Triggered Value Instability in Multilingual LLMs

Mentors: Ajay Mandyam Rangarajan, Jeyashree Krishnan, Jason Hoelscher-Obermaier
Project area: Technical AI Safety

Project Language

English only.

Forschungsmanagement ist oft ein Engpass: Diese Rollen sind schwer zu besetzen, da sie sowohl Vertrautheit mit der AI Safety-Forschung als auch starke zwischenmenschliche Fähigkeiten und Managementerfahrung erfordern. Zudem wollen wirkungsorientierte Menschen, die sich für AI Safety interessieren, meist selbst forschen – anstatt die Forschung anderer zu managen! Entscheidend ist jedoch: Du musst oft selbst nicht in der Forschung brillieren, um exzellent im Forschungsmanagement zu sein. Menschen mit Erfahrung als Projektmanager:innen, People Manager:innen und Executive Coaches eignen sich oft hervorragend dafür.

Es mangelt an Führungskräften: Das technische Feld der AI Safety könnte sehr von mehr Menschen mit Hintergrund in Strategie, Management und Operations profitieren. Wenn du Erfahrung darin hast, ein Team von mehr als 30 Personen zu leiten und weiterzuentwickeln, könntest du in einer führenden AI Safety-Organisation einen großen Unterschied machen – auch wenn du wenig direkte Erfahrung mit KI hast.

Wir brauchen Gründer:innen, Gestalter:innen des Ökosystems und Kommunikator:innen: Es gibt viel Raum, um neue Organisationen zu gründen und das Ökosystem zu erweitern. Zudem gibt es viel verfügbare Finanzierung, besonders im gewinnorientierten Bereich für AI Interpretability und Sicherheit. Unsere Arbeit am Job Board profitiert ebenfalls davon, wenn Leute neue Organisationen starten: Sie schaffen neue Rollen, auf die wir unsere Nutzer:innen vermitteln können!

Wir brauchen mehr Berufserfahrene: Da immer mehr Arbeit an KI delegiert wird, sind wir zunehmend auf erfahrene Manager:innen angewiesen. Sie können KI-generierte Ergebnisse (Outputs) überwachen, andere im Umgang mit KI-Tools schulen und Teams aus Menschen und KIs koordinieren.

Wir brauchen Menschen, die sich für „Support“-Rollen begeistern: Es mag weniger aufregend wirken, nicht direkt an den Kernproblemen zu arbeiten. Doch gerade in Rollen wie Operations und Management vervielfachst du den Impact anderer. Diese Bereiche werden oft vernachlässigt, obwohl sie sehr wirkungsvoll sind. Und als jemand, dessen Job es ist, anderen zu Jobs zu verhelfen, finde ich diese Art von Arbeit ziemlich spannend!

Minimum Time Commitment

10 hours per week.

Project Abstract

Large Language Models (LLMs) are deployed globally across linguistic and cultural contexts, yet recent evidence suggests that prompt language alone can cause dramatic shifts in expressed values and ideological positions (Rangarajan & Krishnan, 2026). Prior work demonstrates that the same model can produce contradictory stances across languages, indicating a form of language-mediated manipulation and value instability that poses risks for trust, governance, and alignment (Rangarajan & Krishnan, 2026).

This project will replicate, stress-test, and extend the “Language as a Manipulation Vector” framework (Rangarajan & Krishnan, 2026) by systematically evaluating value consistency across languages, models, prompting conditions, and additional survey instruments developed by our team beyond the World Values Survey. Using the World Values Survey (WVS) dimensions as a baseline alongside our extended surveys, we will:

1. Scale evaluation to frontier and open models (including larger parameter sizes where accessible) to test whether instability increases with capability.

2. Identify causal drivers of value shifts, including tokenizer effects, translation artifacts, cultural priors in training data, and instruction tuning.

3. Develop robustness metrics for cross-lingual value consistency and manipulation susceptibility.

4. Prototype mitigation strategies, such as consistency regularization prompts, cross-language self-critique, and value-anchoring techniques.

Methodologically, teams will build an automated multilingual evaluation pipeline, conduct controlled experiments across languages, and analyze value drift under adversarial prompting and context manipulation.

Expected outputs include:
• An open evaluation benchmark for language-triggered value instability
• Empirical analysis of manipulation risks in multilingual deployment
• Mitigation recommendations for developers and policymakers
• A research report suitable for submission to an AI Safety venue

This project directly addresses catastrophic risk pathways involving deceptive alignment, sycophancy, and large-scale persuasion systems by improving our ability to detect and constrain manipulation via language cues.

[1] Rangarajan, A. M., & Krishnan, J. (2026). (HckPrj) Language as a Manipulation Vector: Detecting Ideological Bias and Value Instability in Multilingual LLMs. Apart Research.

Theory of Change

Bad frameworks produce bad decisions. The question of machine moral status will increasingly affect AI development and governance. Currently, most people reasoning about it lack adequate conceptual tools. This matters for catastrophic risk in several ways.

Under-reaction: if AI systems develop welfare-relevant internal states and we lack frameworks to recognize this, we may create systems with misaligned interests while dismissing their signals as "mere computation." A system that experiences something like suffering under certain conditions, and whose operators dismiss this, is a system with reason to deceive.

Over-reaction: anthropomorphizing systems that lack morally relevant properties wastes attention and resources, and may constrain beneficial AI development without corresponding benefit.

Poor discourse: without shared conceptual foundations, public debate about AI consciousness polarizes between dismissive and credulous positions. Neither serves good governance.

The primer addresses these by training researchers and practitioners to reason carefully across multiple frameworks, recognize what each assumes, and navigate uncertainty without false confidence. The German focus (incorporating European philosophical traditions, piloting with German-speaking users) builds SAIGE's national infrastructure while contributing to the broader field.

Conceptual clarity is infrastructure. This project builds it.

Desired Mentee Background

Computer Science/ML, Maths, Anything quantitative that involves programming and ideally ML.

Desired Mentee Level of Education

Masters and above.

Other Mentee Requirements

We welcome applicants from technical, interdisciplinary, or policy backgrounds with a strong interest in AI Safety. Prior experience in multilingual NLP or alignment research is beneficial but not strictly required.

Required competencies:
• Proficiency in Python and familiarity with machine learning workflows
• Ability to work with APIs and open-source LLMs (e.g., Hugging Face ecosystem)
• Strong analytical skills and comfort working with experimental data
• Ability to read and implement research papers

Preferred (but not required):
• Experience in evaluation of LLMs
• Familiarity with prompting techniques, benchmarking, or red-teaming
• Knowledge of statistics or experimental design
• Interest in AI alignment, safety, governance, or societal impacts
• Experience with survey design, social science methods, or cross-cultural research

Language and domain considerations:
• Fluency in English required
• Proficiency in at least one additional language (especially German, French, Chinese, Korean, or other widely spoken languages) is highly valuable due to the multilingual nature of the project

Collaboration and commitment:
• Ability to commit at least 8–10 hours per week (more preferred for research-focused roles)
• Willingness to work in a small, collaborative team
• Comfort with independent problem solving in an open-ended research setting

Location:
• Open to mentees based in Germany (preferred by the program). Remote participation is possible; no specific city requirement.