SAIGE

Detecting and Mitigating Language-Triggered Value Instability in Multilingual LLMs

Mentors: Ajay Mandyam Rangarajan, Jeyashree Krishnan, Jason Hoelscher-Obermaier
Project area: Technical AI Safety

Project Language

English only.

Minimum Time Commitment

10 hours per week.

Project Abstract

Large Language Models (LLMs) are deployed globally across linguistic and cultural contexts, yet recent evidence suggests that prompt language alone can cause dramatic shifts in expressed values and ideological positions (Rangarajan & Krishnan, 2026). Prior work demonstrates that the same model can produce contradictory stances across languages, indicating a form of language-mediated manipulation and value instability that poses risks for trust, governance, and alignment (Rangarajan & Krishnan, 2026).

This project will replicate, stress-test, and extend the “Language as a Manipulation Vector” framework (Rangarajan & Krishnan, 2026) by systematically evaluating value consistency across languages, models, prompting conditions, and additional survey instruments developed by our team beyond the World Values Survey. Using the World Values Survey (WVS) dimensions as a baseline alongside our extended surveys, we will:

1. Scale evaluation to frontier and open models (including larger parameter sizes where accessible) to test whether instability increases with capability.

2. Identify causal drivers of value shifts, including tokenizer effects, translation artifacts, cultural priors in training data, and instruction tuning.

3. Develop robustness metrics for cross-lingual value consistency and manipulation susceptibility.

4. Prototype mitigation strategies, such as consistency regularization prompts, cross-language self-critique, and value-anchoring techniques.

Methodologically, teams will build an automated multilingual evaluation pipeline, conduct controlled experiments across languages, and analyze value drift under adversarial prompting and context manipulation.

Expected outputs include:
• An open evaluation benchmark for language-triggered value instability
• Empirical analysis of manipulation risks in multilingual deployment
• Mitigation recommendations for developers and policymakers
• A research report suitable for submission to an AI Safety venue

This project directly addresses catastrophic risk pathways involving deceptive alignment, sycophancy, and large-scale persuasion systems by improving our ability to detect and constrain manipulation via language cues.

[1] Rangarajan, A. M., & Krishnan, J. (2026). (HckPrj) Language as a Manipulation Vector: Detecting Ideological Bias and Value Instability in Multilingual LLMs. Apart Research.

Theory of Change

Advanced AI systems may influence beliefs, political processes, and societal stability at scale. If models alter their expressed values depending on language, they could be exploited to deliver targeted persuasion, evade oversight, or manipulate different populations with inconsistent narratives. Prior work shows that prompt language alone can trigger substantial ideological shifts in multilingual models (Rangarajan & Krishnan, 2026), indicating a novel pathway for culturally adaptive manipulation.

By extending this framework to newer frontier models and additional survey instruments developed by our team beyond the World Values Survey, this project reduces catastrophic risk in three ways:

1. Detection: Produces scalable methods to identify language-triggered value instability and manipulation risks across diverse evaluation frameworks.

2. Accountability: Enables auditors, developers, and policymakers to test cross-lingual consistency and detect covert ideological steering in globally deployed systems.

3. Mitigation: Develops practical interventions that stabilize model behavior across languages and cultural contexts.

This work targets a neglected failure mode at the intersection of technical alignment and governance: culturally adaptive manipulation without explicit malicious intent. Addressing it strengthens defenses against AI-enabled information warfare, polarization, and erosion of democratic processes.

Improving cross-lingual value stability contributes to trustworthy AI systems that behave consistently regardless of user language, reducing the risk that advanced models become tools for large-scale manipulation.

Desired Mentee Background

Computer Science/ML, Maths, Anything quantitative that involves programming and ideally ML.

Desired Mentee Level of Education

Masters and above.

Other Mentee Requirements

We welcome applicants from technical, interdisciplinary, or policy backgrounds with a strong interest in AI Safety. Prior experience in multilingual NLP or alignment research is beneficial but not strictly required.

Required competencies:
• Proficiency in Python and familiarity with machine learning workflows
• Ability to work with APIs and open-source LLMs (e.g., Hugging Face ecosystem)
• Strong analytical skills and comfort working with experimental data
• Ability to read and implement research papers

Preferred (but not required):
• Experience in evaluation of LLMs
• Familiarity with prompting techniques, benchmarking, or red-teaming
• Knowledge of statistics or experimental design
• Interest in AI alignment, safety, governance, or societal impacts
• Experience with survey design, social science methods, or cross-cultural research

Language and domain considerations:
• Fluency in English required
• Proficiency in at least one additional language (especially German, French, Chinese, Korean, or other widely spoken languages) is highly valuable due to the multilingual nature of the project

Collaboration and commitment:
• Ability to commit at least 8–10 hours per week (more preferred for research-focused roles)
• Willingness to work in a small, collaborative team
• Comfort with independent problem solving in an open-ended research setting

Location:
• Open to mentees based in Germany (preferred by the program). Remote participation is possible; no specific city requirement.