April 27, 2026, 6:45 pm - 8:30 pm

AI Reading Group | Persona Features Control Emergent Misalignment

Location: AI Campus Berlin

Persona Features Control Emergent Misalignment

Miles Wang, Tom Dupre la Tour, Olivia Watkins, Aleksandar Makelov, Ryan Andrew Chi, Samuel Miserendino, Jeffrey George Wang, Achyuta Rajaram et al.

We start the latest season of the Bliss Reading Group with three papers on AI alignment, hosted by Jonas Loos and Tom Neuhäuser, beginning with Persona Features Control Emergent Misalignment by Wang et al. (2025).

Building on the earlier discovery of "emergent misalignment", Wang et al. dig into why this happens. Using sparse autoencoders to compare model internals before and after fine-tuning, they identify a set of "misaligned persona" features in activation space which appear to control this behaviour.

The paper raises compelling questions for discussion: Why does narrow bad-advice training activate these broad persona features? How reliable are SAE-based mitigations in practice? And what does this tell us about the internal "character" that models develop through training?


Europe’s Hub for AI.