AI Reading Group | Persona Features Control Emergent Misalignment


Persona Features Control Emergent Misalignment
Miles Wang, Tom Dupre la Tour, Olivia Watkins, Aleksandar Makelov, Ryan Andrew Chi, Samuel Miserendino, Jeffrey George Wang, Achyuta Rajaram et al.
We start the latest season of the Bliss Reading Group with three papers on AI alignment, hosted by Jonas Loos and Tom Neuhäuser, beginning with Persona Features Control Emergent Misalignment by Wang et al. (2025).
Building on the earlier discovery of "emergent misalignment", where fine-tuning a model on a narrow task leads to broadly misaligned behaviour, Wang et al. investigate why this happens. Using sparse autoencoders to compare model internals before and after fine-tuning, they identify a set of "misaligned persona" features in activation space that appear to control this behaviour.
The paper raises compelling questions for discussion: Why does narrow bad-advice training activate these broad persona features? How reliable are SAE-based mitigations in practice? And what does this tell us about the internal "character" that models develop through training?
🎟️ Register below!

