AI Reading Group | Persona Features Control Emergent Misalignment

Name: AI Reading Group | Persona Features Control Emergent Misalignment
Start: 2026-04-27T16:45:00.000Z
End: 2026-04-27T18:30:00.000Z
Location: AI Campus Berlin

Persona Features Control Emergent Misalignment

Miles Wang, Tom Dupre la Tour, Olivia Watkins, Aleksandar Makelov, Ryan Andrew Chi, Samuel Miserendino, Jeffrey George Wang, Achyuta Rajaram et al.

We start the latest season of the Bliss Reading Group with three papers on alignment in AI, hosted by Jonas Loos and Tom Neuhäuser, beginning with Persona Features Control Emergent Misalignment by Wang et al. (2025).

Building on the earlier discovery of "emergent misalignment", where fine-tuning a model on a narrow task leads to broadly misaligned behaviour, Wang et al. investigate why it happens. Using sparse autoencoders (SAEs) to compare model internals before and after fine-tuning, they identify a set of "misaligned persona" features in activation space that appear to control this behaviour.
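
For readers new to the idea, here is a minimal toy sketch of what "comparing model internals before and after fine-tuning" with a sparse autoencoder could look like. It is not the paper's pipeline: the encoder weights, the activations, and all function names are hypothetical, and the only assumption is a standard ReLU SAE encoder. The sketch encodes activations from both models with the same encoder and ranks features by how much their mean activation increased.

```python
from statistics import mean


def relu(x):
    return max(0.0, x)


def sae_encode(activation, encoder_weights, encoder_bias):
    """Toy SAE encoder: features = ReLU(W @ activation + b)."""
    return [
        relu(sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(encoder_weights, encoder_bias)
    ]


def mean_feature_activations(activations, encoder_weights, encoder_bias):
    """Average each SAE feature over a batch of model activations."""
    encoded = [sae_encode(a, encoder_weights, encoder_bias) for a in activations]
    return [mean(column) for column in zip(*encoded)]


def rank_features_by_shift(before, after, encoder_weights, encoder_bias):
    """Return (feature_index, shift) pairs, largest increase first."""
    mean_before = mean_feature_activations(before, encoder_weights, encoder_bias)
    mean_after = mean_feature_activations(after, encoder_weights, encoder_bias)
    shifts = [a - b for a, b in zip(mean_after, mean_before)]
    return sorted(enumerate(shifts), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: 2-dimensional activations, 3 SAE features.
ENCODER_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
ENCODER_BIAS = [0.0, 0.0, -1.0]
ACTS_BEFORE = [[1.0, 0.0], [1.0, 0.0]]  # stand-in for the original model
ACTS_AFTER = [[1.0, 2.0], [1.0, 2.0]]   # stand-in for the fine-tuned model

ranking = rank_features_by_shift(
    ACTS_BEFORE, ACTS_AFTER, ENCODER_WEIGHTS, ENCODER_BIAS
)
print(ranking)
```

In this toy setup, the features at the top of the ranking are the ones whose activation grew most after fine-tuning, which is the kind of candidate one would then inspect by hand.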

The paper raises compelling questions for discussion: Why does narrow bad-advice training activate these broad persona features? How reliable are SAE-based mitigations in practice? And what does this tell us about the internal "character" that models develop through training?
