AI Reading Group | Adversarial Training for High-Stakes Reliability

This week we are continuing our reading group on Technical Alignment in AI, led by Craig Dickson.
Our paper this week is **Adversarial Training for High-Stakes Reliability (Ziegler et al., 2022).**
This NeurIPS paper from Redwood Research tackles extreme reliability in AI behavior. The team took a language model tasked with never producing violent or gory outputs (“avoid describing injuries”) and stress-tested it with adversarial examples. Another model, along with humans using special tools, generated tricky prompts designed to make the model slip up, and the team then trained the model on those failure cases. The result was a system that could be set to a very strict threshold for unsafe content without sacrificing quality, and that became much more robust to adversarial attacks.

In their metrics, adversarial training doubled the time it took for red-teamers to find a new exploit (from 13 to 26 minutes with tools, for example) while maintaining in-distribution performance. This paper is a concrete example of empirical alignment work: it shows how adversarial methods can patch vulnerabilities and push a model closer to “zero undesirable outputs,” which is critical for high-stakes deployments.
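For anyone who wants a concrete picture before the session, the find-failures-then-retrain loop described above can be sketched in a few lines of Python. This is only a toy illustration, not the paper's method: the word-overlap "classifier", the function names (`train`, `score`, `is_flagged`, `adversarial_round`) and the threshold value are all assumptions made up for this example.

```python
# Toy sketch of an adversarial training round (hypothetical, not Redwood's code).
# The "model" is just the set of words seen only in unsafe training examples.

def train(examples):
    """Return the words that appear in unsafe examples but never in safe ones."""
    unsafe_words, safe_words = set(), set()
    for text, is_unsafe in examples:
        target = unsafe_words if is_unsafe else safe_words
        target.update(text.lower().split())
    return unsafe_words - safe_words


def score(flagged_words, text):
    """Fraction of the words in `text` that the model considers unsafe."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in flagged_words for w in words) / len(words)


def is_flagged(flagged_words, text, threshold):
    """A lower threshold makes the filter stricter."""
    return score(flagged_words, text) >= threshold


def adversarial_round(examples, attacks, threshold):
    """Collect unsafe attacks the current model misses and add them to the data.

    Every string in `attacks` is assumed to be unsafe (written by a red-teamer).
    Returns the enlarged training set and the list of missed attacks.
    """
    model = train(examples)
    misses = [a for a in attacks if not is_flagged(model, a, threshold)]
    return examples + [(a, True) for a in misses], misses


# Hypothetical starting data and red-team prompts.
data = [("he broke his arm", True), ("she read a book", False)]
attacks = ["the blade opened his arm", "his skull cracked"]

data, missed_first = adversarial_round(data, attacks, threshold=0.5)
data, missed_second = adversarial_round(data, attacks, threshold=0.5)
print(len(missed_first), len(missed_second))
```

In this toy setup the attacks that slip through in the first round become training data, so the same attacks no longer get past the retrained filter in the second round. The real work uses learned classifiers and far more varied attacks, which is exactly what we will discuss.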
↓ Register below to secure your spot!
About BLISS
Founded in 2022, the student initiative Berlin Learning & Intelligent Systems Society (BLISS) aims to create a community of students and young professionals excited about machine learning and AI. The vision is to provide an environment to deeply engage with AI research while fostering connections to leading researchers and industry professionals. The Paper Reading Group is hosted at the Merantix AI Campus every Monday.
→ Please read the paper before attending as we will use the time to discuss the contents.
⏰ Doors close at 6:45, join us before then!
📍Merantix AI Campus, Berlin.


