Advice for making robust-to-training model organisms

Alek Westover·Redwood Research·AI Safety·May 28, 2026

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training—training that doesn’t directly target the misbehavior. For example, we have observed that simple untargeted training method...
