Advice for making robust-to-training model organisms

Redwood Research

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers, such as Roger et al. and Ryd et al.) have observed that they stop misbehaving after untargeted training, that is, training that doesn’t directly target the misbehavior. For example, we have observed that simple untargeted training method...
