A Pipeline for Generating Synthetic Sabotage Trajectories to Red-Team Monitors

LessWrong

This post describes a proof-of-concept pipeline that turns a dataset of benign Claude Code transcripts into synthetic sabotage trajectories (strajs) for automated red-teaming of monitors. Control teams at AI companies could implement this approach to stress-test and improve their monitors. Third-party auditors could also use it to accelerate their internal red-teaming of AI companies' monitoring infrastructure, such as the red-teaming David Rein at METR recently did. The dataset of Claude ...

