A Pipeline for Generating Synthetic Sabotage Trajectories to Red-Team Monitors
This post describes a proof-of-concept pipeline that turns a dataset of benign Claude Code transcripts into synthetic sabotage trajectories (strajs) for automated red-teaming of monitors. Control teams at AI companies could implement this approach to stress-test and improve their monitors. Third-party auditors could also use it to accelerate their own red-teaming of AI companies' monitoring infrastructure, along the lines of what David Rein at METR recently did. The dataset of Claude ...