[paper] Training on Documents About Monitoring Leads to CoT Obfuscation
Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels*** primary contributor** advice and mentorshipThis is the updated version of our earlier preliminary results post, covering the final results from our paper. The paper extends our preliminary work to eight models, a harder agentic task, CoT controllability analysis, and RL experiments.TL;DR: We use synthetic document finetuning (SDF) to give models knowledge that their chain-of-thought is being monitored. Models trained on these documents...
Read full article →