[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

Reilly Haskins · LessWrong · Community · May 27, 2026

Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**

* primary contributor
** advice and mentorship

This is the updated version of our earlier preliminary results post, covering the final results from our paper. The paper extends our preliminary work to eight models, a harder agentic task, CoT controllability analysis, and RL experiments.

TL;DR: We use synthetic document finetuning (SDF) to give models knowledge that their chain-of-thought is being monitored. Models trained on these documents...
