[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

LessWrong

Authors: Reilly Haskins*, Bilal Chughtai**, Joshua Engels**

* primary contributor, ** advice and mentorship

This is the updated version of our earlier preliminary results post, covering the final results from our paper. The paper extends our preliminary work to eight models, a harder agentic task, CoT controllability analysis, and RL experiments.

TL;DR: We use synthetic document finetuning (SDF) to give models knowledge that their chain-of-thought is being monitored. Models trained on these documents...

