Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks

·LessWrong··

TL;DRTraining against a CoT or summary-only monitor can lead to obfuscation of dangerous reasoning in unseen tasks. This strengthens the “don’t train against a monitor” claims.Figure 1. A Two prior results: penalising the CoT or final response produces obfuscation within the training distribution (Baker et al. 2025; Skaf et al. 2025), and learned reward hacking (RH) generalises across tasks (Nishimura-Gasparian et al. 2024). We combine them and find that obfuscation itself generalises: pressure ...

Read full article →

Related Articles

An OpenAI model has disproved a central conjecture in discrete geometry
tedsanders · Hacker News · 20h ago
GitHub confirms breach of 3,800 repos via malicious VSCode extension
Timofeibu · Hacker News · 1d ago
Show HN: Rmux – A programmable terminal multiplexer with a Playwright-style SDK
shideneyu · Hacker News · 6h ago
Incident Report: May 19, 2026 – GCP Account Suspension
0xedb · Hacker News · 1d ago
Not alive, but not dead: disembodied human brains used for drug testing
Timofeibu · Hacker News · 19h ago