Investigating the consequences of accidentally grading CoT during RL

·LessWrong··

This is an unofficial automated linkpost. Monitoring our models’ chains of thought (CoT) has proven to be an effective way to detect and track model misalignment, both during RL training and deployment. While CoT monitoring has been useful for safety, we and many others in the industry believe CoT monitorability could be fragile. We would like to preserve and leverage CoT monitorability for as long as possible, and we recently introduced a suite of evaluations designed to measure it. Directly gr...

Read full article →

Related Articles

An OpenAI model has disproved a central conjecture in discrete geometry
tedsanders · Hacker News · 20h ago
GitHub confirms breach of 3,800 repos via malicious VSCode extension
Timofeibu · Hacker News · 1d ago
Show HN: Rmux – A programmable terminal multiplexer with a Playwright-style SDK
shideneyu · Hacker News · 6h ago
Incident Report: May 19, 2026 – GCP Account Suspension
0xedb · Hacker News · 1d ago
Not alive, but not dead: disembodied human brains used for drug testing
Timofeibu · Hacker News · 19h ago