Investigating the consequences of accidentally grading CoT during RL

LessWrong

This is an unofficial automated linkpost. Monitoring our models’ chains of thought (CoT) has proven to be an effective way to detect and track model misalignment, both during RL training and deployment. While CoT monitoring has been useful for safety, we and many others in the industry believe CoT monitorability could be fragile. We would like to preserve and leverage CoT monitorability for as long as possible, and we recently introduced a suite of evaluations designed to measure it. Directly gr...

