Investigating the consequences of accidentally grading CoT during RL

LessWrong

This is an unofficial automated linkpost. Monitoring our models’ chains of thought (CoT) has proven to be an effective way to detect and track model misalignment, both during RL training and deployment. While CoT monitoring has been useful for safety, we and many others in the industry believe CoT monitorability could be fragile. We would like to preserve and leverage CoT monitorability for as long as possible, and we recently introduced a suite of evaluations designed to measure it. Directly gr...

