Control Debt

Ida Caspary · LessWrong · Community · May 10, 2026

Notes on the gap between what control evaluations assume and how control is implemented in labs. It is 2027, and a frontier lab has grown suspicious: plausibly, their model is scheming. This is no surprise to the control team. For more than a year, they have worked on a protocol. Trusted monitoring has been tested in their benchmark setting, across all agent actions, along with suspiciousness-based defer-to-trusted triggers, thresholds from the red-teaming policy, and human escalation for higher-risk cases. In simulation, the safety/usefu...


