Control Debt

LessWrong

Notes on the gap between what control evaluations assume and what labs actually implement.

It is 2027, and a frontier lab has grown suspicious: plausibly, their model is scheming. This is no surprise to the control team. For more than a year, they have worked on a protocol. Trusted monitoring has been tested on their benchmark setting across all agent actions, together with suspiciousness-based defer-to-trusted triggers, thresholds from the red-teaming policy, and human escalation at higher risk levels. In simulation, the safety/usefu...
