Blind deep-deployment evals for control & sabotage

LessWrong

Thanks to Ezra Newman for initial ideation and various people at Apollo Research for feedback. This short personal piece does not necessarily reflect the views of Apollo Research.

AI labs are preparing to automate their internal staff over the next year. Right now, control and sabotage evals try to estimate the safety of these internal deployments under adversarial pressure from power-seeking agents. Third-party auditors currently don’t have access to these real internal systems, so they have to ...

