The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

Charlie Griffin·LessWrong·Community·May 14, 2026

1) The safe-to-dangerous shift is a fundamental problem for eval realism

Suppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won't do anything catastrophically dangerous once we deploy it. A common approach is to use black-box alignment evaluations. However, alignment evaluations are only reassuring to the extent that the model can't reliably[1] distinguish the deployment distribution from the evaluation distribution, as it is otherwise...
