The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

·LessWrong··

1) The safe-to-dangerous shift is a fundamental problem for eval realismSuppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A common approach is to use black-box alignment evaluations. However, alignment evaluations are only reassuring to the extent that the model can't reliably[1] distinguish the deployment distribution from the evaluation distribution, as it is otherwise...

Read full article →

Related Articles

An OpenAI model has disproved a central conjecture in discrete geometry
tedsanders · Hacker News · 20h ago
GitHub confirms breach of 3,800 repos via malicious VSCode extension
Timofeibu · Hacker News · 1d ago
Show HN: Rmux – A programmable terminal multiplexer with a Playwright-style SDK
shideneyu · Hacker News · 6h ago
Incident Report: May 19, 2026 – GCP Account Suspension
0xedb · Hacker News · 1d ago
Not alive, but not dead: disembodied human brains used for drug testing
Timofeibu · Hacker News · 19h ago