Scheming Evals Mislead in Both Directions

LessWrong

We spent several weeks measuring in-context scheming, the behavior where a model covertly pursues a misaligned goal while outwardly appearing to comply. The result that surprised us had very little to do with whether models scheme and almost everything to do with whether we could believe our own instruments. Two of the behavioral detectors that this field routinely relies on gave us confidently wrong answers inside the same project, one of them by manufacturing a dramatic signal th...
