Scheming Evals Mislead in Both Directions
We spent several weeks measuring in-context scheming, the behavior where a model covertly pursues a misaligned goal while outwardly appearing to comply, and the result that ended up surprising us had very little to do with whether models scheme and almost everything to do with whether we could believe our own instruments. Two of the behavioral detectors that this field routinely relies on gave us confidently wrong answers inside the same project, one of them by manufacturing a dramatic signal th...
Read full article →