“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
TL;DR. Lie detectors for LLMs could be valuable for auditing and monitoring. But evaluating them requires testbeds where the model verifiably believes the opposite of what it says, which isn’t straightforward. We find that most existing trained model organisms don’t clear this bar. We train 13 reasoning model organisms, with evidence they hold the alternative belief in chain-of-thought, as well as evidence that they have generalised out of distribution. We also build a broad prompted-lying ...