“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

LessWrong

TL;DR. Lie detectors for LLMs could be valuable for auditing and monitoring. But evaluating them requires testbeds where the model verifiably believes the opposite of what it says, which isn’t straightforward to establish. We find that most existing trained model organisms don’t clear this bar. We train 13 reasoning model organisms, with chain-of-thought evidence that they hold the alternative belief, as well as evidence that they generalise out of distribution. We also build a broad prompted-lying ...
