Logits as a new monitor for evaluation awareness

LessWrong

TL;DR: We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence. The logit monitor outperforms LLM-judge monitoring of verbalized eval awareness, using 10× to 100× fewer rollouts, on Kimi K2.5 and Qwen 3 32B, across two tasks: (1) separating evaluation prompts (Fortress & Petri) from deployment prompts (WildChat), and (2) predicting which evaluation prompts will elicit rollouts with verbalized eval awareness. The logit monitor works for ...
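To make the idea of "estimating the probability of producing an eval-aware sentence throughout the CoT" concrete, here is a minimal sketch. It is not the authors' implementation; the excerpt does not show their method. It assumes that, at several positions in a chain of thought, we already have the model's per-token log-probabilities for a fixed probe sentence, and that the per-rollout score is the maximum sentence probability over positions. The function names, the max aggregation, and the numbers are all hypothetical.

```python
import math
from typing import Sequence


def sentence_logprob(token_logprobs: Sequence[float]) -> float:
    """Log-probability of a whole sentence: the sum of its per-token log-probs."""
    return sum(token_logprobs)


def monitor_score(per_position_token_logprobs: Sequence[Sequence[float]]) -> float:
    """Hypothetical aggregation: highest probe-sentence probability over CoT positions."""
    return max(
        math.exp(sentence_logprob(lps)) for lps in per_position_token_logprobs
    )


# Made-up per-token log-probs of a 3-token probe sentence,
# scored at three different positions in one chain of thought.
positions = [
    [-2.0, -1.5, -3.0],
    [-0.5, -0.7, -0.9],
    [-4.0, -2.0, -1.0],
]
score = monitor_score(positions)
print(f"monitor score: {score:.4f}")
```

In practice the per-token log-probabilities would come from a forward pass of the monitored model with the probe sentence appended at each position; other aggregations (mean, last position) are equally plausible under these assumptions.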

