Brittle model organisms obstruct deception elicitation work

LessWrong

This work was done by Advik Raj Basani with Daniel Tan and Chloe Li as part of SPAR Spring 2026.

tl;dr: Finetuning-based auditing methods for model organisms may unintentionally erase the deceptive behavior we are trying to measure, leading to an illusion of success.

We study secret side constraint (SSC) model organisms introduced by Cywiński et al. (2025). These model organisms (MOs) are LLMs fine-tuned to harbor a verifiable hidden behavior, creating a known ground truth to audit against. For in...
