Brittle model organisms obstruct deception elicitation work
This work was done by Advik Raj Basani with Daniel Tan and Chloe Li as part of SPAR Spring 2026.

tl;dr: Finetuning-based auditing methods for model organisms may unintentionally erase the deceptive behavior we are trying to measure, leading to an illusion of success.

We study secret side constraint (SSC) model organisms introduced by Cywiński et al. (2025). These model organisms (MOs) are LLMs fine-tuned to harbor a verifiable hidden behavior, creating a known ground truth to audit against. For in...