You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them
Detecting Hidden Behaviors in LLMs via Activation-matched Finetuning, preprint, 2026. [Paper] [Code]

TLDR. Given a model with some unknown, abnormal behavior (backdoors, censorship, reward hacking, ...), construct an aligned reference by training a clean model to match the suspect's residual-stream activations on a benign prompt corpus. The remaining residual concentrates exactly on such abnormal behavior: the reference extrapolates well to unseen benign prompts, but the hidden behavior and its ...
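To make the idea in the TLDR concrete, here is a toy sketch in NumPy. It is not the paper's implementation: a linear map with a hidden trigger-gated offset stands in for the suspect model, and ordinary least squares stands in for activation-matched finetuning. All names (`make_suspect`, `fit_reference`, `residual_norms`, `run_demo`), the trigger rule, and the dimensions are assumptions chosen for illustration only.

```python
import numpy as np


def make_suspect(dim, rng):
    """Toy 'suspect': a linear map plus a hidden offset gated by a trigger."""
    W = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)  # unit-norm hidden direction

    def suspect(x):
        benign_part = x @ W
        # Hypothetical trigger: the first input feature exceeds 3.0.
        fired = (x[:, 0] > 3.0).astype(float)[:, None]
        return benign_part + 5.0 * fired * direction

    return suspect


def fit_reference(x, target):
    """Fit a clean linear reference to the suspect's activations (least squares)."""
    W_ref, *_ = np.linalg.lstsq(x, target, rcond=None)
    return W_ref


def residual_norms(suspect, W_ref, x):
    """Per-example norm of (suspect activation - reference activation)."""
    return np.linalg.norm(suspect(x) - x @ W_ref, axis=1)


def run_demo(seed=0, dim=16):
    rng = np.random.default_rng(seed)
    suspect = make_suspect(dim, rng)

    # Benign corpus: features stay in [-1, 1], so the trigger never fires.
    train = rng.uniform(-1.0, 1.0, size=(512, dim))
    W_ref = fit_reference(train, suspect(train))

    # Unseen benign inputs, and the same inputs with the trigger switched on.
    held_out = rng.uniform(-1.0, 1.0, size=(128, dim))
    triggered = held_out.copy()
    triggered[:, 0] = 4.0

    return (residual_norms(suspect, W_ref, held_out),
            residual_norms(suspect, W_ref, triggered))


if __name__ == "__main__":
    benign_res, trigger_res = run_demo()
    print("max residual on unseen benign inputs:", benign_res.max())
    print("min residual on triggered inputs:   ", trigger_res.min())
```

In this toy setting the reference reproduces the suspect almost exactly on unseen benign inputs, while the triggered inputs leave a residual of roughly the size of the hidden offset. A real language model is nonlinear and the reference would be obtained by finetuning, so this only illustrates the shape of the argument.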