Defeating Introspection Adapters (and Why Threat Models Matter)

Nick Merrill · LessWrong · Community · June 4, 2026

We demonstrated an attack against Introspection Adapters (Shenoy et al., 2026), a technique for detecting malicious fine-tunes. In short: an attacker who controls the model weights can apply a cheap, output-preserving transform that relocates the basis the auditor was calibrated against. This defeats the auditor with no observable change in model behavior.

📄 Paper 💻 Code

After we discovered this attack, we asked Keshav Shenoy (the first author) what he thought. It turned out that their team had...
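To make the idea of an output-preserving change of basis concrete, here is a minimal toy sketch in NumPy. It is not the attack from the paper and not an Introspection Adapter: the two-matrix linear "model", the fixed linear `probe` standing in for an auditor, and all variable names are assumptions made purely for illustration. It only shows that folding an orthogonal matrix into adjacent weights leaves outputs unchanged while moving the internal representation away from the basis a probe was calibrated on.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4

# Toy "model" (an assumption, not a real architecture):
# hidden representation h = W_in @ x, output y = W_out @ h.
W_in = rng.normal(size=(d_hidden, d_in))
W_out = rng.normal(size=(d_out, d_hidden))

# Hypothetical auditor: a fixed linear probe calibrated against
# the ORIGINAL hidden basis.
probe = rng.normal(size=d_hidden)

# Random orthogonal matrix Q (Q.T @ Q = I) from a QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(d_hidden, d_hidden)))

# Fold the rotation into the weights: the hidden state becomes Q @ h,
# and the readout undoes it, since (W_out @ Q.T) @ (Q @ h) = W_out @ h.
W_in_rot = Q @ W_in
W_out_rot = W_out @ Q.T

x = rng.normal(size=(d_in, 32))            # a batch of 32 toy inputs
h, h_rot = W_in @ x, W_in_rot @ x          # hidden states before/after
y, y_rot = W_out @ h, W_out_rot @ h_rot    # outputs before/after

outputs_match = np.allclose(y, y_rot)
probe_scores_match = np.allclose(probe @ h, probe @ h_rot)
print("outputs match:", outputs_match)
print("probe scores match:", probe_scores_match)
```

In this toy setting the outputs agree up to floating-point error, while the probe's scores on the rotated hidden states generally do not, because the probe still points along a direction in the old basis. Real networks contain nonlinearities and normalization layers, so which transforms remain output-preserving depends on the architecture; the sketch makes no claim about that.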
