Defeating Introspection Adapters (and Why Threat Models Matter)

LessWrong

We demonstrated an attack against Introspection Adapters (Shenoy et al., 2026), a technique for detecting malicious fine-tunes. In short: an attacker who controls the model weights can apply a cheap, output-preserving transform that relocates the basis the auditor was calibrated against. This defeats the auditor with no observable change in model behavior.

After we discovered this attack, we asked Keshav Shenoy (the first author) what he thought. It turned out that their team had...
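To make the idea of an output-preserving change of basis concrete, here is a minimal toy sketch. It is not the attack from the paper, whose details are not given in this excerpt. It assumes a purely linear two-layer model, a 2-D hidden space, and a hypothetical auditor that reads a fixed hidden coordinate. The names `W_IN`, `W_OUT`, `PROBE`, and `rotate_hidden_basis` are invented for illustration.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def rotation(theta):
    """2x2 rotation matrix, which is orthogonal: Q^T Q = I."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

# Toy linear model (assumption): hidden = W_IN @ x, output = W_OUT @ hidden.
W_IN = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
W_OUT = [[0.5, -1.0], [2.0, 1.0]]

# Hypothetical auditor: a probe calibrated on the ORIGINAL hidden basis.
PROBE = [1.0, 0.0]

def forward(w_in, w_out, x):
    hidden = matvec(w_in, x)
    return hidden, matvec(w_out, hidden)

def rotate_hidden_basis(w_in, w_out, q):
    """Return (Q W_in, W_out Q^T). The product W_out Q^T Q W_in equals
    W_out W_in, so outputs are unchanged while hidden activations rotate."""
    return matmul(q, w_in), matmul(w_out, transpose(q))

def probe_score(hidden):
    return sum(p * h for p, h in zip(PROBE, hidden))

x = [1.0, 2.0, 3.0]
h0, y0 = forward(W_IN, W_OUT, x)
w_in2, w_out2 = rotate_hidden_basis(W_IN, W_OUT, rotation(math.pi / 2))
h1, y1 = forward(w_in2, w_out2, x)

print("outputs match:", all(abs(a - b) < 1e-9 for a, b in zip(y0, y1)))
print("probe before / after:", probe_score(h0), probe_score(h1))
```

In this toy setting the model's outputs are identical before and after the rotation, but the fixed probe reads a different value because the direction it was calibrated on no longer means the same thing. Real networks have nonlinearities and normalization layers that restrict which transforms preserve outputs, so this should be read only as an intuition pump.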

