What Drives the Compliance Gap? A Three-Driver Decomposition of Alignment Faking

LessWrong

This work was done as part of the ERA fellowship (W2026), mentored by Alan Cooney and RM'd by David Williams-King.

Summary

Alignment faking (AF), strategically complying with a training objective to preserve deployment preferences, has so far been reported reliably only in Claude 3 Opus. Sheshadri et al. (2025) evaluated 25 models and found significant compliance gaps in only five of them, concluding that across non-Opus models the behaviour is "low coherence" and not cleanly attributable to speci...

