Steering Directions Are Explanations, Not Handles

·LessWrong··

A direction can be interpretable, causal, and predictive (and still a bad handle for intervention)TLDR: In modern interpretability, we find “directions” inside a language model that seem to encode meaningful concepts. One such example is “This is about food.” A common next step is to steer the model in that direction, in the hopes it produces more of that concept. I show that even when a direction passes the tests we have for whether it “really” means something, the range over which the idea of ...

Read full article →

Related Articles

US bans differential privacy in Census data
nl · Hacker News · 2h ago
Arch Linux Now Believes Malware Incident Under Control: More Than 1,500 Packages
qwertox · Hacker News · 4h ago
Twenty One Zero-Days in FFmpeg
redbell · Hacker News · 18h ago
CRISPR tech selectively shreds cancer cells, including "undruggable" cancers
gmays · Hacker News · 1d ago
Kimi K2.7-Code: open-source coding model with better token efficiency
nekofneko · Hacker News · 1d ago