Steering Directions Are Explanations, Not Handles
A direction can be interpretable, causal, and predictive (and still a bad handle for intervention)TLDR: In modern interpretability, we find “directions” inside a language model that seem to encode meaningful concepts. One such example is “This is about food.” A common next step is to steer the model in that direction, in the hopes it produces more of that concept. I show that even when a direction passes the tests we have for whether it “really” means something, the range over which the idea of ...
Read full article →