Steering Directions Are Explanations, Not Handles

JackYoung27·LessWrong·Community·May 26, 2026

A direction can be interpretable, causal, and predictive (and still a bad handle for intervention)

TLDR: In modern interpretability, we find “directions” inside a language model that seem to encode meaningful concepts. One such example is “This is about food.” A common next step is to steer the model in that direction, in the hopes it produces more of that concept. I show that even when a direction passes the tests we have for whether it “really” means something, the range over which the idea of ...
