Not all features are created equal

enricobottazzi·LessWrong·Community·June 22, 2026

TL;DR Recent studies by Anthropic show that LLM features extracted via mechanistic interpretability fall into distinct categories, each with different properties. However, state-of-the-art auto-interpreters fail to account for this variety. In this article, I propose AIR (Auto-Interpretability Router). AIR is a new protocol that uses a sentence embedder to identify a feature's category and, based on the category, routes the most appropriate activation examples to the auto-interpreter. Results sh...

Read full article →

Not all features are created equal

Related Articles