Not all features are created equal

·LessWrong··

TL;DR Recent studies by Anthropic show that LLM features extracted via mechanistic interpretability fall into distinct categories, each with different properties. However, state-of-the-art auto-interpreters fail to account for this variety. In this article, I propose AIR (Auto-Interpretability Router). AIR is a new protocol that uses a sentence embedder to identify a feature's category and, based on the category, routes the most appropriate activation examples to the auto-interpreter. Results sh...

Read full article →

Related Articles

Codex logging bug may write TBs to local SSDs
vantareed · Hacker News · 8h ago
window.showDirectoryPicker opens up a whole new world
steveharrison · Hacker News · 3h ago
Google Hits 50% IPv6
barqawiz · Hacker News · 1d ago
Investors get real-time view of UK bond market activity for the first time
monkeydust · Hacker News · 8h ago
FDA advisors unanimously vote to approve Moderna's mRNA after agency drama
worik · Hacker News · 18h ago