An Introduction to Exemplar Partitioning for Mechanistic Interpretability

LessWrong

Most of what we currently call "feature discovery" in language models is wrapped up in dictionary-learning methods like sparse autoencoders (SAEs) – which work, and which have been scaled to millions of features on frontier-scale models, but which bundle two distinct commitments into a single training objective: a reconstruction loss and a sparsity loss over a fixed-size dictionary. Those commitments make sense if your goal is reconstructive decomposition – if you want to take an activation and ...
