An Introduction to Exemplar Partitioning for Mechanistic Interpretability

·LessWrong··

Most of what we currently call "feature discovery" in language models is wrapped up in dictionary-learning methods like sparse autoencoders (SAEs) – which work, and which have been scaled to millions of features on frontier-scale models, but which bundle two distinct commitments into a single training objective: a reconstruction loss and a sparsity loss over a fixed size dictionary. Those commitments make sense if your goal is reconstructive decomposition – if you want to take an activation and ...

Read full article →

Related Articles

An OpenAI model has disproved a central conjecture in discrete geometry
tedsanders · Hacker News · 20h ago
GitHub confirms breach of 3,800 repos via malicious VSCode extension
Timofeibu · Hacker News · 1d ago
Show HN: Rmux – A programmable terminal multiplexer with a Playwright-style SDK
shideneyu · Hacker News · 6h ago
Incident Report: May 19, 2026 – GCP Account Suspension
0xedb · Hacker News · 1d ago
Not alive, but not dead: disembodied human brains used for drug testing
Timofeibu · Hacker News · 19h ago