[Linkpost] Interpreting Language Model Parameters

Lucius Bushnaq·LessWrong·AI Safety·May 5, 2026

This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1], and use it to decompose the parameters of a small[2] language model. VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about. Importan...


