Features of SAEs are universal - but only up to an unknown random rotation

Jordan McCann·LessWrong·Community·May 31, 2026

Cross-model decoder-column cosine says that two models learned the same features. Apply the SAE of one model to the activations of another, and its reconstruction score becomes negative. Why? How to fix it?

Epistemic status: I am confident in the core empirical claims. The acceptance thresholds were fixed before I looked at any results, the central result replicates consistently across two model scales (a 104k-parameter toy...
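To make the claim in the title concrete, here is a minimal toy sketch, not the post's actual experiment. Everything in it is an assumption made for illustration: the "SAE" is a tied-weight ReLU autoencoder with orthonormal feature directions, "model B" is literally model A's activations under a random orthogonal map, and the map is recovered with orthogonal Procrustes on paired activations. The excerpt does not say how the author estimates the rotation, so Procrustes is only one plausible stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 64, 16  # samples, activation width, number of features (all hypothetical)

# Hypothetical "model A": k orthonormal feature directions, sparse non-negative codes.
W, _ = np.linalg.qr(rng.normal(size=(d, k)))                # (d, k), orthonormal columns
codes = rng.uniform(size=(n, k)) * (rng.uniform(size=(n, k)) < 0.2)
acts_a = codes @ W.T                                        # (n, d)

# Hypothetical "model B": the same features seen through a random orthogonal map
# (QR of a Gaussian matrix; it may include a reflection, not only a pure rotation).
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
acts_b = acts_a @ R

def toy_sae(x):
    """Tied-weight ReLU autoencoder standing in for model A's trained SAE."""
    return np.maximum(x @ W, 0.0) @ W.T

def variance_explained(x, x_hat):
    """Reconstruction score: 1 - residual / total variance. Can go below zero."""
    resid = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - resid / total

def procrustes_rotation(src, dst):
    """Orthogonal Q minimising ||src @ Q - dst||_F, via SVD of src.T @ dst."""
    u, _, vt = np.linalg.svd(src.T @ dst)
    return u @ vt

# A's SAE on A's own activations, then applied directly to B's activations.
score_same = variance_explained(acts_a, toy_sae(acts_a))
score_naive = variance_explained(acts_b, toy_sae(acts_b))

# Map B into A's basis, run A's SAE there, then map the reconstruction back.
Q = procrustes_rotation(acts_b, acts_a)
score_aligned = variance_explained(acts_b, toy_sae(acts_b @ Q) @ Q.T)

print(f"same model: {score_same:.3f}")
print(f"naive transfer: {score_naive:.3f}")
print(f"after alignment: {score_aligned:.3f}")
```

In this toy the features are identical by construction, so the gap between the naive and aligned scores comes entirely from the basis mismatch. Whether the naive score actually drops below zero depends on the data and the SAE; the sketch only shows that a rotation alone is enough to break direct transfer and that undoing it restores reconstruction.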
