Features of SAEs are universal - but only up to an unknown random rotation
Cross-model decoder-column cosine similarity says that two models learned the same features. Yet apply the SAE of one model to the activations of another, and its reconstruction score becomes negative. Why? How to fix it?

Epistemic status: I am confident in the core empirical claims. The acceptance thresholds were fixed before I looked at any results, the central result replicates consistently across two model scales (a 104k-parameter toy...
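To make the failure mode in the opening paragraph concrete, here is a minimal toy sketch. It is not the article's experiment. It assumes a hypothetical tied-weight ReLU "SAE" with orthonormal decoder columns (`D`), synthetic sparse codes (`S`), and a second "model" whose activations are the first model's activations under a random orthogonal map (`Q`). The realignment step uses orthogonal Procrustes on paired activations, which is a technique swapped in for illustration and may differ from the fix the article proposes. All names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 64, 16  # samples, activation width, dictionary size (all assumed)

# Hypothetical stand-in for model A's SAE: tied weights, orthonormal decoder columns.
D, _ = np.linalg.qr(rng.standard_normal((d, k)))  # shape (d, k)

def sae_reconstruct(X):
    """Encode with ReLU(X @ D), then decode with D.T."""
    codes = np.maximum(X @ D, 0.0)
    return codes @ D.T

def r2(X, X_hat):
    """Fraction of variance explained; negative when worse than predicting the mean."""
    resid = ((X - X_hat) ** 2).sum()
    total = ((X - X.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

# Model A activations: sparse non-negative codes times the dictionary.
S = rng.random((n, k)) * (rng.random((n, k)) < 0.2)
X_a = S @ D.T

# Model B: the same features in a randomly rotated basis.
# (QR of a Gaussian matrix gives an orthogonal map, possibly with a reflection.)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
X_b = X_a @ Q.T

# Orthogonal Procrustes: orthogonal R minimizing ||X_b @ R - X_a||_F,
# estimated from activations paired row by row (same inputs to both models).
U, _, Vt = np.linalg.svd(X_b.T @ X_a)
R = U @ Vt
X_b_aligned = X_b @ R

r2_same = r2(X_a, sae_reconstruct(X_a))                  # A's SAE on A
r2_raw = r2(X_b, sae_reconstruct(X_b))                   # A's SAE on B, no alignment
r2_aligned = r2(X_b_aligned, sae_reconstruct(X_b_aligned))  # A's SAE on realigned B

print(f"same model: {r2_same:.3f}")
print(f"cross-model, raw: {r2_raw:.3f}")
print(f"cross-model, aligned: {r2_aligned:.3f}")
```

In this toy the raw cross-model score collapses while the aligned score recovers. It need not go negative here, since the stand-in SAE has no biases and is not trained; the point is only that the same dictionary can be useless before a rotation is undone and near-perfect after.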