When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
We've found a method that tells you:How functionally similar two neural networks are across ALL inputs,Computed solely from the weights (i.e. no data),Using a principled generalization of cosine similarity.There's only one catch: you have to use a tensor network.We've already shown that tensor-transformer variants are performant (this isn't a novel claim, see these papers for MLPs and Attention), so here we're focusing on the interpretability advances. Linear Algebra Applies to TensorsA tensor n...
Read full article →