Analysis of Metastable States in the Transformer Activation Space
Part 1: Do Metastable Token Clusters Exist in Trained Transformers?

This is the first entry in a series. Over about ten parts, the series will work through a few humble experiments that test a mathematical theory of attention against real trained transformers.

A project summary: a recent paper by Geshkovski, Letrouit, Polyanskiy, and Rigollet models attention as a dynamical system on the sphere and proves that tokens cluster and drift toward consensus, with a metastable two-timescale structure...