BeamGPT: A new paradigm for attention
I have found an operator that achieves striking results in learning curves when used alongside standard attention in a nanoGPT-style character-level language model. It finds structure in the sequence that attention misses. The model learns a mix ratio of around 45% attention to 55% field operator, and this ratio seems consistent across layers. The operator is linear in sequence length, whereas standard attention is quadratic. The hybrid scaling model gives roughly 2.3x savings at long context. As you ...
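To make the learned mix ratio concrete, here is a minimal sketch of how one layer might blend the two branches. The excerpt does not say how the mix is parameterized, so the single scalar sigmoid gate is an assumption, as are the names `mix_outputs` and `gate_logit`; the branch outputs are placeholder vectors, not the real attention or field-operator outputs.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mix_outputs(attn_out, field_out, gate_logit):
    """Blend two branch outputs with one scalar gate (assumed parameterization).

    sigmoid(gate_logit) is the share given to attention; the rest goes to
    the field operator. In a real model gate_logit would be a trained parameter.
    """
    w_attn = sigmoid(gate_logit)
    return [w_attn * a + (1.0 - w_attn) * f for a, f in zip(attn_out, field_out)]


# A gate logit that corresponds to the reported 45% / 55% split.
gate_logit = math.log(0.45 / 0.55)
attn_share = sigmoid(gate_logit)

# Placeholder branch outputs chosen so the mix weights are visible directly.
mixed = mix_outputs([1.0, 0.0], [0.0, 1.0], gate_logit)
```

With these placeholder inputs, the first component of `mixed` is the attention share and the second is the field-operator share, so a learned gate like this can be read off per layer to check whether the ratio really is consistent.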