BeamGPT: A new paradigm for attention

zw5·LessWrong·Community·June 28, 2026

I have found an operator that produces striking learning curves when used alongside standard attention in a nanoGPT-style character-level language model. It finds structure in the sequence that attention misses. The model learns a mix ratio of around 45% attention to 55% field operator, and this ratio seems consistent across layers. The operator is linear in sequence length, whereas standard attention is quadratic. The hybrid scaling model gives a savings factor of roughly 2.3 at long context. As you ...
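To make the mix ratio concrete, here is a minimal sketch of how a learned blend between two branches could work. It assumes a single scalar gate passed through a sigmoid, which the excerpt does not confirm; the names `mix_outputs`, `gate_logit`, and `logit_for_share` are hypothetical, and the field operator itself is not implemented, only stood in for by a placeholder vector.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit_for_share(share: float) -> float:
    """Inverse of sigmoid: the logit that yields a given attention share."""
    return math.log(share / (1.0 - share))


def mix_outputs(attn_out, field_out, gate_logit):
    """Blend two branch outputs element-wise.

    Assumption: one learned scalar gate per layer. The post may use a
    different parameterization.
    """
    a = sigmoid(gate_logit)  # fraction of the output taken from attention
    return [a * x + (1.0 - a) * y for x, y in zip(attn_out, field_out)]


# A gate that has settled at the roughly 45% / 55% split reported above.
gate_logit = logit_for_share(0.45)
attention_share = sigmoid(gate_logit)

# Placeholder branch outputs, chosen so the blend exposes the two weights.
mixed = mix_outputs([1.0, 0.0], [0.0, 1.0], gate_logit)
```

With these placeholder inputs, `mixed` holds the attention weight in its first entry and the field-operator weight in its second, so a trained gate can be read off directly.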
