BeamGPT: A new paradigm for attention

LessWrong

I have found an operator that achieves striking results in learning curves when used alongside standard attention in a nanoGPT-style character-level language model. It finds structure in the sequence that attention misses. The model learns a mix ratio of around 45% attention to 55% field operator, and this ratio seems consistent across layers. The operator is linear in sequence length, whereas standard attention is quadratic. The hybrid scaling model gives roughly 2.3 savings at long context. As you ...
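To make the idea of a learned mix concrete, here is a minimal sketch of blending a quadratic-cost attention branch with a linear-cost branch through a single learned gate. The excerpt does not define the field operator, so `running_mean` below is a purely hypothetical stand-in chosen only because it is causal and linear in sequence length. The names `hybrid_mix` and `gate_logit`, and the use of identity projections in the attention branch, are likewise assumptions for illustration and not the author's implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(x):
    """Single-head causal self-attention with identity projections (assumption).
    Each position attends to all earlier positions, so cost grows as O(T^2)."""
    d = len(x[0])
    out = []
    for t in range(len(x)):
        scores = [sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        w = softmax(scores)
        out.append([sum(w[s] * x[s][j] for s in range(t + 1)) for j in range(d)])
    return out

def running_mean(x):
    """HYPOTHETICAL stand-in for the field operator: a causal running mean.
    One pass with a running accumulator, so cost grows as O(T)."""
    acc = [0.0] * len(x[0])
    out = []
    for t, row in enumerate(x, start=1):
        acc = [a + r for a, r in zip(acc, row)]
        out.append([a / t for a in acc])
    return out

def hybrid_mix(x, gate_logit):
    """Blend the two branches; sigmoid(gate_logit) is the attention share."""
    alpha = 1.0 / (1.0 + math.exp(-gate_logit))
    attn = causal_attention(x)
    lin = running_mean(x)
    return [[alpha * a + (1.0 - alpha) * b for a, b in zip(ra, rb)]
            for ra, rb in zip(attn, lin)]

# A gate logit of log(0.45 / 0.55) corresponds to a 45% attention share.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = hybrid_mix(x, math.log(0.45 / 0.55))
```

In a trained model the gate would be a parameter updated by gradient descent, possibly one per layer; here it is fixed by hand only to mirror the 45/55 split mentioned in the text.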

