Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

Sebastian Raschka, PhD · AI · May 16, 2026

After a short family break, I am excited to be back and catching up on a busy few weeks of open-weight LLM releases. The thing that stood out to me is how much newer architectures are focused on long-context efficiency. As reasoning models and agent workflows keep more tokens around (for longer), KV-cache size, memory traffic, and attention cost quickly become the main constraints, and LLM developers are adding a growing number of architecture tricks to reduce those costs. The main examples I want...

