MHC Interp #1: Previous-Token Heads Become Attention Sinks Under Manifold-Constrained Hyper-Connections

·LessWrong··

Background:Manifold-Constrained Hyper-Connections (mHC) is a new architecture added by Deepseek and recently implemented in Deepseek v4.mHC is a fix that makes HC(Hyper-Connections) vanishing or exploding gradient caused by HC while still keeping the performance increases. As adding weights and biases on HC made signals from earlier layers harder to update making the residual stream less residual streamy.HC is a cursed method of adding weights and biases onto the residual stream to simulate a wi...

Read full article →

Related Articles

Accelerating Gemma 4: faster inference with multi-token prediction drafters
amrrs · Hacker News · 3d ago
ProgramBench: Can language models rebuild programs from scratch?
jonbaer · Hacker News · 1d ago
ZAYA1-8B matches DeepSeek-R1 on math with less than 1B active parameters
steveharing1 · Hacker News · 1d ago
OpenAI’s o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors
donsupreme · Hacker News · 6d ago
A couple million lines of Haskell: Production engineering at Mercury
unignorant · Hacker News · 6d ago