MHC Interp #1: Previous-Token Heads Become Attention Sinks Under Manifold-Constrained Hyper-Connections

LessWrong

Background: Manifold-Constrained Hyper-Connections (mHC) is a new architecture introduced by DeepSeek and recently implemented in DeepSeek v4. mHC is a fix for the vanishing or exploding gradients caused by Hyper-Connections (HC) that still keeps HC's performance gains. Adding weights and biases in HC made signals from earlier layers harder to update, which made the residual stream less residual-streamy. HC is a cursed method of adding weights and biases onto the residual stream to simulate a wi...

