Model Size Scaling in 2023-2031

Vladimir_Nesov·LessWrong·Community·June 22, 2026

Token generation speed is constrained by how fast the relevant contents of HBM can be read, which are mostly the weights and the KV-cache. Suppose a model is large enough that more than half of HBM is read in a single pass over the weights, that the HBM is read in parallel within a scale-up system, and that N such systems are used in a pipeline. Then the time it takes to generate a token (without speculative decoding) is at least N times the time it takes to read more than half of an HBM stack. If we target a pa...
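The lower bound in the excerpt above can be sketched numerically. This is a minimal illustration rather than the author's own calculation: the function name `min_seconds_per_token`, the `fraction_read` parameter, and the stack capacity and bandwidth figures are all assumptions chosen for the example, not values taken from the article.

```python
def min_seconds_per_token(stack_capacity_bytes: float,
                          stack_bandwidth_bytes_per_s: float,
                          n_pipeline_stages: int,
                          fraction_read: float = 0.5) -> float:
    """Lower bound on the time to generate one token.

    All HBM stacks inside one scale-up system are read in parallel, so the
    time for one pipeline stage is the time to read `fraction_read` of a
    single stack. The N stages of the pipeline run one after another for a
    given token, so their times add up.
    """
    if not 0.0 < fraction_read <= 1.0:
        raise ValueError("fraction_read must be in (0, 1]")
    if n_pipeline_stages < 1:
        raise ValueError("n_pipeline_stages must be at least 1")
    seconds_per_stage = (fraction_read * stack_capacity_bytes
                         / stack_bandwidth_bytes_per_s)
    return n_pipeline_stages * seconds_per_stage


if __name__ == "__main__":
    # Hypothetical hardware numbers, for illustration only.
    capacity = 24e9      # assumed bytes per HBM stack
    bandwidth = 1.2e12   # assumed bytes per second per HBM stack
    stages = 4           # assumed number of scale-up systems in the pipeline

    bound = min_seconds_per_token(capacity, bandwidth, stages)
    print(f"at least {bound * 1e3:.1f} ms per token")
    print(f"at most {1.0 / bound:.1f} tokens per second")
```

With these assumed figures, reading half of one stack takes about 10 ms, so a 4-stage pipeline cannot go below roughly 40 ms per token, or about 25 tokens per second. Because the excerpt says "more than half", the default `fraction_read=0.5` gives the most optimistic version of the bound.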

