Model Size Scaling in 2023-2031

LessWrong

Token generation speed is constrained by how fast the relevant contents of HBM can be read, which are mostly the weights and the KV-cache. Suppose a model is large enough that a single pass over the weights reads more than half of HBM, that the HBM is read in parallel within a scale-up system, and that N such systems are used in a pipeline. Because the pipeline stages run one after another for any given token, the time it takes to generate a token (without speculative decoding) is at least N times the time needed to read more than half of an HBM stack. If we target a pa...
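The bound described above can be sketched as a small calculation. This is only an illustrative sketch: the function name, its parameters, and the example capacity and bandwidth figures are assumptions chosen for round numbers, not values from the article or specifications of any particular HBM generation.

```python
def min_token_latency_s(stack_capacity_bytes: float,
                        stack_bandwidth_bytes_per_s: float,
                        fraction_read: float,
                        n_pipeline_stages: int) -> float:
    """Lower bound on the time to generate one token, in seconds.

    Within a scale-up system all HBM stacks are read in parallel, so one
    stage takes as long as reading `fraction_read` of a single stack.
    The N pipeline stages run sequentially for a given token, so their
    read times add up.
    """
    if not 0.0 < fraction_read <= 1.0:
        raise ValueError("fraction_read must be in (0, 1]")
    if n_pipeline_stages < 1:
        raise ValueError("n_pipeline_stages must be at least 1")
    full_stack_read_s = stack_capacity_bytes / stack_bandwidth_bytes_per_s
    return n_pipeline_stages * fraction_read * full_stack_read_s


# Hypothetical numbers for illustration only (not real hardware specs):
# a 32 GB stack read at 1 TB/s, half of it touched per token, 4 stages.
latency = min_token_latency_s(
    stack_capacity_bytes=32e9,
    stack_bandwidth_bytes_per_s=1e12,
    fraction_read=0.5,
    n_pipeline_stages=4,
)
max_tokens_per_s = 1.0 / latency
print(f"latency >= {latency * 1e3:.1f} ms, speed <= {max_tokens_per_s:.1f} tokens/s")
```

Since the model is assumed to read more than half of HBM, passing `fraction_read=0.5` gives the most optimistic case; any larger fraction or more pipeline stages only raises the bound.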
