The Economics of Speculative Decoding
Two underexplored axes: what MoE routing does to the decode roofline, and how compressed attention takes away the slack that used to make speculated tokens free.
Read full article →