A 35B MoE on a 16 GB GPU, without the offload tax
A 33-35B MoE activates only ~8 of its 256 experts per token, yet keeping the whole model on the GPU means paying VRAM for all 256. Luce Spark pins the experts your traffic actually uses on the GPU, offloads the rest to CPU, and decodes each token in one fused graph, so a 33-35B MoE fits on a 16 GB card and still runs near full-GPU speed (~100 tok/s vs. ~119 tok/s all-GPU on a 3090). Self-tuning, one flag.
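To make the pinning idea concrete, here is a minimal sketch in Python. It is not Luce Spark's implementation: the function names, the flat (single-layer) expert ids, and the toy trace are all assumptions for illustration. It counts how often each expert is routed to in a traffic trace, pins the most frequent ones, and reports what fraction of activations would be served from the GPU.

```python
from collections import Counter


def choose_pinned_experts(routed_tokens, num_pinned):
    """Return the ids of the `num_pinned` most frequently routed experts.

    `routed_tokens` is a list where each entry holds the expert ids one
    token was routed to (hypothetical trace format, single layer).
    """
    counts = Counter(e for token in routed_tokens for e in token)
    return {expert for expert, _ in counts.most_common(num_pinned)}


def gpu_hit_rate(routed_tokens, pinned):
    """Fraction of expert activations served by pinned (GPU-resident) experts."""
    total = sum(len(token) for token in routed_tokens)
    hits = sum(1 for token in routed_tokens for e in token if e in pinned)
    return hits / total if total else 0.0


# Toy trace: 4 tokens, each routed to 2 of 4 experts.
trace = [[0, 1], [0, 2], [0, 1], [3, 1]]
pinned = choose_pinned_experts(trace, num_pinned=2)
print(sorted(pinned), gpu_hit_rate(trace, pinned))
```

In this toy trace, pinning half the experts covers three quarters of the activations. A skewed routing distribution is what makes this kind of pinning pay off; how skewed real traffic is depends on the model and workload.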