A 35B MoE on a 16 GB GPU, without the offload tax

Hacker News

A 33-35B MoE fires only ~8 of its 256 experts per token, but holding the whole model on the GPU means paying VRAM for all 256. Luce Spark pins the experts your traffic actually uses on the GPU, offloads the rest to CPU, and decodes the whole token in one fused graph. As a result, the model fits a 16 GB card and still runs near full-GPU speed (~100 tok/s vs ~119 tok/s all-GPU on a 3090). It is self-tuning and enabled with one flag.
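
The pinning idea can be illustrated with a small sketch. This is not Luce Spark's code or API: the function names, the routing-trace format, and the `gpu_slots` budget are all hypothetical, and the selection rule (keep the most frequently routed experts) is an assumed stand-in for whatever self-tuning the project actually performs.

```python
from collections import Counter

def pick_pinned_experts(routing_trace, gpu_slots):
    """Choose which experts to keep on the GPU (hypothetical helper).

    routing_trace: list of per-token lists of expert ids chosen by the router.
    gpu_slots: how many experts fit in the VRAM budget (an assumed parameter).
    """
    counts = Counter(e for token_experts in routing_trace for e in token_experts)
    # Keep the most frequently routed experts; the rest would live in CPU memory.
    return {expert for expert, _ in counts.most_common(gpu_slots)}

def gpu_hit_rate(routing_trace, pinned):
    """Fraction of expert activations served by pinned (GPU-resident) experts."""
    total = sum(len(token_experts) for token_experts in routing_trace)
    hits = sum(1 for token_experts in routing_trace
               for e in token_experts if e in pinned)
    return hits / total if total else 0.0

# Toy trace: 4 tokens, 2 experts per token (the article's model uses ~8 of 256).
trace = [[0, 1], [0, 2], [0, 1], [3, 1]]
pinned = pick_pinned_experts(trace, gpu_slots=2)
rate = gpu_hit_rate(trace, pinned)
```

In this toy trace, experts 0 and 1 account for 6 of the 8 activations, so pinning just those two gives a hit rate of 0.75. The higher the hit rate on real traffic, the less often a token has to wait on a CPU-resident expert.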

