I made a kernel 2.2x faster. It made my training loop 3x slower

vishal-padia·Hacker News·Community·June 2, 2026

I wrote a fused decode-attention kernel for an RL training loop, got it 2.2× faster than the SDPA path it replaces at the microbenchmark level, dropped it in...
