Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

·Hacker News··

Today, Kog AI launches a tech preview of the Kog Inference Engine (KIE): 3,000 output tokens/s per request on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). This preview runs a 2B model, with support for large third-party MoE models coming next at similar speeds.

Read full article →

Related Articles

US bans differential privacy in Census data
nl · Hacker News · 2h ago
Arch Linux Now Believes Malware Incident Under Control: More Than 1,500 Packages
qwertox · Hacker News · 4h ago
Twenty One Zero-Days in FFmpeg
redbell · Hacker News · 18h ago
CRISPR tech selectively shreds cancer cells, including "undruggable" cancers
gmays · Hacker News · 1d ago
Kimi K2.7-Code: open-source coding model with better token efficiency
nekofneko · Hacker News · 1d ago