EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

Apple ML Research··

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model’s memory footprint to quickly exceed device limits. While recent KV cache compression methods attempt to reduce memory usage, most apply cache eviction after processing the entire context, incurring unbounded peak memory usage. Addi...

Read full article →

Related Articles

OpenAI’s o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors
donsupreme · Hacker News · 1mo ago
Accelerating Gemma 4: faster inference with multi-token prediction drafters
amrrs · Hacker News · 1mo ago
A couple million lines of Haskell: Production engineering at Mercury
unignorant · Hacker News · 1mo ago
Using “underdrawings” for accurate text and numbers
samcollins · Hacker News · 1mo ago
ProgramBench: Can language models rebuild programs from scratch?
jonbaer · Hacker News · 1mo ago