HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
By PulseAugur Editorial
Summary from 11 sources
Researchers are developing several novel methods to optimize the Key-Value (KV) cache in large language models, which is a major bottleneck for long-context processing. These approaches include training models to inherently produce compressible representations (KV-CAT), manipulating latent attention space for efficient steering (Memory Inception), and employing advanced quantization techniques like int4 and spectral denoising (eOptShrinkQ, HeadQ). Additionally, new strategies like WindowQuant for multimodal models and tierKV for distributed KV cache management aim to reduce latency and memory usage, with tierKV even demonstrating faster restoration of evicted blocks than GPU cache hits.
AI
IMPACT
New KV cache optimization techniques promise significant reductions in inference latency and memory usage for LLMs, enabling longer contexts and faster processing.
RANK_REASON
Multiple research papers propose novel techniques for KV-cache optimization in LLMs.
arXiv:2605.05699v1 Announce Type: cross Abstract: KV-cache quantization is framed as a quality--latency trade-off. We show it is \emph{inverted} on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT $+$ per-channel $\lambda$ $+$ per-group abs-max $+$…
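The abstract names the fused kernel's stages but is cut off before the details. A minimal NumPy sketch of that pipeline, where the group size, the λ estimator, and all shapes are my assumptions rather than the paper's, might look like:

```python
import numpy as np

def quantize_kv_int4(x, group_size=32, rng=np.random.default_rng(0)):
    """Hypothetical sketch: sign-randomized FFT, per-channel lambda scaling,
    then per-group abs-max int4 quantization.
    x: (tokens, channels) slice of the KV cache."""
    # 1. Sign randomization: flip channel signs with a fixed random pattern,
    #    spreading outliers before the transform.
    signs = rng.choice([-1.0, 1.0], size=x.shape[1])
    x = x * signs
    # 2. FFT along the channel axis (real view via rfft for this sketch).
    xf = np.fft.rfft(x, axis=1)
    xf = np.concatenate([xf.real, xf.imag], axis=1)
    # 3. Per-channel lambda: here a per-channel std, as a stand-in for
    #    whatever estimator the paper actually uses.
    lam = xf.std(axis=0) + 1e-8
    xf = xf / lam
    # 4. Per-group abs-max int4: scale each group of `group_size` channels
    #    so its max magnitude maps to 7, then round and clip.
    pad = (-xf.shape[1]) % group_size
    xf = np.pad(xf, ((0, 0), (0, pad)))
    g = xf.reshape(xf.shape[0], -1, group_size)
    scale = np.abs(g).max(axis=2, keepdims=True) / 7.0 + 1e-8
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale, lam, signs
```

On Apple Silicon the point of fusing these stages into one Metal kernel is that unified memory makes the dequantize-on-read path cheap; the sketch above only shows the math, not the kernel.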
arXiv:2605.05971v1 Announce Type: new Abstract: Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression me…
arXiv cs.LG
TIER_1 · Andy Zeyi Liu, Michael Zhang, Ilana Greenberg, Adam Alnasser, Lucas Baker, John Sous
arXiv:2605.06225v1 Announce Type: new Abstract: Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activat…
arXiv:2605.04075v1 Announce Type: new Abstract: Multimodal Large Language Models face severe challenges in computational efficiency and memory consumption due to the substantial expansion of the visual KV cache when processing long visual contexts. Existing KV cache compression m…
arXiv:2605.02905v1 Announce Type: new Abstract: We show that the key-value (KV) cache in transformer attention heads admits a natural decomposition into a low-rank \emph{shared context} component and a full-rank \emph{per-token} residual, well described by the spiked random matri…
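The decomposition this abstract describes can be illustrated with a plain SVD split into a few strong shared directions plus a per-token remainder; the rank cutoff and toy shapes below are illustrative assumptions, not the paper's spiked-random-matrix estimator:

```python
import numpy as np

def split_shared_context(K, rank=8):
    """Illustrative split of a key matrix K (tokens x head_dim) into a
    low-rank 'shared context' part and a full-rank per-token residual."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    shared = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # top singular directions
    residual = K - shared                           # per-token remainder
    return shared, residual

K = np.random.randn(1024, 128) @ np.random.randn(128, 128)  # toy keys
shared, residual = split_shared_context(K)
# The spiked-model claim: 'shared' captures a few strong directions common
# across tokens, while 'residual' behaves like bulk noise.
print(np.linalg.norm(shared), np.linalg.norm(residual))
```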
arXiv:2605.03562v1 Announce Type: new Abstract: KV-cache quantizers usually optimize storage-space reconstruction, even though attention reads keys through logits and values through attention-weighted readout. We argue that persistent cache error should be measured in model-visible coordinates. For keys, the visible object is …
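The framing here is that key error should be scored where attention sees it, in the query-key logits, and value error under the attention weights, rather than in raw storage space. A toy single-head NumPy illustration of that distinction (the shapes and the scaled-dot-product details are my assumptions):

```python
import numpy as np

def model_visible_key_error(Q, K, K_hat):
    """Storage-space vs logit-space ('model-visible') key error.
    Q: (q_tokens, d) queries; K, K_hat: (kv_tokens, d) exact/quantized keys."""
    storage = np.linalg.norm(K - K_hat)             # what naive quantizers minimize
    logits = Q @ K.T / np.sqrt(K.shape[1])          # what attention actually reads
    logits_hat = Q @ K_hat.T / np.sqrt(K.shape[1])
    visible = np.linalg.norm(logits - logits_hat)   # error in model-visible coordinates
    return storage, visible

def model_visible_value_error(A, V, V_hat):
    """Value error weighted by the attention readout.
    A: (q_tokens, kv_tokens) attention weights; V, V_hat: (kv_tokens, d_v)."""
    return np.linalg.norm(A @ V - A @ V_hat)
```

Two quantized caches with identical storage-space error can produce very different logit-space error, which is the gap this score-space view is meant to expose.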
arXiv:2605.02262v1 Announce Type: new Abstract: Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerable inference latency and GPU memory usage. Existing methods propose mixed-precision quantization of the key-value (KV) ca…
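The excerpt is cut off before the method, but mixed-precision KV-cache quantization generically means keeping the most important cached tokens at higher precision. A generic sketch of such an assignment, where the importance score, fraction, and bit-widths are all invented for illustration:

```python
import numpy as np

def assign_kv_bits(attn_scores, hi_frac=0.1, hi_bits=8, lo_bits=2):
    """Generic mixed-precision assignment: tokens receiving the most
    attention keep hi_bits; the rest drop to lo_bits.
    attn_scores: (kv_tokens,) accumulated attention mass per cached token."""
    k = max(1, int(hi_frac * len(attn_scores)))
    hi_idx = np.argsort(attn_scores)[-k:]      # top-k most-attended tokens
    bits = np.full(len(attn_scores), lo_bits)
    bits[hi_idx] = hi_bits
    return bits

scores = np.random.rand(1000)                  # toy attention mass
bits = assign_kv_bits(scores)
print(f"mean bits/token: {bits.mean():.2f}")   # rough memory-footprint estimate
```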
10 experiments, 3 models, one honest verdict: the quality story is real, the speed story needs a disclaimer, and there's a finding in the entropy data nobody talks about.
⏱ ~14 min read · 🔬 Deep Dive · ⚙️ LLM Inference · 🗜 Quantization · 🚀 Serving
The Problem
When your GPU's KV cache fills up, inference engines evict blocks and discard them. The next request with the same prefix re-runs full prefill from scratch, which is quadratic in sequence length. On a 30,000-token document that's 10+ seconds, every single time t…
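The trade-off the excerpt points at is restoring an evicted KV block from a slower tier versus recomputing prefill for the whole prefix. A toy cost model makes the gap concrete; every byte count, bandwidth, and throughput number below is invented for illustration, not a measurement from the tierKV paper:

```python
def restore_or_recompute(block_bytes, prefix_tokens,
                         restore_gbps=20.0, prefill_tok_per_s=3000.0):
    """Toy cost model: restore an evicted KV block from host memory/SSD
    vs. re-running prefill over the full prefix."""
    restore_s = block_bytes / (restore_gbps * 1e9)
    # Recompute pays for the whole prefix again, which is where the
    # quadratic blow-up over repeated evictions comes from.
    recompute_s = prefix_tokens / prefill_tok_per_s
    choice = "restore" if restore_s < recompute_s else "recompute"
    return choice, restore_s, recompute_s

# 30k-token document with ~2 GB of cached KV for the prefix (invented numbers):
choice, r, c = restore_or_recompute(2e9, 30_000)
print(choice, f"restore={r:.2f}s recompute={c:.2f}s")
```

With these made-up numbers, restoring takes about 0.1 s against roughly 10 s of recompute, which is consistent in spirit with the "10+ seconds" figure in the excerpt.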