HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization
By PulseAugur Editorial
Summary from 11 sources
Researchers are developing several novel methods to optimize the Key-Value (KV) cache in large language models, which is a major bottleneck for long-context processing. These approaches include training models to inherently produce compressible representations (KV-CAT), manipulating latent attention space for efficient steering (Memory Inception), and employing advanced quantization techniques like int4 and spectral denoising (eOptShrinkQ, HeadQ). Additionally, new strategies like WindowQuant for multimodal models and tierKV for distributed KV cache management aim to reduce latency and memory usage, with tierKV even demonstrating faster restoration of evicted blocks than GPU cache hits.
AI
IMPACT
New KV cache optimization techniques promise significant reductions in inference latency and memory usage for LLMs, enabling longer contexts and faster processing.
RANK_REASON
Multiple research papers propose novel techniques for KV-cache optimization in LLMs.
arXiv:2605.05699v1 Announce Type: cross Abstract: KV-cache quantization is framed as a quality--latency trade-off. We show it is \emph{inverted} on Apple Silicon's unified memory: a single fused Metal kernel (sign-randomized FFT $+$ per-channel $\lambda$ $+$ per-group abs-max $+$…
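The abstract names the fused kernel's stages but is cut off before the details. A minimal NumPy sketch of that pipeline, where the group size, the λ estimator, and all shapes are my assumptions rather than the paper's, might look like:

```python
import numpy as np

def quantize_kv_int4(x, group_size=32, rng=np.random.default_rng(0)):
    """Hypothetical sketch: sign-randomized FFT, per-channel lambda scaling,
    then per-group abs-max int4 quantization.
    x: (tokens, channels) slice of the KV cache."""
    # 1. Sign randomization: flip channel signs with a fixed random pattern,
    #    spreading outliers before the transform.
    signs = rng.choice([-1.0, 1.0], size=x.shape[1])
    x = x * signs
    # 2. FFT along the channel axis (real view via rfft for this sketch).
    xf = np.fft.rfft(x, axis=1)
    xf = np.concatenate([xf.real, xf.imag], axis=1)
    # 3. Per-channel lambda: here a per-channel std, as a stand-in for
    #    whatever estimator the paper actually uses.
    lam = xf.std(axis=0) + 1e-8
    xf = xf / lam
    # 4. Per-group abs-max int4: scale each group of `group_size` channels
    #    so its max magnitude maps to 7, then round and clip.
    pad = (-xf.shape[1]) % group_size
    xf = np.pad(xf, ((0, 0), (0, pad)))
    g = xf.reshape(xf.shape[0], -1, group_size)
    scale = np.abs(g).max(axis=2, keepdims=True) / 7.0 + 1e-8
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale, lam, signs
```

On Apple Silicon the point of fusing these stages into one Metal kernel is that unified memory makes the dequantize-on-read path cheap; the sketch above only shows the math, not the kernel.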
arXiv:2605.05971v1 Announce Type: new Abstract: Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression me…
arXiv cs.LG
TIER_1 · Andy Zeyi Liu, Michael Zhang, Ilana Greenberg, Adam Alnasser, Lucas Baker, John Sous
arXiv:2605.06225v1 Announce Type: new Abstract: Steering large language models (LLMs) is usually done by either instruction prompting or activation steering. Prompting often gives strong control, but caches guidance tokens at every layer and can clutter long interactions; activat…
arXiv:2605.04075v1 Announce Type: new Abstract: Multimodal Large Language Models face severe challenges in computational efficiency and memory consumption due to the substantial expansion of the visual KV cache when processing long visual contexts. Existing KV cache compression m…
arXiv:2605.02905v1 Announce Type: new Abstract: We show that the key-value (KV) cache in transformer attention heads admits a natural decomposition into a low-rank \emph{shared context} component and a full-rank \emph{per-token} residual, well described by the spiked random matri…
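The decomposition this abstract describes can be illustrated with a plain SVD split into a few strong shared directions plus a per-token remainder; the rank cutoff and toy shapes below are illustrative assumptions, not the paper's spiked-random-matrix estimator:

```python
import numpy as np

def split_shared_context(K, rank=8):
    """Illustrative split of a key matrix K (tokens x head_dim) into a
    low-rank 'shared context' part and a full-rank per-token residual."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    shared = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # top singular directions
    residual = K - shared                           # per-token remainder
    return shared, residual

K = np.random.randn(1024, 128) @ np.random.randn(128, 128)  # toy keys
shared, residual = split_shared_context(K)
# The spiked-model claim: 'shared' captures a few strong directions common
# across tokens, while 'residual' behaves like bulk noise.
print(np.linalg.norm(shared), np.linalg.norm(residual))
```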
arXiv:2605.03562v1 Announce Type: new Abstract: KV-cache quantizers usually optimize storage-space reconstruction, even though attention reads keys through logits and values through attention-weighted readout. We argue that persistent cache error should be measured in model-visible coordinates. For keys, the visible object is …
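The framing here is that key error should be scored where attention sees it, in the query-key logits, and value error under the attention weights, rather than in raw storage space. A toy single-head NumPy illustration of that distinction (the shapes and the scaled-dot-product details are my assumptions):

```python
import numpy as np

def model_visible_key_error(Q, K, K_hat):
    """Storage-space vs logit-space ('model-visible') key error.
    Q: (q_tokens, d) queries; K, K_hat: (kv_tokens, d) exact/quantized keys."""
    storage = np.linalg.norm(K - K_hat)             # what naive quantizers minimize
    logits = Q @ K.T / np.sqrt(K.shape[1])          # what attention actually reads
    logits_hat = Q @ K_hat.T / np.sqrt(K.shape[1])
    visible = np.linalg.norm(logits - logits_hat)   # error in model-visible coordinates
    return storage, visible

def model_visible_value_error(A, V, V_hat):
    """Value error weighted by the attention readout.
    A: (q_tokens, kv_tokens) attention weights; V, V_hat: (kv_tokens, d_v)."""
    return np.linalg.norm(A @ V - A @ V_hat)
```

Two quantized caches with identical storage-space error can produce very different logit-space error, which is the gap this score-space view is meant to expose.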
arXiv:2605.02262v1 Announce Type: new Abstract: Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerable inference latency and GPU memory usage. Existing methods propose mixed-precision quantization of the key-value (KV) ca…
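The excerpt is cut off before the method, but mixed-precision KV-cache quantization generically means keeping the most important cached tokens at higher precision. A generic sketch of such an assignment, where the importance score, fraction, and bit-widths are all invented for illustration:

```python
import numpy as np

def assign_kv_bits(attn_scores, hi_frac=0.1, hi_bits=8, lo_bits=2):
    """Generic mixed-precision assignment: tokens receiving the most
    attention keep hi_bits; the rest drop to lo_bits.
    attn_scores: (kv_tokens,) accumulated attention mass per cached token."""
    k = max(1, int(hi_frac * len(attn_scores)))
    hi_idx = np.argsort(attn_scores)[-k:]      # top-k most-attended tokens
    bits = np.full(len(attn_scores), lo_bits)
    bits[hi_idx] = hi_bits
    return bits

scores = np.random.rand(1000)                  # toy attention mass
bits = assign_kv_bits(scores)
print(f"mean bits/token: {bits.mean():.2f}")   # rough memory-footprint estimate
```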
10 experiments, 3 models, one honest verdict: the quality story is real, the speed story needs a disclaimer, and there's a finding in the entropy data nobody talks about.
⏱ ~14 min read · 🔬 Deep Dive · ⚙️ LLM Inference · 🗜 Quantization · 🚀 Serving
The Problem
When your GPU's KV cache fills up, inference engines evict blocks and discard them. The next request with the same prefix re-runs full prefill from scratch, which is quadratic in sequence length. On a 30,000-token document that's 10+ seconds, every single time t…
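The trade-off the excerpt points at is restoring an evicted KV block from a slower tier versus recomputing prefill for the whole prefix. A toy cost model makes the gap concrete; every byte count, bandwidth, and throughput number below is invented for illustration, not a measurement from the tierKV paper:

```python
def restore_or_recompute(block_bytes, prefix_tokens,
                         restore_gbps=20.0, prefill_tok_per_s=3000.0):
    """Toy cost model: restore an evicted KV block from host memory/SSD
    vs. re-running prefill over the full prefix."""
    restore_s = block_bytes / (restore_gbps * 1e9)
    # Recompute pays for the whole prefix again, which is where the
    # quadratic blow-up over repeated evictions comes from.
    recompute_s = prefix_tokens / prefill_tok_per_s
    choice = "restore" if restore_s < recompute_s else "recompute"
    return choice, restore_s, recompute_s

# 30k-token document with ~2 GB of cached KV for the prefix (invented numbers):
choice, r, c = restore_or_recompute(2e9, 30_000)
print(choice, f"restore={r:.2f}s recompute={c:.2f}s")
```

With these made-up numbers, restoring takes about 0.1 s against roughly 10 s of recompute, which is consistent in spirit with the "10+ seconds" figure in the excerpt.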