Researchers have developed KV-Fold, a method for extending the context window of large language models without retraining. The technique treats the key-value cache as the accumulator in a functional-programming-style fold: the model processes the input in sequential chunks while the cache carries a stable internal state from one chunk to the next. KV-Fold has demonstrated 100% exact-match retrieval on needle-in-a-haystack benchmarks across a range of context lengths and model sizes, while operating within the memory constraints of a single GPU.
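The paper itself is not excerpted here, but the fold pattern it describes maps naturally onto the key-value cache interface of autoregressive transformers. The sketch below illustrates that pattern in Python using the Hugging Face `transformers` library; the function name `kv_fold`, the chunk size, and the sliding-window cache eviction step are illustrative assumptions for keeping memory flat, not KV-Fold's actual mechanism, and it assumes the legacy tuple-of-(key, value) cache format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def kv_fold(model, tokenizer, text, chunk_size=512, max_cache_tokens=2048):
    """Process `text` chunk by chunk, threading the KV cache through
    each step like the accumulator of a fold."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    past = None  # the accumulator: starts empty, carries state forward
    for start in range(0, ids.size(1), chunk_size):
        chunk = ids[:, start : start + chunk_size]
        with torch.no_grad():
            out = model(chunk, past_key_values=past, use_cache=True)
        # Hypothetical eviction step: keep only the most recent entries so
        # cache memory stays bounded on a single GPU. The paper's actual
        # policy for maintaining a "stable internal state" may differ, and
        # a real system would also need to handle position re-indexing
        # (e.g., for RoPE models), which this sketch omits.
        past = tuple(
            (k[:, :, -max_cache_tokens:], v[:, :, -max_cache_tokens:])
            for k, v in out.past_key_values
        )
    return past

# Usage sketch: fold a long document through a small causal LM.
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# final_state = kv_fold(lm, tok, long_document)
```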
IMPACT Enables LLMs to process significantly longer contexts without costly retraining, potentially improving performance on tasks requiring extensive background information.
RANK_REASON The cluster contains an academic paper detailing a new method for LLM inference.