KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
Researchers have developed KV-Fold, a novel method for extending the context window of large language models without requiring retraining. This technique treats the key-value cache as an accumulator in a functional programming-style fold, allowing the model to process sequential chunks of data while maintaining a stable internal state. KV-Fold has demonstrated 100% exact-match retrieval on needle-in-a-haystack benchmarks across various context lengths and model sizes, operating within the memory constraints of a single GPU.
AI IMPACT: Enables LLMs to process significantly longer contexts without costly retraining, potentially improving performance on tasks requiring extensive background information.
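The fold pattern described above can be sketched in a few lines. KV-Fold's actual update rule is not given here, so this is a generic illustration of treating the KV cache as a fold accumulator: the input is split into chunks, and a step function consumes one chunk at a time while carrying the cache forward. The `attend_chunk` step below is a hypothetical toy stand-in for a transformer forward pass, not the method's real computation.

```python
from functools import reduce

def attend_chunk(cache, chunk):
    # Hypothetical stand-in for a transformer forward pass: fold this
    # chunk's contribution into the cache and return the updated cache.
    # A real implementation would also compact the cache so its size
    # stays bounded (the "one-step recurrence" in the title).
    return cache + [sum(chunk)]

def kv_fold(chunks):
    # The KV cache plays the role of the fold's accumulator:
    #   state_{t+1} = step(state_t, chunk_t)
    # so arbitrarily long inputs are processed one bounded chunk at a time.
    return reduce(attend_chunk, chunks, [])

tokens = list(range(12))
chunks = [tokens[i:i + 4] for i in range(0, len(tokens), 4)]
print(kv_fold(chunks))  # -> [6, 22, 38]
```

Because only the current chunk and the accumulated cache are resident at any step, peak memory is governed by the chunk size plus the cache size rather than the full sequence length, which is what lets such schemes fit long contexts on a single GPU.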