Local LLM inference boosted to 49 tokens/sec with MTP optimization

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 1 source

Ian L. Paterson has detailed a three-month project to speed up local LLM inference on a single RTX 3090 Ti, ending at 39 to 49 tokens per second decode with the Qwen3.6-27B model, against a starting point of 43 tokens per second on vanilla llama.cpp. The gain came from a multi-token prediction (MTP) technique integrated into llama.cpp, which proved more stable and faster on longer outputs than other speculative decoding methods such as DFlash. The optimizations also included a reasoning budget adjustment, which saved time without sacrificing quality, and the experiments highlighted how much cache reuse matters for prefill.
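For readers unfamiliar with speculative decoding, the following is a minimal toy sketch of the general draft-and-verify idea that methods in this family build on. It is not llama.cpp's MTP implementation and not the author's code: `target_next`, `draft_next`, and the arithmetic inside them are hypothetical stand-ins for a large model and a cheap drafter, chosen only so the example runs without any model weights.

```python
def target_next(context):
    # Hypothetical stand-in for the large model: a deterministic toy rule.
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    # Hypothetical cheap drafter: agrees with the target except when the
    # context length is a multiple of 4, where it guesses wrong on purpose.
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def greedy_decode(prompt, n_new):
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n_new, k=3):
    # Draft k tokens cheaply, then verify them against the target.
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < n_new:
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))

        # In a real system the target scores all k positions in one batched
        # forward pass; here that pass is simulated position by position.
        verify_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's token, stop
                break
        else:
            # Every drafted token matched, so the same verify pass also
            # yields one extra token from the target.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):][:n_new], verify_passes


tokens, passes = speculative_decode([1, 2, 3], n_new=20)
print(tokens == greedy_decode([1, 2, 3], 20), passes)
```

In this toy setup the speculative output matches plain greedy decoding token for token, while the number of verify passes is smaller than the number of generated tokens. Real speedups depend on how often the drafter is accepted and on the cost of drafting, which this sketch does not model.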


IMPACT Local LLM inference speeds are improved, potentially enabling more responsive AI applications on consumer hardware.

RANK_REASON The cluster details technical experiments and optimizations for running a specific LLM locally, including performance metrics and comparisons of different techniques.


COVERAGE [1]

dev.to — LLM tag TIER_1 · Ian L. Paterson · 2026-05-18 19:59

Three Months of Speed-Up Experiments on a 3090 Ti: Autoregressive DFlash MTP for Qwen3.6-27B

The setup

The starting line was 43 tokens per second decode on vanilla llama.cpp. The finishing line, three months later, is 39 to 49 tokens per second decode that doesn't collapse at long context, using a completely different speculative decoding technique than the…

