An individual has documented a three-month project to optimize LLM inference speed on a single RTX 3090 Ti, reaching up to 49 tokens per second with the Qwen3.6-27B model. The speedup came from a multi-token prediction (MTP) technique integrated into llama.cpp, which proved more stable and faster on longer outputs than other speculative decoding methods such as DFlash. The project also adjusted the reasoning budget, which saved time without sacrificing quality, and highlighted the significant impact of cache reuse on prefill.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Local LLM inference speeds are improved, potentially enabling more responsive AI applications on consumer hardware.
RANK_REASON The cluster details technical experiments and optimizations for running a specific LLM locally, including performance metrics and comparisons of different techniques.