Qwen3.6-27B model achieves 80 TPS with 218k context on single RTX 5090

A user on Reddit's r/LocalLLaMA community has shared details of a high-performance local setup for the Qwen3.6-27B model. Using the NVFP4-with-MTP quantized checkpoint and the vLLM 0.19 inference server, they report roughly 80 tokens per second with a 218,000-token context window on a single RTX 5090 graphics card. The setup builds on their earlier experiments with Qwen3.5-27B and illustrates how far local LLM deployment efficiency has advanced.

Summary written by gemini-2.5-flash-lite from 1 source.
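For readers who want to try something similar, here is a minimal sketch using vLLM's offline Python API. The model name and context length come from the post; the memory-utilization value and sampling settings are illustrative assumptions, and the MTP/speculative-decoding configuration is omitted because the post's exact vLLM 0.19 flags are not given here.

```python
# Minimal sketch: loading the pre-quantized NVFP4 checkpoint with vLLM's
# offline API. Model name and context size are from the Reddit post;
# gpu_memory_utilization and sampling settings are illustrative guesses.
from vllm import LLM, SamplingParams

llm = LLM(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",  # pre-quantized HF checkpoint
    max_model_len=218_000,        # context window reported in the post
    gpu_memory_utilization=0.95,  # assumed value for a 32 GB RTX 5090
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain what NVFP4 quantization is."], params)
print(outputs[0].outputs[0].text)
```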

IMPACT Demonstrates efficient local deployment of large-context models, potentially lowering the barrier to advanced LLM use on consumer hardware.

RANK_REASON Release of a specific model version with performance metrics shared by a community member.

Read on r/LocalLLaMA →

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 · /u/Kindly-Cantaloupe978

    Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19

    Qwen3.6-27B is out for a few days and the NVFP4 with MTP is dropped earlier on HF: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP

    Can follow the…
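Since the post serves the model with vLLM, which exposes an OpenAI-compatible HTTP API (by default on port 8000), a client-side sketch might look like the following. The endpoint address, and the assumption that the server registers the checkpoint under its Hugging Face name, are illustrative.

```python
# Hypothetical client call against a locally running vLLM server.
# vLLM's server speaks the OpenAI API; the api_key value is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    messages=[{"role": "user", "content": "How large is your context window?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```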