Qwen3.6-27B model achieves 80 TPS with 218k context on single RTX 5090

A user on Reddit's r/LocalLLaMA community has shared details of a high-performance local setup for the Qwen3.6-27B model. Using the NVFP4-with-MTP quantized checkpoint and the vLLM 0.19 inference server, they report roughly 80 tokens per second with a 218,000-token context window on a single RTX 5090 graphics card. The setup builds on their earlier experiments with Qwen3.5-27B and illustrates how far local LLM deployment efficiency has advanced.

Summary written by gemini-2.5-flash-lite from 1 source.
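For readers who want to try something similar, here is a minimal sketch using vLLM's offline Python API. The model name and context length come from the post; the memory-utilization value and sampling settings are illustrative assumptions, and the MTP/speculative-decoding configuration is omitted because the post's exact vLLM 0.19 flags are not given here.

```python
# Minimal sketch: loading the pre-quantized NVFP4 checkpoint with vLLM's
# offline API. Model name and context size are from the Reddit post;
# gpu_memory_utilization and sampling settings are illustrative guesses.
from vllm import LLM, SamplingParams

llm = LLM(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",  # pre-quantized HF checkpoint
    max_model_len=218_000,        # context window reported in the post
    gpu_memory_utilization=0.95,  # assumed value for a 32 GB RTX 5090
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain what NVFP4 quantization is."], params)
print(outputs[0].outputs[0].text)
```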

IMPACT Demonstrates efficient local deployment of large-context models, potentially lowering the barrier to advanced LLM use on consumer hardware.

RANK_REASON Release of a specific model version with performance metrics shared by a community member.

Read on r/LocalLLaMA →

COVERAGE [1]

  1. r/LocalLLaMA TIER_1 · /u/Kindly-Cantaloupe978

    Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19

    Qwen3.6-27B is out for a few days and the NVFP4 with MTP is dropped earlier on HF: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP

    Can follow the…
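Since the post serves the model with vLLM, which exposes an OpenAI-compatible HTTP API (by default on port 8000), a client-side sketch might look like the following. The endpoint address, and the assumption that the server registers the checkpoint under its Hugging Face name, are illustrative.

```python
# Hypothetical client call against a locally running vLLM server.
# vLLM's server speaks the OpenAI API; the api_key value is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    messages=[{"role": "user", "content": "How large is your context window?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```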