vLLM
PulseAugur coverage of vLLM — every cluster mentioning vLLM across labs, papers, and developer communities, ranked by signal.
- used by Nexus Labs 90%
- used by H.1000 Gnome 80%
- used by graphics processing unit 70%
- used by llama-cpp-python 70%
- used by LM Studio 70%
- used by Fp8 70%
- competes with Text Generation Inference 70%
- used by Ray 70%
- used by Prometheus 70%
- used by Horizon 2020 70%
- used by A100 70%
- uses Anyscale, Inc. 70%
- 2026-06-04 product_launch vLLM released version 0.22.1, including a fix for DeepSeek-V4 initialization compatibility.
- 2026-05-29 product_launch vLLM merged a pull request for a new HIP W4A16 kernel, enhancing performance.
- 2026-05-28 product_launch vLLM released version 0.22.0rc3.
- 2026-05-26 product_launch Nexus Labs implemented and tested vLLM's prefix caching feature, observing significant latency improvements for AI agents.
- 2026-05-15 product_launch vLLM released version 0.21.1rc0.
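The prefix caching entry in the timeline above refers to reusing already-computed KV-cache blocks when requests share a leading run of tokens, as agent prompts with a fixed system prompt often do. The following is a toy sketch of that idea in plain Python, not vLLM's implementation: the block size, the `ToyPrefixCache` class, and the string placeholders standing in for KV blocks are all hypothetical. In vLLM itself the feature is controlled by the `enable_prefix_caching` engine option.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical block size, in tokens


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block of tokens, chaining in the previous block's hash
    so that a block's key depends on the entire prefix before it."""
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class ToyPrefixCache:
    """Toy cache mapping chained block hashes to placeholder KV entries."""

    def __init__(self):
        self.blocks = {}

    def process(self, tokens):
        """Return (reused_blocks, computed_blocks) for one request."""
        reused = computed = 0
        for h in block_hashes(tokens):
            if h in self.blocks:
                reused += 1  # prefix seen before: skip recomputation
            else:
                self.blocks[h] = "kv-placeholder"  # stand-in for real KV data
                computed += 1
        return reused, computed


cache = ToyPrefixCache()
system_prompt = list(range(8))  # 8 shared tokens = 2 full blocks
first = cache.process(system_prompt + [100, 101, 102, 103])
second = cache.process(system_prompt + [200, 201, 202, 203])
print(first, second)
```

In this sketch the second request reuses the two blocks covering the shared system prompt and only computes the block that differs, which is the mechanism behind the latency improvement described for agent workloads.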
-
User builds custom LLM server with EPYC CPU and 4x RTX 3090 GPUs
A user has completed the assembly of a powerful custom server designed for running large language models (LLMs). The build features an AMD EPYC 9575F processor, 768GB of RAM, and four NVIDIA RTX 3090 GPUs with a total o…
-
Kimi-K2.6 performance on 8x B200 GPUs queried
A user on Reddit is seeking performance estimates for running the Kimi-K2.6 model on an 8x NVIDIA B200 GPU setup. They are specifically interested in throughput figures for long input and output sequences with a concurr…
-
User doubles LLM inference speed by fixing PCIe slot bottleneck
A user building a multi-GPU setup for local LLM inference discovered a significant performance bottleneck caused by a misconfigured PCIe slot. One of the four RTX 3090 GPUs was incorrectly placed in a slot that only sup…
-
Huawei KVarN boosts vLLM KV-cache for larger AI context
Huawei has released KVarN, a new backend for the vLLM framework that enhances KV-cache quantization. This innovation aims to significantly increase context window sizes, with one source suggesting a 35x improvement. KVa…
-
Kubernetes operators enable scale-to-zero for LLM serving
New Kubernetes operators are emerging to address the cost of running large language models, particularly the issue of idle GPUs burning money. Hearth, an alpha-stage operator, allows users to declaratively serve open-so…
-
Hugging Face updates ASR leaderboard, vLLM advances to v1
Hugging Face has updated its Open ASR Leaderboard with a new entry called Benchmaxxer Repellant. Separately, vLLM has moved from version 0 to version 1, with a focus on accuracy before corrections are applied in reinforcement learning.
-
Gemma 4 12B local AI model requires configuration tweaks for optimal performance
Google's Gemma 4 12B model shows promise for local AI setups, but users report that default configurations in tools like LM Studio can hinder its reasoning capabilities. Specific adjustments to Jinja templates and sampl…
-
vLLM fixes DeepSeek-V4 init compatibility in new release
vLLM has released version 0.22.1, with a release candidate v0.22.1rc2 also available. These releases address a compatibility issue with CUTLASS fmin initialization specifically for the DeepSeek-V4 model. The fix ensures…
-
Google DeepMind releases multimodal Gemma 4 12B for laptops
Google DeepMind has released Gemma 4 12B, an open-source multimodal AI model capable of processing text, images, audio, and video natively. This model is designed to run on consumer laptops with as little as 16 GB of RA…
-
llama.cpp build b9455 achieves 70+ tokens/sec on Qwen3.6-27B
A user on Reddit's r/LocalLLaMA community shared impressive performance gains using a new build of llama.cpp, specifically version b9455. This updated version, when combined with tensor splitting across two RTX 3090 GPU…
-
Nex-AGI releases open-source agentic model Nex-N2
Nex-AGI has released and open-sourced its new agentic model, Nex-N2, designed for real-world productivity tasks. This model boasts advanced coding and agentic capabilities, enabling it to handle complex, long-horizon ta…
-
vLLM releases 0.22.1rc1 with flashinfer-jit-cache update
vLLM has released version 0.22.1rc1, which stops using extra-index-url for flashinfer-jit-cache. The change concerns how that package is located at install time (extra-index-url is a package-index option), not vLLM's runtime caching. The release …
-
Macs vs. NVIDIA GPUs: Choosing the Right Hardware for Local LLMs
For running large language models locally, Apple Silicon Macs and NVIDIA GPUs offer distinct advantages. Macs excel at inference for larger models due to their unified memory architecture, allowing them to handle models…
-
Developers can cut LLM API costs with local pipelines
Developers can significantly reduce costs by building their own local LLM pipelines instead of relying solely on cloud APIs. While cloud services are ideal for production, local models like Llama 3 and Mistral offer suf…
-
Qwen2.5-32B achieves zero errors in 2,859 LLM code generation tests
A developer meticulously tested the Qwen2.5-32B model using the EvalScope framework, running 2,859 code generation prompts. The tests, which covered structured JSON output, function calling, and tool use, surprisingly y…
-
Nexus Labs team learns small eval gains are often statistical noise
A machine learning team at Nexus Labs discovered that a recent model promotion was based on a statistically insignificant performance gain. Their internal evaluation suite, which uses exact-match checks, showed a 2.1-po…
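Whether a gain of a couple of points on an exact-match suite is distinguishable from noise depends heavily on how many items the suite contains. A rough sketch of such a check using a two-proportion z-test follows. The suite size and accuracies are hypothetical and are not taken from the Nexus Labs report, and the test treats the two runs as independent samples, which is a simplification when both models are scored on the same items.

```python
import math


def two_proportion_z(acc_a, acc_b, n_a, n_b):
    """z statistic for the difference between two accuracies,
    treating each eval item as an independent Bernoulli trial."""
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (acc_b - acc_a) / se


def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical numbers: 500 items per run, 70.0% vs 72.0% exact match.
z = two_proportion_z(0.70, 0.72, 500, 500)
p = two_sided_p(z)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With these assumed numbers the p-value is far above the conventional 0.05 threshold, so a two-point difference on 500 items would not by itself justify promoting a model.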
-
AI inference verification achieved with bit-exact precision
Researchers have developed a method to verify AI inference results with bit-exact precision, overcoming the challenge posed by non-deterministic GPU arithmetic. Their approach analyzes accumulated rounding errors as an …
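A common source of the non-determinism mentioned above is that floating-point addition is not associative, so the same values reduced in a different order can round differently. The short Python illustration below only demonstrates that underlying effect on CPU doubles; it is not the researchers' verification method.

```python
import math

left = (0.1 + 0.2) + 0.3   # one accumulation order
right = 0.1 + (0.2 + 0.3)  # same values, different grouping

print(left == right)        # the two groupings round differently
print(abs(left - right))    # the gap is tiny but nonzero

# An exactly rounded sum does not depend on accumulation order.
values = [0.1, 0.2, 0.3]
print(math.fsum(values) == math.fsum(reversed(values)))
```

A parallel reduction on a GPU can choose a different grouping from run to run, which is why bit-exact comparison of results needs either a fixed reduction order or an analysis of the accumulated rounding error.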
-
Odysseus launches as privacy-focused, self-hosted AI workspace
Odysseus is a self-hosted AI workspace emphasizing local-first operation and user privacy. It integrates various functionalities including chat, agents, a cookbook for model management, deep research tools, model compar…
-
JetBrains ships Mellum2, Heretic tool aids censorship removal, NVIDIA launches Cosmos 3
JetBrains has released Mellum2, a 12-billion parameter Mixture-of-Experts model optimized for efficient local AI inference. Concurrently, a new tool called 'Heretic' has emerged on GitHub, designed to automatically remo…
-
AWS cuts LLM load times with GPUDirect Storage and FSx
AWS has introduced a new method to significantly speed up the loading of large language models onto GPU instances. By leveraging NVIDIA GPUDirect Storage (GDS) with Amazon FSx for Lustre, model weights can be loaded dir…