SGLang

PulseAugur coverage of SGLang: every cluster mentioning SGLang across labs, papers, and developer communities, ranked by signal.

Total · 30d

49 over 90d

Releases · 30d

0 over 90d

Papers · 30d

15 over 90d

TIER MIX · 90D

frontier release 7
significant 4
research 10
tool 25
commentary 2
meme 1

TOPICS

infra 31
product 25
model release 22
paper 15
other 3
safety 2
funding 1

RELATIONSHIPS

used by vLLM 70%
used by transformers 70%
used by graphics processing unit 70%
used by Ollama 70%
used by llama-cpp-python 60%
affiliated with vLLM 50%
affiliated with transformers 50%
competes with vLLM 50%
used by llama.cpp 50%
used by Raspberry Pi 50%

TIMELINE

2026-01-09 · product_launch · SGLang released version 0.3.1 of its model gateway, featuring performance and memory improvements.

SENTIMENT · 30D

19 day(s) with sentiment data

RECENT · PAGE 1/3 · 49 TOTAL

TOOL · CL_80109 · Jun 9 · 04:00

BlendServe system boosts LLM offline inference throughput

Researchers have developed BlendServe, a new system designed to optimize offline inference for auto-regressive large language models. BlendServe combines resource overlapping and prefix sharing techniques to maximize th…

TOOL · CL_78725 · Jun 8 · 19:31

LLM Inference Handbook Explains Token Generation and Optimization

This handbook delves into the engineering discipline of Large Language Model (LLM) inference, explaining how models generate tokens and the advanced optimization techniques used in production systems. It covers fundamen…

TOOL · CL_75555 · Jun 7 · 01:09

Prefix Caching Slashes LLM Prefill Costs by 80%

A new technical article explores prefix caching as a method to significantly reduce the computational cost of processing long prompts in large language models. This technique is particularly effective for workloads like…
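The idea behind prefix caching can be sketched in a few lines. The snippet below is a toy illustration only, not the article's implementation and not SGLang's or vLLM's actual cache: the class `ToyPrefixCache`, its `block_size` parameter, and the token values are all hypothetical, and the cached "KV state" is a placeholder string. It assumes a cache keyed by block-aligned token prefixes, so a request that shares a leading prompt with an earlier one only pays prefill cost for its new tokens.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrefixCache:
    """Toy block-level prefix cache (illustrative only, not a real engine API)."""
    block_size: int = 4
    # Maps a block-aligned token prefix to a placeholder for its cached KV state.
    blocks: dict = field(default_factory=dict)

    def match(self, tokens):
        """Return how many leading tokens are covered by cached blocks."""
        matched = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if tuple(tokens[:end]) not in self.blocks:
                break
            matched = end
        return matched

    def insert(self, tokens):
        """Record every full block-aligned prefix of `tokens` as cached."""
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self.blocks.setdefault(tuple(tokens[:end]), f"kv[0:{end}]")

    def prefill(self, tokens):
        """Simulate prefill and return the number of tokens that must be computed."""
        reused = self.match(tokens)
        self.insert(tokens)
        return len(tokens) - reused


cache = ToyPrefixCache(block_size=4)
system_prompt = list(range(12))                  # hypothetical shared 12-token prefix
request_a = system_prompt + [100, 101, 102, 103]
request_b = system_prompt + [200, 201, 202, 203]

cost_a = cache.prefill(request_a)  # cold cache: every token is computed
cost_b = cache.prefill(request_b)  # warm cache: only the unshared suffix is computed
print(cost_a, cost_b)
```

The token counts here are made up for the sketch and say nothing about the savings reported in the article. Production engines typically store real key/value tensors, evict entries under memory pressure, and use more compact keys than full token tuples.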

TOOL · CL_73591 · Jun 5 · 15:09

InferBench app simplifies local LLM performance testing

A new open-source desktop application called InferBench has been released to help users determine which large language models (LLMs) can run on their local GPUs and at what speed. The tool automates the process of downl…

COMMENTARY · CL_72336 · Jun 5 · 03:48

Kimi-K2.6 performance on 8x B200 GPUs queried

A user on Reddit is seeking performance estimates for running the Kimi-K2.6 model on an 8x NVIDIA B200 GPU setup. They are specifically interested in throughput figures for long input and output sequences with a concurr…

RESEARCH · CL_74208 · Jun 4 · 08:47

QCFuse speeds up RAG serving with novel cache fusion technique

Researchers have developed QCFuse, a novel method to optimize Retrieval-Augmented Generation (RAG) serving efficiency. This technique addresses the high cost associated with processing retrieved contexts in LLMs by inte…

FRONTIER RELEASE · CL_69458 · Jun 3 · 18:46

Google DeepMind releases multimodal Gemma 4 12B for laptops

Google DeepMind has released Gemma 4 12B, an open-source multimodal AI model capable of processing text, images, audio, and video natively. This model is designed to run on consumer laptops with as little as 16 GB of RA…

SIGNIFICANT · CL_76734 · Jun 3 · 03:15

Nex-AGI releases open-source agentic model Nex-N2

Nex-AGI has released and open-sourced its new agentic model, Nex-N2, designed for real-world productivity tasks. This model boasts advanced coding and agentic capabilities, enabling it to handle complex, long-horizon ta…

TOOL · CL_60458 · May 30 · 00:00

Blackwell GPUs show 61% performance drop on Qwen3.5 model

A performance analysis by SemiAnalysis indicates that NVIDIA's Blackwell GPUs exhibit a 61% performance regression when running the Qwen3.5 397B model on SGLang, due to unsupported NVLink multicast for confidential computin…

TOOL · CL_74159 · May 29 · 12:37

Hcompany releases Holo-3.1-4B vision-language model

Hcompany has released Holo-3.1-4B, a new vision-language model designed for computer use agents. This model expands capabilities beyond desktop automation to include mobile environments and offers native function-callin…

RESEARCH · CL_64768 · May 29 · 09:11

Unsloth releases optimized Gemma 4 models for local use

Unsloth has released several quantized versions of the Gemma 4 model, optimized for efficient local execution. These models, including `gemma-4-12B-it-qat-GGUF` and `gemma-4-12b-it-GGUF`, are available on Hugging Face. …

MEME · CL_53447 · May 27 · 01:53

User seeks advice on local LLM coding setup with new hardware

A user on the r/LocalLLaMA subreddit is seeking advice on setting up a local coding environment. They have a new PC with an RTX 3090 GPU, an Intel Core i9 Ultra CPU, and 32GB of RAM. The user is asking for recommenda…

TOOL · CL_52595 · May 26 · 14:34

Harbor v0.4.19 launches local coding agents with integrated LLM gateway

Harbor has released version 0.4.19, introducing enhanced capabilities for launching local agentic coding tools. This update allows users to integrate various local inference backends like vLLM, SGLang, and llama.cpp. Ad…

RESEARCH · CL_64767 · May 26 · 09:09

JetBrains releases Mellum2 reasoning model with 131K context

JetBrains has released its Mellum2 model family, including the Mellum2-12B-A2.5B-Thinking variant, which is designed for complex reasoning tasks. This model utilizes a Mixture-of-Experts architecture with a large contex…

TOOL · CL_50813 · May 26 · 04:00

New method speeds up RLHF training with adaptive parallelism

Researchers have developed a new method called PAT to accelerate the training of Reinforcement Learning from Human Feedback (RLHF) models. This technique dynamically adjusts tensor parallelism during the generation stag…

FRONTIER RELEASE · CL_57657 · May 24 · 22:16

Liquid AI ships LFM2.5-8B-A1B on-device MoE model

Liquid AI has released LFM2.5-8B-A1B, a new on-device Mixture-of-Experts (MoE) model designed for complex tasks and tool chaining. This model features 8.3 billion total parameters but activates only 1.5 billion per toke…

FRONTIER RELEASE · CL_58091 · May 23 · 02:13

Stepfun AI releases 198B parameter multimodal MoE model

Stepfun AI has released Step 3.7 Flash, a 198-billion parameter sparse Mixture-of-Experts (MoE) vision-language model. This model is optimized for agentic workflows, coding, and multimodal tasks, activating approximatel…

TOOL · CL_44370 · May 22 · 16:01

Modal achieves serverless GPUs for AI inference in seconds

Modal has developed a system to achieve truly serverless GPUs for AI inference, addressing the challenge of rapidly scaling resources to meet variable demand. Their approach involves maintaining cloud buffers of idle GP…

RESEARCH · CL_48751 · May 22 · 00:00

LLMs and new frameworks boost GPU kernel optimization

Researchers are exploring novel ways to optimize GPU kernel performance for large language models. One approach uses language models as surrogates to predict kernel performance, significantly increasing the number of ca…

SIGNIFICANT · CL_49676 · May 21 · 07:27

OpenBMB releases MiniCPM5-1B for on-device AI tasks

OpenBMB has released MiniCPM5-1B, a 1-billion parameter Transformer model designed for on-device and resource-constrained environments. OpenBMB claims state-of-the-art performance for the model within its size class, particularly …
