PulseAugur

SGLang

PulseAugur coverage of SGLang: every cluster mentioning SGLang across labs, papers, and developer communities, ranked by signal.

Total · 30d: 49 (49 over 90d)
Releases · 30d: 0 (0 over 90d)
Papers · 30d: 15 (15 over 90d)
TIMELINE
  1. 2026-01-09 · product_launch · SGLang released version 0.3.1 of its model gateway, featuring performance and memory improvements.
SENTIMENT · 30D

19 days with sentiment data

RECENT · PAGE 1/3 · 49 TOTAL
  1. TOOL · CL_80109

    BlendServe system boosts LLM offline inference throughput

    Researchers have developed BlendServe, a new system designed to optimize offline inference for auto-regressive large language models. BlendServe combines resource overlapping and prefix sharing techniques to maximize th…

  2. TOOL · CL_78725

    LLM Inference Handbook Explains Token Generation and Optimization

    This handbook delves into the engineering discipline of Large Language Model (LLM) inference, explaining how models generate tokens and the advanced optimization techniques used in production systems. It covers fundamen…

  3. TOOL · CL_75555

    Prefix Caching Slashes LLM Prefill Costs by 80%

    A new technical article explores prefix caching as a method to significantly reduce the computational cost of processing long prompts in large language models. This technique is particularly effective for workloads like…

  4. TOOL · CL_73591

    InferBench app simplifies local LLM performance testing

    A new open-source desktop application called InferBench has been released to help users determine which large language models (LLMs) can run on their local GPUs and at what speed. The tool automates the process of downl…

  5. COMMENTARY · CL_72336

    Kimi-K2.6 performance on 8x B200 GPUs queried

    A user on Reddit is seeking performance estimates for running the Kimi-K2.6 model on an 8x NVIDIA B200 GPU setup. They are specifically interested in throughput figures for long input and output sequences with a concurr…

  6. RESEARCH · CL_74208

    QCFuse speeds up RAG serving with novel cache fusion technique

    Researchers have developed QCFuse, a novel method to optimize Retrieval-Augmented Generation (RAG) serving efficiency. This technique addresses the high cost associated with processing retrieved contexts in LLMs by inte…

  7. FRONTIER RELEASE · CL_69458

    Google DeepMind releases multimodal Gemma 4 12B for laptops

    Google DeepMind has released Gemma 4 12B, an open-source multimodal AI model capable of processing text, images, audio, and video natively. This model is designed to run on consumer laptops with as little as 16 GB of RA…

  8. SIGNIFICANT · CL_76734

    Nex-AGI releases open-source agentic model Nex-N2

    Nex-AGI has released and open-sourced its new agentic model, Nex-N2, designed for real-world productivity tasks. This model boasts advanced coding and agentic capabilities, enabling it to handle complex, long-horizon ta…

  9. TOOL · CL_60458

    Blackwell GPUs show 61% performance drop on Qwen3.5 model

    A performance analysis by SemiAnalysis indicates that NVIDIA's Blackwell GPUs exhibit a 61% performance regression when serving the Qwen3.5 397B model with SGLang, due to unsupported NVLink multicast for confidential computin…

  10. TOOL · CL_74159

    Hcompany releases Holo-3.1-4B vision-language model

    Hcompany has released Holo-3.1-4B, a new vision-language model designed for computer use agents. This model expands capabilities beyond desktop automation to include mobile environments and offers native function-callin…

  11. RESEARCH · CL_64768

    Unsloth releases optimized Gemma 4 models for local use

    Unsloth has released several quantized versions of the Gemma 4 model, optimized for efficient local execution. These models, including `gemma-4-12B-it-qat-GGUF` and `gemma-4-12b-it-GGUF`, are available on Hugging Face. …

  12. MEME · CL_53447

    User seeks advice on local LLM coding setup with new hardware

    A user on the r/LocalLLaMA subreddit is seeking advice on setting up a local coding environment. They have a new PC with an RTX 3090 GPU, an Intel Core i9 Ultra CPU, and 32 GB of RAM. The user is asking for recommenda…

  13. TOOL · CL_52595

    Harbor v0.4.19 launches local coding agents with integrated LLM gateway

    Harbor has released version 0.4.19, introducing enhanced capabilities for launching local agentic coding tools. This update allows users to integrate various local inference backends like vLLM, SGLang, and llama.cpp. Ad…

  14. RESEARCH · CL_64767

    JetBrains releases Mellum2 reasoning model with 131K context

    JetBrains has released its Mellum2 model family, including the Mellum2-12B-A2.5B-Thinking variant, which is designed for complex reasoning tasks. This model utilizes a Mixture-of-Experts architecture with a large contex…

  15. TOOL · CL_50813

    New method speeds up RLHF training with adaptive parallelism

    Researchers have developed a new method called PAT to accelerate the training of Reinforcement Learning from Human Feedback (RLHF) models. This technique dynamically adjusts tensor parallelism during the generation stag…

  16. FRONTIER RELEASE · CL_57657

    Liquid AI ships LFM2.5-8B-A1B on-device MoE model

    Liquid AI has released LFM2.5-8B-A1B, a new on-device Mixture-of-Experts (MoE) model designed for complex tasks and tool chaining. This model features 8.3 billion total parameters but activates only 1.5 billion per toke…

  17. FRONTIER RELEASE · CL_58091

    Stepfun AI releases 198B-parameter multimodal MoE model

    Stepfun AI has released Step 3.7 Flash, a 198-billion-parameter sparse Mixture-of-Experts (MoE) vision-language model. This model is optimized for agentic workflows, coding, and multimodal tasks, activating approximatel…

  18. TOOL · CL_44370

    Modal achieves serverless GPUs for AI inference in seconds

    Modal has developed a system to achieve truly serverless GPUs for AI inference, addressing the challenge of rapidly scaling resources to meet variable demand. Their approach involves maintaining cloud buffers of idle GP…

  19. RESEARCH · CL_48751

    LLMs and new frameworks boost GPU kernel optimization

    Researchers are exploring novel ways to optimize GPU kernel performance for large language models. One approach uses language models as surrogates to predict kernel performance, significantly increasing the number of ca…

  20. SIGNIFICANT · CL_49676

    OpenBMB releases MiniCPM5-1B for on-device AI tasks

    OpenBMB has released MiniCPM5-1B, a 1-billion-parameter Transformer model designed for on-device and resource-constrained environments. This model claims state-of-the-art performance within its size class, particularly …