PulseAugur

New MoE inference design uses pooled HBM to cut communication latency on Ascend

Researchers have developed a new communication design for Mixture-of-Experts (MoE) inference on Ascend systems, aiming to reduce bottlenecks in token exchange. The approach eliminates intermediate relay and reordering buffers by writing data directly into destination expert windows and reading directly from remote ones. It leverages globally pooled high-bandwidth memory (HBM) and symmetric memory allocation, improving time to first token while keeping time per output token competitive for MoE workloads.
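The direct-placement idea in the summary can be sketched in a few lines. This is an illustrative model only, not the paper's implementation: `pooled`, `dispatch_direct`, and the shared slot counters are hypothetical names, and the sketch assumes symmetric allocation means every rank agrees on each expert's window layout, so a sender can compute the destination slot itself instead of routing tokens through a relay buffer and a reordering pass.

```python
# Sketch of relay-buffer-free MoE dispatch over pooled memory.
# All names here are hypothetical; real systems would compute write
# offsets per rank (e.g., via prefix sums over routing counts).
from collections import defaultdict

NUM_EXPERTS = 4
CAPACITY = 8  # per-expert window slots, identical (symmetric) on all ranks

# Globally pooled HBM modeled as one shared table: expert -> fixed window.
pooled = {e: [None] * CAPACITY for e in range(NUM_EXPERTS)}
# Agreed-upon next free slot per expert window (stands in for the
# offsets symmetric allocation lets every sender derive locally).
next_slot = defaultdict(int)

def dispatch_direct(rank, tokens, routing):
    """Write each token straight into its destination expert's window,
    skipping any intermediate relay or reordering buffer."""
    for tok, expert in zip(tokens, routing):
        slot = next_slot[expert]          # precomputed landing offset
        pooled[expert][slot] = (rank, tok)
        next_slot[expert] += 1

# Two ranks dispatch; no relay hop, no post-hoc reordering.
dispatch_direct(rank=0, tokens=["t0", "t1"], routing=[2, 0])
dispatch_direct(rank=1, tokens=["t2"], routing=[2])

# Expert 2 reads its tokens directly from the pooled window.
print(pooled[2][:2])  # -> [(0, 't0'), (1, 't2')]
```

The point of the sketch is the data path: tokens land in their final expert-ordered position on the first write, which is what removes the layout-transformation and relay-buffer steps the abstract identifies as bottlenecks.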

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT This research could lead to more efficient inference for large MoE models on specific hardware platforms.

RANK_REASON This is a research paper detailing a novel technical approach for optimizing MoE inference.

Read on arXiv cs.LG →

COVERAGE [1]

  1. arXiv cs.LG TIER_1 · Tianlun Hu, Tiancheng Hu, Shengsheng Litang, Sheng Wang, Xiaoming Bao, Yuxing Li, Wei Wang, Zhongzhe Hu, Lijun Li, Hongwei Sun, Jingbin Zhou

    Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend

    arXiv:2605.06055v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, tempor…