New MoE Architectures Enhance Efficiency and Performance
By PulseAugur Editorial
Summary by gemini-2.5-flash-lite
from 17 sources
Researchers are developing advanced techniques to improve Mixture-of-Experts (MoE) models, particularly addressing challenges in domain transitions and inference efficiency. One approach, inspired by the Free Energy Principle and spiking neural networks, introduces temporal memory and anticipatory routing to significantly enhance expert selection during domain shifts. Other efforts focus on optimizing MoE inference through runtime-aware dispatch frameworks and novel kernel configurations to maximize throughput. Additionally, new methods are being explored to manage heterogeneous expert sizes and preserve knowledge from less frequently used experts during fine-tuning, aiming for better performance and resource utilization.
arXiv:2605.06665v1 Announce Type: new Abstract: Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert c…
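The depth-coupling this abstract criticizes can be made concrete with a toy parameter count. The sketch below is illustrative only (the function names and numbers are not from the paper): under the conventional per-layer rule, expert parameters grow linearly with depth, whereas a shared expert pool would decouple the two.

```python
# Toy illustration of the per-layer expert rule described above.
# All names and numbers here are hypothetical, not from the paper.

def expert_params_per_layer_rule(num_layers: int, experts_per_layer: int,
                                 params_per_expert: int) -> int:
    """Total expert parameters when every layer owns a separate expert set."""
    return num_layers * experts_per_layer * params_per_expert

def expert_params_shared_pool(pool_size: int, params_per_expert: int) -> int:
    """Total expert parameters when all layers route into one shared pool."""
    return pool_size * params_per_expert

# Doubling depth doubles expert parameters under the per-layer rule...
print(expert_params_per_layer_rule(24, 8, 1_000_000))  # 192000000
print(expert_params_per_layer_rule(48, 8, 1_000_000))  # 384000000
# ...while a shared pool keeps expert-parameter count independent of depth.
print(expert_params_shared_pool(8, 1_000_000))         # 8000000
```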
arXiv cs.LG
Omkar B Shende, Marcello Traiola, Gayathri Ananthanarayanan
arXiv:2605.04754v1 Announce Type: new Abstract: Deep neural network (DNN) inference at the edge demands simultaneous improvements in accuracy, computational efficiency, and energy consumption. Approximate computing and Mixture-of-Experts (MoE) architectures have each been studied…
arXiv:2605.02124v1 Announce Type: new Abstract: Softmax-routed mixture-of-experts models approach hard routing as the temperature tends to zero, but this limit is singular near routing ties. This paper studies that singularity at the population level for squared-loss MoE regressi…
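The temperature limit and the tie singularity this abstract studies are easy to see numerically. A minimal sketch, assuming a standard temperature-scaled softmax router (this is the generic construction, not the paper's analysis): away from a tie the routing sharpens to a hard one-hot choice as the temperature shrinks, but at an exact tie no temperature breaks the 50/50 split.

```python
import math

def softmax_router(scores, temperature):
    """Temperature-scaled softmax over expert affinity scores."""
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Away from a tie, routing approaches a hard (one-hot) choice as t -> 0:
for t in (1.0, 0.1, 0.01):
    print(softmax_router([1.0, 0.5], t))  # first weight tends to 1

# At an exact tie the limit is singular: every temperature gives 0.5/0.5,
# so arbitrarily small score perturbations flip the hard-routing outcome.
for t in (1.0, 0.1, 0.01):
    print(softmax_router([1.0, 1.0], t))
```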
arXiv:2605.00604v1 Announce Type: new Abstract: Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition…
arXiv:2604.26039v1 Announce Type: cross Abstract: The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unr…
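The gap this abstract points at is that kernel choice is usually keyed on batch size alone, while the routing distribution also matters. A hedged sketch of what routing-aware dispatch could look like; the config names, thresholds, and skew metric below are invented for illustration and are not the paper's framework.

```python
# Hypothetical runtime-aware dispatch: condition kernel choice on both
# batch size and the observed expert routing distribution, instead of
# batch size alone. All names and thresholds are illustrative.

def routing_skew(expert_counts):
    """Fraction of tokens absorbed by the single busiest expert."""
    total = sum(expert_counts)
    return max(expert_counts) / total if total else 0.0

def pick_kernel_config(batch_size, expert_counts):
    skew = routing_skew(expert_counts)
    if skew > 0.5:
        # A few hot experts: a large dense GEMM per hot expert amortises cost.
        return "dense-per-expert"
    if batch_size < 32:
        return "small-batch-fused"
    return "grouped-gemm"

print(pick_kernel_config(8, [2, 2, 2, 2]))         # small-batch-fused
print(pick_kernel_config(256, [200, 20, 20, 16]))  # dense-per-expert
print(pick_kernel_config(256, [70, 60, 66, 60]))   # grouped-gemm
```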
arXiv cs.CL
Zhicheng Ma, Xiang Liu, Zhaoxiang Liu, Ning Wang, Yi Shen, Kai Wang, Shuming Shi, Shiguo Lian
arXiv:2604.23108v1 Announce Type: new Abstract: Large Language Models (LLMs) based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes, creating a rigidity that f…
arXiv cs.LG
Abhimanyu Bambhaniya, Geonhwa Jeong, Jason Park, Jiecao Yu, Jaewon Lee, Pengchao Wang, Changkyu Kim, Chunqiang Tang, Tushar Krishna
arXiv:2604.23150v1 Announce Type: new Abstract: Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs.…
arXiv:2512.03915v3 Announce Type: replace-cross Abstract: In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minim…
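The load-balancing objective mentioned here is commonly enforced with an auxiliary loss. The sketch below implements the widely used Switch-Transformer-style form, not necessarily the specific method of the paper above: L_aux = N · Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and Pᵢ its mean router probability mass. The loss equals 1.0 under perfectly uniform routing and grows as routing concentrates.

```python
# Standard auxiliary load-balancing loss (Switch Transformer style),
# shown for illustration; the paper above may use a different objective.
# L_aux = N * sum_i f_i * P_i  (minimised, value 1.0, at uniform routing).

def load_balance_loss(dispatch_fractions, mean_probs):
    n = len(dispatch_fractions)
    return n * sum(f * p for f, p in zip(dispatch_fractions, mean_probs))

print(load_balance_loss([0.25] * 4, [0.25] * 4))                      # 1.0
print(load_balance_loss([0.7, 0.1, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1]))  # ~2.08
```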
arXiv cs.CL
Haoze He, Xingyuan Ding, Xuan Jiang, Xinkai Zou, Alex Cheng, Yibo Zhao, Juncheng Billy Li, Heather Miller
arXiv:2604.23036v1 Announce Type: cross Abstract: Despite MoE models leading many benchmarks, supervised fine-tuning (SFT) for the MoE architectures remains difficult because its router layers are fragile. Methods such as DenseMixer and ESFT mitigate router collapse with dense mi…
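One blunt mitigation for the router fragility described here is simply to exclude router parameters from fine-tuning updates. The sketch below is a framework-agnostic illustration of that idea only; it is not the mechanism of DenseMixer or ESFT, and parameters are modelled as a plain name-to-dict mapping (in a real framework this would toggle a requires-grad flag instead).

```python
# Hedged sketch: freeze router parameters during SFT so only expert and
# attention weights are updated. Generic pattern, not DenseMixer/ESFT.

def freeze_router_params(named_params, router_keyword="router"):
    """Mark every parameter whose name mentions the router as non-trainable."""
    frozen = []
    for name, param in named_params.items():
        if router_keyword in name:
            param["trainable"] = False
            frozen.append(name)
    return frozen

params = {
    "layers.0.moe.router.weight": {"trainable": True},
    "layers.0.moe.experts.0.w1": {"trainable": True},
    "layers.0.moe.experts.1.w1": {"trainable": True},
}
print(freeze_router_params(params))  # ['layers.0.moe.router.weight']
```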
arXiv:2505.17639v3 Announce Type: replace Abstract: Mixture-of-Experts (MoE) models offer dynamic computation, but are typically deployed as static full-capacity models, missing opportunities for deployment-specific specialization. We introduce PreMoE, a training-free framework t…
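A training-free specialization of this kind can be sketched very simply: profile how often each expert is selected on a small probe set from the target deployment, then keep only the most-used experts. The code below is an illustrative stand-in for that general idea, not PreMoE's actual selection criterion.

```python
# Illustrative training-free expert pruning: keep the top-m experts by
# routing frequency on a probe set. Not PreMoE's actual method.

from collections import Counter

def select_experts(routing_trace, keep_m):
    """routing_trace: list of expert ids chosen for probe tokens."""
    counts = Counter(routing_trace)
    return sorted(e for e, _ in counts.most_common(keep_m))

trace = [0, 2, 2, 1, 2, 0, 2, 3, 2, 2]
print(select_experts(trace, keep_m=2))  # [0, 2]
```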
arXiv:2604.27892v1 Announce Type: new Abstract: The rapidly expanding artificial intelligence (AI) industry has produced diverse yet powerful prediction tools, each with its own network architecture, training strategy, data-processing pipeline, and domain-specific strengths. These tools create new opportunities for semi-superv…
Mixture-of-experts models provide a flexible framework for learning complex probabilistic input-output relationships by combining multiple expert models through an input-dependent gating mechanism. These models have become increasingly prominent in modern machine learning, yet th…
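The classical construction this abstract refers to combines expert predictions through an input-dependent softmax gate: y(x) = Σₖ gₖ(x)·fₖ(x). A minimal self-contained sketch, with experts and gating scores chosen arbitrarily for illustration:

```python
import math

# Minimal dense mixture-of-experts: an input-dependent softmax gate
# combines expert outputs. Experts and gate parameters are illustrative.

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def moe_predict(x, gate_weights, experts):
    """y(x) = sum_k g_k(x) * f_k(x), with g = softmax of affine gate scores."""
    gates = softmax([w * x + b for (w, b) in gate_weights])
    outputs = [f(x) for f in experts]
    return sum(g * y for g, y in zip(gates, outputs))

experts = [lambda x: 2.0 * x, lambda x: -x + 1.0]
gate_weights = [(1.0, 0.0), (-1.0, 0.0)]  # expert 0 favoured for large x
print(moe_predict(5.0, gate_weights, experts))  # close to 10.0 (expert 0)
```

For x = 5 the gate puts almost all mass on the first expert, so the prediction is close to 2·5 = 10; for strongly negative x the second expert would dominate instead.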