New MoE Architectures Enhance Efficiency and Performance
By PulseAugur Editorial
Summary by gemini-2.5-flash-lite
from 17 sources
Researchers are developing advanced techniques to improve Mixture-of-Experts (MoE) models, particularly addressing challenges in domain transitions and inference efficiency. One approach, inspired by the Free Energy Principle and spiking neural networks, introduces temporal memory and anticipatory routing to significantly enhance expert selection during domain shifts. Other efforts focus on optimizing MoE inference through runtime-aware dispatch frameworks and novel kernel configurations to maximize throughput. Additionally, new methods are being explored to manage heterogeneous expert sizes and preserve knowledge from less frequently used experts during fine-tuning, aiming for better performance and resource utilization.
arXiv:2605.06665v1 Announce Type: new Abstract: Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert c…
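The depth-coupling this abstract criticizes can be made concrete with a toy parameter count. The sketch below is illustrative only (the function names and numbers are not from the paper): under the conventional per-layer rule, expert parameters grow linearly with depth, whereas a shared expert pool would decouple the two.

```python
# Toy illustration of the per-layer expert rule described above.
# All names and numbers here are hypothetical, not from the paper.

def expert_params_per_layer_rule(num_layers: int, experts_per_layer: int,
                                 params_per_expert: int) -> int:
    """Total expert parameters when every layer owns a separate expert set."""
    return num_layers * experts_per_layer * params_per_expert

def expert_params_shared_pool(pool_size: int, params_per_expert: int) -> int:
    """Total expert parameters when all layers route into one shared pool."""
    return pool_size * params_per_expert

# Doubling depth doubles expert parameters under the per-layer rule...
print(expert_params_per_layer_rule(24, 8, 1_000_000))  # 192000000
print(expert_params_per_layer_rule(48, 8, 1_000_000))  # 384000000
# ...while a shared pool keeps expert-parameter count independent of depth.
print(expert_params_shared_pool(8, 1_000_000))         # 8000000
```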
arXiv cs.LG
Omkar B Shende, Marcello Traiola, Gayathri Ananthanarayanan
arXiv:2605.04754v1 Announce Type: new Abstract: Deep neural network (DNN) inference at the edge demands simultaneous improvements in accuracy, computational efficiency, and energy consumption. Approximate computing and Mixture-of-Experts (MoE) architectures have each been studied…
arXiv:2605.02124v1 Announce Type: new Abstract: Softmax-routed mixture-of-experts models approach hard routing as the temperature tends to zero, but this limit is singular near routing ties. This paper studies that singularity at the population level for squared-loss MoE regressi…
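The temperature limit and the tie singularity this abstract studies are easy to see numerically. A minimal sketch, assuming a standard temperature-scaled softmax router (this is the generic construction, not the paper's analysis): away from a tie the routing sharpens to a hard one-hot choice as the temperature shrinks, but at an exact tie no temperature breaks the 50/50 split.

```python
import math

def softmax_router(scores, temperature):
    """Temperature-scaled softmax over expert affinity scores."""
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Away from a tie, routing approaches a hard (one-hot) choice as t -> 0:
for t in (1.0, 0.1, 0.01):
    print(softmax_router([1.0, 0.5], t))  # first weight tends to 1

# At an exact tie the limit is singular: every temperature gives 0.5/0.5,
# so arbitrarily small score perturbations flip the hard-routing outcome.
for t in (1.0, 0.1, 0.01):
    print(softmax_router([1.0, 1.0], t))
```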
arXiv:2605.00604v1 Announce Type: new Abstract: Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition…
arXiv:2604.26039v1 Announce Type: cross Abstract: The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unr…
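The gap this abstract points at is that kernel choice is usually keyed on batch size alone, while the routing distribution also matters. A hedged sketch of what routing-aware dispatch could look like; the config names, thresholds, and skew metric below are invented for illustration and are not the paper's framework.

```python
# Hypothetical runtime-aware dispatch: condition kernel choice on both
# batch size and the observed expert routing distribution, instead of
# batch size alone. All names and thresholds are illustrative.

def routing_skew(expert_counts):
    """Fraction of tokens absorbed by the single busiest expert."""
    total = sum(expert_counts)
    return max(expert_counts) / total if total else 0.0

def pick_kernel_config(batch_size, expert_counts):
    skew = routing_skew(expert_counts)
    if skew > 0.5:
        # A few hot experts: a large dense GEMM per hot expert amortises cost.
        return "dense-per-expert"
    if batch_size < 32:
        return "small-batch-fused"
    return "grouped-gemm"

print(pick_kernel_config(8, [2, 2, 2, 2]))         # small-batch-fused
print(pick_kernel_config(256, [200, 20, 20, 16]))  # dense-per-expert
print(pick_kernel_config(256, [70, 60, 66, 60]))   # grouped-gemm
```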
arXiv cs.CL
Zhicheng Ma, Xiang Liu, Zhaoxiang Liu, Ning Wang, Yi Shen, Kai Wang, Shuming Shi, Shiguo Lian
arXiv:2604.23108v1 Announce Type: new Abstract: Large Language Models (LLMs) based on Mixture-of-Experts (MoE) are pivotal in industrial applications for their ability to scale performance efficiently. However, standard MoEs enforce uniform expert sizes, creating a rigidity that f…
arXiv cs.LG
Abhimanyu Bambhaniya, Geonhwa Jeong, Jason Park, Jiecao Yu, Jaewon Lee, Pengchao Wang, Changkyu Kim, Chunqiang Tang, Tushar Krishna
arXiv:2604.23150v1 Announce Type: new Abstract: Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without proportional per-token compute, enabling higher-quality outputs at manageable serving costs.…
arXiv:2512.03915v3 Announce Type: replace-cross Abstract: In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minim…
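The load-balancing objective mentioned here is commonly enforced with an auxiliary loss. The sketch below implements the widely used Switch-Transformer-style form, not necessarily the specific method of the paper above: L_aux = N · Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens dispatched to expert i and Pᵢ its mean router probability mass. The loss equals 1.0 under perfectly uniform routing and grows as routing concentrates.

```python
# Standard auxiliary load-balancing loss (Switch Transformer style),
# shown for illustration; the paper above may use a different objective.
# L_aux = N * sum_i f_i * P_i  (minimised, value 1.0, at uniform routing).

def load_balance_loss(dispatch_fractions, mean_probs):
    n = len(dispatch_fractions)
    return n * sum(f * p for f, p in zip(dispatch_fractions, mean_probs))

print(load_balance_loss([0.25] * 4, [0.25] * 4))                      # 1.0
print(load_balance_loss([0.7, 0.1, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1]))  # ~2.08
```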
arXiv cs.CL
Haoze He, Xingyuan Ding, Xuan Jiang, Xinkai Zou, Alex Cheng, Yibo Zhao, Juncheng Billy Li, Heather Miller
arXiv:2604.23036v1 Announce Type: cross Abstract: Despite MoE models leading many benchmarks, supervised fine-tuning (SFT) for the MoE architectures remains difficult because its router layers are fragile. Methods such as DenseMixer and ESFT mitigate router collapse with dense mi…
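One blunt mitigation for the router fragility described here is simply to exclude router parameters from fine-tuning updates. The sketch below is a framework-agnostic illustration of that idea only; it is not the mechanism of DenseMixer or ESFT, and parameters are modelled as a plain name-to-dict mapping (in a real framework this would toggle a requires-grad flag instead).

```python
# Hedged sketch: freeze router parameters during SFT so only expert and
# attention weights are updated. Generic pattern, not DenseMixer/ESFT.

def freeze_router_params(named_params, router_keyword="router"):
    """Mark every parameter whose name mentions the router as non-trainable."""
    frozen = []
    for name, param in named_params.items():
        if router_keyword in name:
            param["trainable"] = False
            frozen.append(name)
    return frozen

params = {
    "layers.0.moe.router.weight": {"trainable": True},
    "layers.0.moe.experts.0.w1": {"trainable": True},
    "layers.0.moe.experts.1.w1": {"trainable": True},
}
print(freeze_router_params(params))  # ['layers.0.moe.router.weight']
```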
arXiv:2505.17639v3 Announce Type: replace Abstract: Mixture-of-Experts (MoE) models offer dynamic computation, but are typically deployed as static full-capacity models, missing opportunities for deployment-specific specialization. We introduce PreMoE, a training-free framework t…
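A training-free specialization of this kind can be sketched very simply: profile how often each expert is selected on a small probe set from the target deployment, then keep only the most-used experts. The code below is an illustrative stand-in for that general idea, not PreMoE's actual selection criterion.

```python
# Illustrative training-free expert pruning: keep the top-m experts by
# routing frequency on a probe set. Not PreMoE's actual method.

from collections import Counter

def select_experts(routing_trace, keep_m):
    """routing_trace: list of expert ids chosen for probe tokens."""
    counts = Counter(routing_trace)
    return sorted(e for e, _ in counts.most_common(keep_m))

trace = [0, 2, 2, 1, 2, 0, 2, 3, 2, 2]
print(select_experts(trace, keep_m=2))  # [0, 2]
```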
arXiv:2604.27892v1 Announce Type: new Abstract: The rapidly expanding artificial intelligence (AI) industry has produced diverse yet powerful prediction tools, each with its own network architecture, training strategy, data-processing pipeline, and domain-specific strengths. These tools create new opportunities for semi-superv…
Mixture-of-experts models provide a flexible framework for learning complex probabilistic input-output relationships by combining multiple expert models through an input-dependent gating mechanism. These models have become increasingly prominent in modern machine learning, yet th…
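The classical construction this abstract refers to combines expert predictions through an input-dependent softmax gate: y(x) = Σₖ gₖ(x)·fₖ(x). A minimal self-contained sketch, with experts and gating scores chosen arbitrarily for illustration:

```python
import math

# Minimal dense mixture-of-experts: an input-dependent softmax gate
# combines expert outputs. Experts and gate parameters are illustrative.

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def moe_predict(x, gate_weights, experts):
    """y(x) = sum_k g_k(x) * f_k(x), with g = softmax of affine gate scores."""
    gates = softmax([w * x + b for (w, b) in gate_weights])
    outputs = [f(x) for f in experts]
    return sum(g * y for g, y in zip(gates, outputs))

experts = [lambda x: 2.0 * x, lambda x: -x + 1.0]
gate_weights = [(1.0, 0.0), (-1.0, 0.0)]  # expert 0 favoured for large x
print(moe_predict(5.0, gate_weights, experts))  # close to 10.0 (expert 0)
```

For x = 5 the gate puts almost all mass on the first expert, so the prediction is close to 2·5 = 10; for strongly negative x the second expert would dominate instead.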