Robots Atlas

Mixture of Experts

Mixture of Experts introduces conditional computation, where only a subset of specialized sub-networks (experts) is activated per input example via a gating network, enabling model capacity to scale without a proportional increase in compute cost.

Category
Abstraction level
Operation level
01

Expert Networks

Collection of N parallel sub-networks (experts), each specializing in a distinct subset of the input space; in the Transformer context, experts are typically feed-forward networks (FFNs).

Modular

A set of N parallel sub-networks, each independently parameterized. In the Transformer context, experts are typically feed-forward networks (FFN) with identical architecture but separate weight matrices. Each expert learns to specialize on a different subset of the input distribution as a result of competitive routing.

FFN experts · Shared expert
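A minimal sketch of such an expert bank, assuming hypothetical dimensions d_model and d_ff; each expert is an ordinary two-layer Transformer FFN with its own weights.

```python
import torch
import torch.nn as nn

class FFNExpert(nn.Module):
    """One expert: a standard two-layer feed-forward block (illustrative sketch)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# N = 8 independently parameterized experts with identical architecture
experts = nn.ModuleList(FFNExpert(d_model=512, d_ff=2048) for _ in range(8))
```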
02

Gating / Router Network

Gating network trained jointly with the experts; produces weights or top-k expert selections for each input token / example.

Modular

A trainable network (typically a linear projection followed by softmax) that computes a score for each expert given the current input token. In sparse MoE, only the top-k experts by score are activated; in soft MoE, all experts are weighted and summed. The router parameters are optimized jointly with the expert parameters via gradient descent.

Noisy top-k gating · Softmax router
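A minimal sketch of a softmax router with top-k selection, assuming hypothetical dimensions; in sparse MoE only the k selected experts run, and their renormalized scores weight the combined output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Linear projection + softmax over experts; keeps only the top-k per token (sketch)."""
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.proj = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        probs = F.softmax(self.proj(x), dim=-1)             # (tokens, num_experts)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)   # (tokens, k)
        # Renormalize the k kept weights so they sum to 1 per token.
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        return topk_probs, topk_idx, probs                  # full probs kept for aux losses

weights, indices, full_probs = TopKRouter(512, 8, k=2)(torch.randn(16, 512))
```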
03

Load Balancing Mechanism

Auxiliary loss term added to the primary training loss, penalizing uneven token distribution across experts.

Modular

An auxiliary loss term added to the training objective that measures the imbalance of token routing across experts and penalizes skewed distributions. Without this mechanism, the router tends to collapse onto a small number of dominant experts through a self-reinforcing feedback loop. The specific formulation varies across implementations (importance loss + load loss in Shazeer et al. 2017; simplified scalar auxiliary loss in Switch Transformer).
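A sketch of the Switch-Transformer-style formulation, alpha · N · Σ_i f_i · P_i, where f_i is the realized fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i; function and argument names here are illustrative.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_index, num_experts, alpha=1e-2):
    """Switch-style auxiliary loss sketch: alpha * N * sum_i(f_i * P_i).

    router_probs: (tokens, num_experts) softmax outputs of the router
    expert_index: (tokens,) index of the expert each token was dispatched to (top-1)
    """
    one_hot = F.one_hot(expert_index, num_experts).float()  # (tokens, num_experts)
    f = one_hot.mean(dim=0)        # fraction of tokens routed to each expert
    p = router_probs.mean(dim=0)   # mean router probability per expert
    return alpha * num_experts * torch.sum(f * p)
```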

Time complexity

k = number of experts activated per token (typically 1 or 2); C_expert = compute cost of a single expert forward pass; N = total number of experts. Total parameter count scales linearly with N, but per-token FLOPs scale only as O(k · C_expert), independent of N.

The key property of sparse MoE is sub-linear scaling of per-token compute relative to total model parameters: doubling the number of experts roughly doubles parameter count but does not change per-token FLOPs. This assumes perfect load balancing; token overflow due to capacity constraints introduces additional overhead.
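A back-of-envelope illustration of this sub-linear scaling; the per-expert dimensions below are hypothetical.

```python
# Hypothetical per-expert FFN dimensions.
d_model, d_ff = 4096, 14336
p_expert = 2 * d_model * d_ff            # parameters in one expert (two weight matrices)
c_expert = 2 * p_expert                  # ~FLOPs per token for one expert forward pass

k = 2                                    # experts activated per token
for n_experts in (8, 16, 64):
    total_params = n_experts * p_expert  # grows linearly with N
    per_token_flops = k * c_expert       # independent of N
    print(f"N={n_experts:3d}  params={total_params/1e9:5.1f}B  FLOPs/token={per_token_flops/1e9:.2f} GF")
```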

Memory complexity

N = number of experts; P_expert = parameter count per expert. All expert weights must be resident in memory simultaneously (or distributed across devices via expert parallelism). The router adds negligible parameters (d_model × N).

In distributed settings with expert parallelism, each device holds 1/D of the experts (D = number of devices), so per-device memory is O(N/D · P_expert). All-to-all communication for token dispatch and result collection adds communication overhead proportional to batch size.
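A rough per-device estimate under expert parallelism; all numbers below are assumptions chosen for illustration.

```python
n_experts = 64                 # N, total experts
p_expert = 117_440_512         # parameters per expert (hypothetical FFN, d_model=4096, d_ff=14336)
bytes_per_param = 2            # bf16 weights
devices = 8                    # D, size of the expert-parallel group

experts_per_device = n_experts // devices                   # N / D
per_device_gb = experts_per_device * p_expert * bytes_per_param / 1e9
print(f"{experts_per_device} experts, {per_device_gb:.1f} GB of expert weights per device")
```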

Bottleneck: All-to-all communication in expert parallelism

In distributed MoE with expert parallelism, tokens must be dispatched from their originating device to the device holding the selected expert, and results must be collected back. This all-to-all communication scales with batch size and number of devices and becomes the dominant bottleneck at large scale.

Parallelism

Conditionally parallel

Within a single device, expert computations are fully parallel across tokens assigned to that expert. Across devices, expert parallelism is used: each device holds a subset of experts and processes only the tokens routed to it.

Paradigm

Conditional

Top-K selected

In the original soft MoE formulation (Jacobs et al. 1991), all experts are weighted and summed (all paths active). The sparse top-k variant (Shazeer et al. 2017) is dominant in modern LLM applications.

Number of Experts (N)

Critical
  • 8: Mixtral 8x7B, Switch Transformer small
  • 64: Switch Transformer large configurations
  • 2048+: Shazeer et al. 2017, up to thousands of experts with hierarchical MoE

Total number of expert sub-networks. Controls the parameter count of the MoE layer. Increasing N scales model capacity without increasing per-token FLOPs. Common values range from 8 to thousands.

Top-k (Active Expert Count)

Critical
  • 1: Switch Transformer
  • 2: Mixtral, GShard; most common default

Number of experts activated per token per MoE layer. k=1 (Switch Transformer) minimizes compute; k=2 is the most common value in practice. Higher k improves routing stability but increases per-token FLOPs.

Capacity factor

Standard
  • 1.25: Recommended starting value per GShard / Switch Transformer
  • 1.0: Tight capacity; some token dropping expected

Multiplier on the average number of tokens per expert per batch. Determines the maximum token buffer per expert. Values above 1.0 reduce token dropping at the cost of higher memory. Values below 1.0 increase dropping.
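A sketch of the usual expert capacity formula (GShard / Switch style); the helper name is illustrative.

```python
import math

def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float, k: int = 1) -> int:
    """Maximum tokens buffered per expert per batch: ceil(cf * k * tokens / N)."""
    return math.ceil(capacity_factor * k * tokens_per_batch / num_experts)

# 4096 tokens, 8 experts, top-1 routing, capacity factor 1.25 -> 640 slots per expert
print(expert_capacity(4096, 8, 1.25))
```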

Auxiliary loss coefficient

Standard

Scaling coefficient (alpha) for the load balancing auxiliary loss added to the training objective. Too high causes instability and degrades model quality; too low leads to expert collapse. Requires careful tuning per model scale.

MoE Layer Frequency

Standard
  • every 2nd layer: Common pattern in many LLM MoE architectures
  • every FFN layer: Switch Transformer

In Transformer-based MoE models, not every FFN layer is replaced by a MoE layer. The interleaving pattern (e.g., every other layer, every 4th layer) controls the tradeoff between expert capacity and communication overhead.
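A sketch of the interleaving pattern; the factory callables make_dense and make_moe are placeholders for a dense FFN layer and a MoE layer.

```python
import torch.nn as nn

def build_ffn_stack(num_layers, moe_every, make_dense, make_moe):
    """Replace every `moe_every`-th FFN with a MoE layer (e.g. moe_every=2 -> every other layer)."""
    return nn.ModuleList(
        make_moe() if (i + 1) % moe_every == 0 else make_dense()
        for i in range(num_layers)
    )

# e.g. build_ffn_stack(24, 2, lambda: nn.Linear(512, 512), lambda: nn.Linear(512, 512))
```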

Common pitfalls

Expert collapse and load imbalance
CRITICAL

Without explicit load balancing, the router converges to routing most or all tokens to a small subset of experts through a self-reinforcing feedback loop: favored experts receive more training signal, become better, and are selected more often. This leaves most experts undertrained and wastes model capacity.

Add an auxiliary load balancing loss to the training objective. Alternatively, use auxiliary-loss-free approaches such as expert-wise routing bias with dynamic updates (DeepSeek approach). Monitor per-expert token counts during training. Consider noisy top-k gating.
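A sketch in the spirit of the auxiliary-loss-free approach mentioned above: a per-expert bias is added only for top-k selection (not for the combine weights) and is nudged after each batch against the observed load. Names and the update rule below are illustrative, not the exact DeepSeek formulation.

```python
import torch

def biased_topk_routing(scores, expert_bias, k=2):
    """scores: (tokens, num_experts) nonnegative affinities (e.g. softmax/sigmoid outputs).
    Select experts with biased scores, but weight outputs with the original scores."""
    _, topk_idx = (scores + expert_bias).topk(k, dim=-1)   # bias affects selection only
    topk_w = scores.gather(-1, topk_idx)
    return topk_w / topk_w.sum(dim=-1, keepdim=True), topk_idx

def update_bias(expert_bias, tokens_per_expert, gamma=1e-3):
    """Push the bias down for overloaded experts and up for underloaded ones."""
    overload = tokens_per_expert.float() - tokens_per_expert.float().mean()
    return expert_bias - gamma * torch.sign(overload)
```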

Difficult auxiliary loss coefficient tuning
HIGH

The auxiliary loss coefficient (alpha) must be carefully tuned. Too large a value causes the auxiliary loss to dominate the training objective, degrading model quality. Too small a value fails to prevent expert collapse. The optimal value depends on model scale, batch size, and number of experts.

Start with values in the range suggested for the chosen architecture (e.g., alpha = 1e-2 in Switch Transformer). Monitor both load balance metrics and downstream task loss. Consider a sweep over alpha early in training at a smaller scale.

Capacity factor overflow and token dropping
MEDIUM

When more tokens are routed to an expert than its capacity allows, the excess tokens are dropped (or passed through a residual connection without expert processing). Token dropping degrades model quality, especially for tokens in high-demand input regions.

Set the capacity factor above 1.0 (e.g., 1.25) to provide buffer. Monitor overflow rates during training. Consider expert choice routing (experts select top-k tokens rather than tokens selecting experts) to guarantee perfect load balancing.
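A sketch of expert-choice routing as mentioned above: each expert picks its top-`capacity` tokens, so load is balanced by construction (though some tokens may be selected by no expert).

```python
import torch

def expert_choice_routing(scores, capacity):
    """scores: (tokens, num_experts). Each expert selects its `capacity` highest-scoring tokens."""
    gate, token_idx = scores.t().topk(capacity, dim=-1)   # (num_experts, capacity)
    return gate, token_idx

gate, token_idx = expert_choice_routing(torch.rand(32, 8), capacity=8)
```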

All-to-all communication overhead in expert parallelism
HIGH

Distributed MoE with expert parallelism requires all-to-all communication to dispatch tokens to their assigned expert devices and collect results. At large scale, this communication overhead can become the dominant bottleneck, especially on clusters with limited inter-node bandwidth.

Minimize the number of MoE layers (use MoE only in a fraction of Transformer layers). Use top-1 instead of top-2 routing to halve dispatch volume. Overlap communication with computation where the framework supports it. Profile all-to-all latency early.

Training instability caused by top-k routing discontinuity
MEDIUM

The top-k selection operation is not differentiable, which can introduce high-variance gradients through the router and cause training instability, especially at large learning rates or with aggressive capacity constraints.

Use gradient clipping. Reduce the learning rate relative to the dense baseline. Apply noisy gating during training to smooth routing decisions. Consider soft MoE or differentiable routing variants if instability persists.
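A sketch of noisy top-k gating in the spirit of Shazeer et al. 2017: learned, input-dependent Gaussian noise is added to the gate logits during training, and the softmax is taken over the kept top-k logits only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        self.w_noise = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, x):
        logits = self.w_gate(x)
        if self.training:
            # Input-dependent noise scale smooths the discrete routing decision.
            logits = logits + torch.randn_like(logits) * F.softplus(self.w_noise(x))
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        return F.softmax(topk_logits, dim=-1), topk_idx   # weights over the k kept experts
```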

GENESIS · Source paper

Adaptive Mixtures of Local Experts
1991 · Neural Computation, vol. 3, no. 1, pp. 79–87 · Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan et al.
1991

Concept of MoE defined — Jacobs, Jordan, Nowlan, Hinton

breakthrough

Jacobs et al. introduce the Mixture of Experts architecture: a system of parallel expert networks with a gating network that produces soft weights over experts via softmax, trained jointly via a supervised learning procedure. Demonstrates task decomposition on a vowel discrimination task.

1994

Hierarchical Mixtures of Experts (HME) — Jordan & Jacobs

Jordan and Jacobs extend the MoE framework to a hierarchical tree structure where each node is itself a gating network, enabling recursive decomposition of the input space. Training is formalized with an EM algorithm.

2017

Sparsely-Gated MoE for deep networks — Shazeer et al., ICLR 2017

breakthrough

Shazeer et al. (Google Brain) introduce the Sparsely-Gated Mixture-of-Experts layer: sparse top-k gating with noisy gating for load balancing, applied convolutionally between LSTM layers. Achieves over 1000x improvement in model capacity with minor computational overhead. Demonstrates models with up to 137 billion parameters. This paper establishes the modern sparse MoE paradigm for large-scale deep learning.

2020

GShard — scaling MoE to 600B parameters with automatic sharding

Lepikhin et al. (Google) apply sparse MoE to Transformer encoder-decoder models at 600B parameter scale using automatic sharding (XLA SPMD). Introduces per-expert capacity limits and random routing for the second expert in top-2 setups to improve load balancing.

2021

Switch Transformer — simplification to top-1 routing and scaling to one trillion parameters

breakthrough

Fedus, Zoph, and Shazeer (Google) demonstrate that top-1 routing (each token routed to exactly one expert) achieves competitive quality with simpler implementation and lower communication overhead than top-2. Scale to 1.6 trillion parameters. Introduce a simplified auxiliary load balancing loss.

2024

Auxiliary-loss-free load balancing — DeepSeek and subsequent MoE architectures

Architectures such as DeepSeek-MoE and subsequent work demonstrate that auxiliary-loss-free load balancing (via an expert-wise bias on routing scores) achieves better model quality than traditional auxiliary-loss approaches, avoiding the gradient interference that load balancing losses introduce into training.

GPU Tensor Cores · PRIMARY

Sparse MoE training and inference at scale requires GPU clusters with high-bandwidth interconnects (NVLink, InfiniBand) for efficient all-to-all communication in expert parallelism. Expert FFN computations are dense matrix multiplications that benefit from Tensor Core acceleration.

Expert parallelism distributes experts across GPUs; each GPU executes dense FFN computation for its assigned experts. All-to-all token dispatch benefits from high-bandwidth inter-GPU connectivity.

TPU · GOOD

Google's GShard and Switch Transformer were developed and trained on TPU pods using XLA SPMD for automatic sharding. TPU's ICI interconnect provides high-bandwidth all-to-all communication well-suited to expert parallelism.

The static shape requirements of XLA/TPU mean the capacity factor and token buffer sizes must be fixed at compile time, making dynamic routing less flexible than on GPU.

BUILT ON

Transformer

Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequence tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so the maximum path length between any two positions is constant (O(1)) rather than linear in sequence length as in RNNs, making long-range dependencies easier to learn. The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). The Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation, quadratic attention complexity with respect to sequence length (O(n²)), is an active research direction (FlashAttention, sliding window attention, linear attention, SSMs).


Related AI models

DeepSeek · GLM · Gemini · Gemma · Grok · Mistral · Other · Snowflake Arctic