AI/ML/NLP · November 20, 2025 · 12 min read

Mixture of Experts Explained: How Mixtral, DeepSeek & Grok Scale to Trillions of Parameters

Understand MoE architecture - the technology behind Mixtral 8x7B, DeepSeek, and Grok. Learn expert routing, load balancing, and why sparse models beat dense ones.


The pursuit of larger and more capable language models has led researchers to a fundamental tension: model performance tends to improve with scale, but computational costs grow prohibitively. Mixture of Experts (MoE) architectures offer an elegant solution to this dilemma, enabling models with trillions of parameters while keeping inference costs manageable through sparse computation.

In this article, we explore the mathematical foundations of MoE architectures, examine how expert routing works, and discuss the engineering challenges that must be overcome for practical deployment.

[Interactive demo: Mixture of Experts routing visualization, showing how the gating network routes different inputs to specialized experts]

The Core Insight: Conditional Computation

Traditional dense neural networks activate every parameter for every input. A 70-billion-parameter model uses all 70 billion parameters for every token it processes, regardless of input complexity. MoE architectures challenge this assumption with a simple observation: not all parameters need to be active for every input.

Consider how human experts operate. When you have a medical question, you consult a physician—not a lawyer, accountant, and physicist simultaneously. MoE models apply this principle computationally: they maintain multiple "expert" subnetworks and route each input to only a subset of relevant experts.

┌─────────────────────────────────────────────────────────────────┐
│                    MoE Transformer Layer                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│    Input Token                                                  │
│         │                                                       │
│         ▼                                                       │
│  ┌─────────────┐                                                │
│  │  Attention  │                                                │
│  └──────┬──────┘                                                │
│         │                                                       │
│         ▼                                                       │
│  ┌─────────────┐      ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐  │
│  │   Router    │─────▶│ Exp 1 │ │ Exp 2 │ │ Exp 3 │ │ Exp N │  │
│  │  (Gating)   │      └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘  │
│  └─────────────┘          │         │         │         │       │
│         │                 └────┬────┴────┬────┘         │       │
│    Routing                     │         │              │       │
│    Weights            Selected Experts (TopK)     Not Selected  │
│         │                      │         │                      │
│         ▼                      ▼         ▼                      │
│  ┌─────────────────────────────────────────┐                    │
│  │     Weighted Sum of Expert Outputs      │                    │
│  └────────────────────┬────────────────────┘                    │
│                       │                                         │
│                       ▼                                         │
│                 Output Token                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

The Gating Mechanism: Mathematical Foundations

The router (or gating network) is the brain of an MoE layer. Given an input representation $x$, it must decide which experts should process this input and with what weights.

Router Architecture

The simplest and most common router uses a learned linear transformation followed by a softmax:

$$p_i = \frac{e^{h(x)_i}}{\sum_j e^{h(x)_j}}$$

Where $h(x) = x \cdot W_g$ is the vector of router logits, $W_g \in \mathbb{R}^{d \times n}$ is a learned weight matrix, $d$ is the hidden dimension, and $n$ is the number of experts.

Top-K Selection

Computing outputs from all experts would defeat the purpose of sparse computation. Instead, MoE models select only the top-$K$ experts (typically $K=1$ or $K=2$):

$$G(x) = \text{Softmax}(\text{TopK}(x \cdot W_g))$$

The TopK operation keeps only the $K$ highest-scoring experts and zeroes out the rest (in practice the discarded logits are masked to $-\infty$, so the softmax assigns them zero weight). The final layer output becomes:

$$y = \sum_{i=1}^{n} G(x)_i \cdot E_i(x)$$

Where $E_i(x)$ is the output of expert $i$. Due to the sparsity induced by TopK, this sum has only $K$ non-zero terms, dramatically reducing computation.
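
To make the mechanics concrete, here is a minimal PyTorch sketch of a top-$K$ MoE layer. The dimensions, the two-layer expert MLPs, and the per-expert Python loop are illustrative assumptions rather than any particular model's implementation; production systems use batched dispatch kernels instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse MoE feed-forward layer: each token is processed by its top-K experts."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # W_g
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                            # x: (tokens, d_model)
        logits = self.router(x)                      # (tokens, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)       # softmax over the K kept experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            rows, slots = (topk_idx == i).nonzero(as_tuple=True)
            if rows.numel() == 0:                    # no tokens routed to expert i
                continue
            # Weight each routed token's expert output by its gate value
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

# Usage: MoELayer()(torch.randn(16, 512)) -> (16, 512); each token touches 2 of 8 experts
```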

A Concrete Example

Consider a layer with 8 experts and $K=2$. For an input token:

Router logits:  [0.10, 0.42, 0.08, 0.03, 0.12, 0.05, 0.09, 0.06]
                  E1    E2    E3    E4    E5    E6    E7    E8

After TopK(2):  [0.00, 0.42, 0.00, 0.00, 0.12, 0.00, 0.00, 0.00]

After Softmax:  [0.00, 0.57, 0.00, 0.00, 0.43, 0.00, 0.00, 0.00]
                       ▲                   ▲
                   Selected            Selected

Only experts 2 and 5 are activated, with weights 0.57 and 0.43 respectively.
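
The same selection can be verified in a couple of lines of PyTorch, treating the scores above as router logits (a toy check, not part of any model):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([0.10, 0.42, 0.08, 0.03, 0.12, 0.05, 0.09, 0.06])
vals, idx = logits.topk(2)
print(idx + 1)                  # tensor([2, 5])  -> experts E2 and E5
print(F.softmax(vals, dim=-1))  # tensor([0.5744, 0.4256])  -> weights ~0.57 and ~0.43
```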

Expert Specialization: Emergent Division of Labor

One of the most fascinating aspects of MoE models is that experts develop specializations without explicit supervision. Through gradient descent alone, different experts learn to handle different types of inputs.

Observed Specialization Patterns

Research has revealed consistent specialization patterns:

Syntactic experts tend to activate for specific grammatical constructions—one expert might specialize in relative clauses while another handles conditional statements.

Domain experts emerge for topics like code, mathematics, or specific languages in multilingual models.

Positional experts sometimes develop, with certain experts preferentially activating for tokens at specific positions in the sequence.

┌────────────────────────────────────────────────────────────────┐
│              Expert Specialization Visualization               │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  Input: "The quantum computer calculated the eigenvalues"      │
│                                                                │
│  Token        Expert Activations (darker = higher weight)      │
│  ─────        ──────────────────────────────────────────       │
│                E1   E2   E3   E4   E5   E6   E7   E8           │
│  "The"        [██] [  ] [  ] [░░] [  ] [  ] [  ] [  ]          │
│  "quantum"    [  ] [  ] [██] [  ] [  ] [░░] [  ] [  ]          │
│  "computer"   [  ] [  ] [██] [  ] [  ] [░░] [  ] [  ]          │
│  "calculated" [  ] [██] [  ] [  ] [░░] [  ] [  ] [  ]          │
│  "the"        [██] [  ] [  ] [░░] [  ] [  ] [  ] [  ]          │
│  "eigenvalues"[  ] [  ] [██] [  ] [  ] [  ] [░░] [  ]          │
│                                                                │
│  Legend: [██] Primary expert  [░░] Secondary expert            │
│                                                                │
│  E1: Function words    E3: Technical/scientific terms          │
│  E2: Verbs             E6,E7: Mathematics context              │
│                                                                │
└────────────────────────────────────────────────────────────────┘

The Load Balancing Challenge

A critical challenge in MoE training is expert collapse: without intervention, the router often converges to using only a small subset of experts, wasting capacity. This occurs because popular experts receive more gradient signal, improving further and attracting even more traffic—a rich-get-richer dynamic.

Load Balancing Loss

To encourage uniform expert utilization, MoE models add an auxiliary loss term:

$$\mathcal{L}_{\text{balance}} = \alpha \cdot n \cdot \sum_{i=1}^{n} f_i \cdot P_i$$

Where:

  • $n$ is the number of experts
  • $f_i$ is the fraction of tokens routed to expert $i$
  • $P_i$ is the average router probability for expert $i$
  • $\alpha$ is a hyperparameter (typically 0.01-0.1)

This loss is minimized when both $f_i$ and $P_i$ are uniform across experts (equal to $1/n$), encouraging balanced routing.
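
As a rough illustration, the auxiliary loss can be computed from the router outputs along the following lines. This is a sketch in the spirit of the Switch Transformer formulation; the tensor shapes, the handling of $K>1$ (counting a token once per selected expert), and the default $\alpha$ are assumptions:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, topk_idx, n_experts, alpha=0.01):
    # router_logits: (tokens, n_experts); topk_idx: (tokens, K) selected expert ids
    probs = F.softmax(router_logits, dim=-1)           # full router distribution
    P = probs.mean(dim=0)                              # P_i: mean router probability per expert
    hits = F.one_hot(topk_idx, n_experts).sum(dim=1)   # (tokens, n_experts), 1 where expert chosen
    f = hits.float().mean(dim=0)                       # f_i: fraction of tokens sent to expert i
    return alpha * n_experts * torch.sum(f * P)

# Usage: total_loss = task_loss + load_balancing_loss(logits, topk_idx, n_experts=8)
```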

Capacity Factor

Another mechanism involves setting a capacity factor $C$ that limits how many tokens each expert can process:

$$\text{Expert Capacity} = \frac{C \cdot \text{batch\_tokens}}{n}$$

Tokens that would exceed an expert's capacity are either dropped or routed to their second-choice expert. Typical values are $C \in [1.0, 2.0]$.
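
As a quick worked example of the formula (the batch size and capacity factors below are arbitrary):

```python
import math

def expert_capacity(batch_tokens: int, n_experts: int, capacity_factor: float = 1.25) -> int:
    # Tokens an expert may process before overflow tokens are dropped or re-routed
    return math.ceil(capacity_factor * batch_tokens / n_experts)

print(expert_capacity(4096, 8))        # 640 tokens per expert
print(expert_capacity(4096, 8, 2.0))   # 1024 tokens per expert
```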

Efficient Deployment: Pruning and Optimization

While MoE models offer computational savings during training and inference, their large memory footprint (all experts must be stored) poses deployment challenges. Recent research has addressed this through expert pruning and dynamic skipping.

Xie et al. (2024) introduced methods for expert-level pruning and dynamic skipping in their paper "Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models." Their key insight is that expert importance varies significantly across layers and tasks—some experts can be removed entirely with minimal performance degradation, while others can be skipped dynamically based on input characteristics.

┌─────────────────────────────────────────────────────────────────┐
│                Expert Pruning Strategy                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Original Model (8 experts per layer)                           │
│  ┌────┬────┬────┬────┬────┬────┬────┬────┐                      │
│  │ E1 │ E2 │ E3 │ E4 │ E5 │ E6 │ E7 │ E8 │  Layer 1            │
│  └────┴────┴────┴────┴────┴────┴────┴────┘                      │
│  ┌────┬────┬────┬────┬────┬────┬────┬────┐                      │
│  │ E1 │ E2 │ E3 │ E4 │ E5 │ E6 │ E7 │ E8 │  Layer 2            │
│  └────┴────┴────┴────┴────┴────┴────┴────┘                      │
│                                                                 │
│  After Importance-Based Pruning                                 │
│  ┌────┬────┬────┬    ┬────┬    ┬────┬────┐                      │
│  │ E1 │ E2 │ E3 │    │ E5 │    │ E7 │ E8 │  Layer 1 (6 experts)│
│  └────┴────┴────┴    ┴────┴    ┴────┴────┘                      │
│  ┌────┬    ┬────┬────┬    ┬────┬────┬    ┐                      │
│  │ E1 │    │ E3 │ E4 │    │ E6 │ E7 │    │  Layer 2 (5 experts)│
│  └────┴    ┴────┴────┴    ┴────┴────┴    ┘                      │
│                                                                 │
│  Memory reduction: ~30%  |  Performance retention: ~98%         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
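
For intuition, one very simple pruning criterion is to rank experts by how often the router selects them on a calibration set and keep only the most-used experts in each layer. The sketch below implements that toy frequency heuristic; it is not the importance metric proposed by Xie et al. (2024), and the routing trace and expert counts are hypothetical:

```python
import torch

def experts_to_keep(topk_idx, n_experts, n_keep=6):
    # topk_idx: (tokens, K) expert ids chosen by the router on calibration data
    counts = torch.bincount(topk_idx.flatten(), minlength=n_experts)  # selections per expert
    keep = counts.argsort(descending=True)[:n_keep]                   # most frequently used experts
    return sorted(keep.tolist())

# Usage (hypothetical trace): experts_to_keep(trace, n_experts=8) might return [0, 1, 2, 4, 6, 7]
```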

Edge Deployment Considerations

For resource-constrained environments like edge devices, Chen et al. (2024) explored inference offloading strategies in "Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things." Their work demonstrates that MoE's inherent sparsity can be leveraged for efficient distributed inference, where different experts reside on different devices and are activated on-demand.

Architectural Variations

Expert Granularity

The definition of "expert" varies across architectures:

Coarse-grained experts use full feed-forward networks (FFN) as experts. This is the classic approach used in models like Switch Transformer.

Fine-grained experts split the FFN into smaller chunks. For instance, some architectures treat each neuron group as a separate expert, enabling more precise routing.

Coarse-Grained (Classic MoE)          Fine-Grained MoE
┌─────────────────────┐               ┌─────────────────────┐
│                     │               │ ┌───┬───┬───┬───┐   │
│   Expert FFN        │               │ │ E1│ E2│ E3│ E4│   │
│   (full network)    │               │ ├───┼───┼───┼───┤   │
│                     │               │ │ E5│ E6│ E7│ E8│   │
│                     │               │ └───┴───┴───┴───┘   │
└─────────────────────┘               └─────────────────────┘
Parameters: ~100M                     Parameters: ~12.5M each

Router Variations

Beyond simple linear routers, researchers have explored:

  • Hash-based routing: Deterministic routing based on input hashing (no learned parameters)
  • Expert choice routing: Experts select tokens rather than tokens selecting experts (sketched below)
  • Hierarchical routing: Two-stage routing for models with many experts
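
Of these, expert choice routing is easy to sketch: each expert (a column of the token-expert score matrix) picks its highest-affinity tokens up to a fixed capacity. The shapes and the single-step capacity handling here are simplifying assumptions:

```python
import torch

def expert_choice_route(router_logits, capacity):
    # router_logits: (tokens, n_experts) token-expert affinity scores
    scores = router_logits.softmax(dim=-1)
    # Each expert (column) selects its `capacity` highest-scoring tokens
    gate_vals, token_idx = scores.topk(capacity, dim=0)   # both (capacity, n_experts)
    return gate_vals, token_idx

# Usage: gates, chosen = expert_choice_route(torch.randn(64, 8), capacity=16)
```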

Training Considerations

Batch Size Requirements

MoE models typically require larger batch sizes than dense models. With sparse routing, each expert sees only a fraction of the batch: with a batch of 4,096 tokens, 64 experts, and $K=2$, each expert receives roughly 4,096 × 2 / 64 = 128 tokens per step on average. If the batch is too small, individual experts receive insufficient gradient signal.

Communication Overhead

In distributed training, tokens must be routed to experts that may reside on different devices. This all-to-all communication can become a bottleneck:

┌─────────────────────────────────────────────────────────────────┐
│           Distributed MoE Communication Pattern                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Device 0          Device 1          Device 2          Device 3│
│  ┌────────┐        ┌────────┐        ┌────────┐        ┌───────┐│
│  │ E0, E1 │        │ E2, E3 │        │ E4, E5 │        │ E6, E7││
│  └────────┘        └────────┘        └────────┘        └───────┘│
│       ▲                 ▲                 ▲                 ▲   │
│       │                 │                 │                 │   │
│       └────────────────All-to-All────────────────────────────┘   │
│                     Communication                               │
│       ┌────────────────────────────────────────────────────┐    │
│       │              Input Batch Tokens                    │    │
│       │  (each token routed to its selected expert)        │    │
│       └────────────────────────────────────────────────────┘    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
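
Conceptually, the step that feeds this all-to-all is a dispatch that groups tokens by destination expert so each device can send and receive contiguous buffers. The sketch below shows that grouping for the $K=1$ case on a single device; the distributed exchange itself (e.g., an all-to-all collective) is omitted and the shapes are illustrative:

```python
import torch

def dispatch(x, expert_idx, n_experts):
    # x: (tokens, d_model); expert_idx: (tokens,) chosen expert per token (K=1 case)
    order = expert_idx.argsort()                                    # group tokens by destination
    send_counts = torch.bincount(expert_idx, minlength=n_experts)   # tokens bound for each expert
    return x[order], send_counts, order                             # buffers handed to the all-to-all

# Usage: buf, counts, order = dispatch(torch.randn(32, 512), torch.randint(0, 8, (32,)), 8)
```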

The Future of Sparse Models

MoE architectures represent a fundamental shift in how we think about neural network scaling. Rather than asking "how can we train larger dense models?", the question becomes "how can we allocate computation more intelligently?"

Current research directions include:

  1. Adaptive computation: Dynamically adjusting the number of experts based on input complexity
  2. Multimodal MoE: Separate expert pools for different modalities (text, image, audio)
  3. Retrieval-augmented MoE: Treating retrieved documents as additional "experts"
  4. Hardware-aware routing: Optimizing routing decisions based on device topology

As language models continue to scale, sparse architectures like MoE will likely play an increasingly central role—enabling the next generation of AI systems that are both more capable and more efficient than their dense predecessors.

References

  1. Xie, X., et al. (2024). "Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models." Semantic Scholar

  2. Chen, W., et al. (2024). "Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things." Semantic Scholar
