7 Beyond Dense Transformers: MoE, SSMs, Hybrids

The dense transformer of Chapter 6 runs every token through every parameter. This chapter is about the two ways the frontier breaks that contract: a mixture-of-experts layer that routes a token to a few of its many experts, and a state-space model that replaces attention’s quadratic mixing with a linear recurrence. By the end a reader can explain why sparse models grow their parameter count almost for free, what the router has to get right for that bet to pay off, and where state-space and hybrid stacks sit as the live alternative to attention.

7.1 Problem

A dense feed-forward block runs every token through all of its parameters. That couples two things a frontier builder wants to separate: the model’s capacity, how much it can know, and its per-token compute, how much each forward and backward step costs. Under that coupling, the only way to add knowledge is to add FLOPs, and FLOPs are the budget Chapter 3 spends. A model that wants more capacity than its compute budget can afford has no move inside the dense design.

The second problem is attention’s own. Self-attention mixes every token with every other, so its cost grows with the square of sequence length, and the key-value cache it leaves behind grows linearly with context and dominates serving memory (Chapter 16). For long context that quadratic is the wall. The dense transformer therefore carries two separate inefficiencies: a feed-forward block that spends compute on capacity it may not need per token, and an attention block that spends compute and memory on a mixing pattern that may be denser than the task requires. MoE attacks the first. State-space models attack the second.

7.2 Design

The mixture-of-experts answer is to decouple parameters from active FLOPs. An MoE layer holds \(N\) expert feed-forward networks but routes each token to only \(k\) of them, with \(k \ll N\). The identity that matters: total parameters scale capacity and knowledge, while active parameters, the \(k\) chosen experts plus the always-on parts of the model, scale per-token compute and step cost. Capacity becomes cheap relative to FLOPs, which is the whole frontier appeal. The same arithmetic cuts inference FLOPs too, though serving an MoE is its own problem and out of scope here.
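The split between total and active parameters is easy to check with arithmetic. The sketch below assumes each expert is a plain two-matrix feed-forward block and lumps attention, embeddings, and norms into a single `other_params` figure; the function name and the configuration are hypothetical, chosen only to make the ratio visible.

```python
def moe_param_counts(d_model, d_ff, n_experts, k, n_layers, other_params):
    """Return (total, active) parameter counts for a stack of MoE layers.

    Assumes a two-matrix expert (d_model x d_ff, then d_ff x d_model) with
    no biases. `other_params` covers the always-on parts of the model.
    """
    expert = 2 * d_model * d_ff          # one expert feed-forward network
    router = d_model * n_experts         # linear gate: one score per expert
    total = n_layers * (n_experts * expert + router) + other_params
    active = n_layers * (k * expert + router) + other_params
    return total, active


# Hypothetical configuration, not taken from any specific model.
total, active = moe_param_counts(
    d_model=4096, d_ff=14336, n_experts=8, k=2, n_layers=32,
    other_params=1_500_000_000,
)
print(f"total  = {total / 1e9:.1f}B parameters stored")
print(f"active = {active / 1e9:.1f}B parameters used per token")
```

Raising `n_experts` moves only `total`; raising `k` moves only `active`. That independence is the decoupling the design is built on.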

The new learned object is the router, also called the gate. It is a small linear map from a token’s hidden state to one score per expert. A softmax or sigmoid turns scores into weights, the top-\(k\) experts are selected, and their outputs are combined weighted by the renormalized gate values. In sketch form:

# router: a linear map h -> scores over N experts
scores = h @ W_gate                  # (tokens, N)
weights = softmax(scores)            # or sigmoid per-expert
gates, idx = topk(weights, k)        # gate values and expert indices
gates = gates / sum(gates)           # renormalize over the chosen k
y = sum(g * experts[i](h) for g, i in zip(gates, idx))

This tiny map is the only genuinely new component, and it is the source of almost every MoE pathology. The whole rest of the design exists to keep the router honest.
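For readers who want something executable, here is a minimal single-token version of the router in plain Python. It assumes a softmax gate and treats experts as arbitrary callables; `route_token` and its argument layout are illustrative names, not any library's API.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def route_token(h, w_gate, experts, k):
    """Route one token through its top-k experts.

    h       : hidden state, a list of d floats
    w_gate  : router weights, d rows of N floats (one column per expert)
    experts : list of N callables, each mapping a d-vector to a d-vector
    Returns the combined output and the chosen expert indices.
    """
    n = len(experts)
    # Router: scores[j] = sum_i h[i] * w_gate[i][j].
    scores = [sum(h[i] * w_gate[i][j] for i in range(len(h))) for j in range(n)]
    weights = softmax(scores)
    # Top-k selection: indices of the k largest gate weights.
    top = sorted(range(n), key=lambda j: weights[j], reverse=True)[:k]
    # Renormalize the selected gate values so they sum to 1.
    norm = sum(weights[j] for j in top)
    y = [0.0] * len(h)
    for j in top:
        g = weights[j] / norm
        out = experts[j](h)
        y = [yi + g * oi for yi, oi in zip(y, out)]
    return y, top


# Toy usage: four "experts" that just scale their input.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
w_gate = [[0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 0.0]]
y, chosen = route_token([1.0, 0.0], w_gate, experts, k=2)
print(chosen, y)
```

Only the `k` chosen experts are ever called, which is where the compute saving comes from; the sort is the discontinuous step that later sections have to stabilize.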

The first thing the router gets wrong on its own is balance. Left alone, the gate collapses onto a few favored experts in a winner-take-all loop: an expert that is chosen gets more gradient, improves, and is chosen more, while the rest starve and become dead parameters. Two families of fix exist. The auxiliary load-balancing loss, introduced by Shazeer et al. (2017) and refined in GShard and Switch into its now-standard form, adds a penalty on the product of the fraction of tokens routed to each expert and the mean gate probability placed on it, summed over experts; the penalty is smallest when routing is uniform. It works, but it injects a loss whose weight trades balance against task quality. The newer answer, loss-free or bias-based balancing as used in DeepSeek-V3, drops the auxiliary loss as the balancing mechanism. Instead it keeps a per-expert routing bias, nudged up for under-used experts and down for over-used ones between steps. The bias is added to the scores only when selecting the top-\(k\), not when weighting expert outputs, so it steers selection without adding an interfering gradient. That sidesteps the balance-versus-quality tension the auxiliary loss creates.
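Both balancing families fit in a few lines. The sketch below assumes the Switch-style form of the auxiliary loss, \(\alpha N \sum_i f_i P_i\), with top-1 assignments for brevity, and a sign-based bias update for the loss-free variant. The function names, `alpha`, and `step` are placeholders rather than values from any paper.

```python
def aux_balance_loss(probs, assignments, n_experts, alpha=0.01):
    """Auxiliary loss alpha * N * sum_i f_i * P_i.

    probs       : per-token router probabilities, a list of N-vectors
    assignments : per-token chosen expert index (top-1 for brevity)
    f_i is the fraction of tokens sent to expert i, P_i the mean
    probability the router placed on expert i.
    """
    t = len(probs)
    f = [0.0] * n_experts
    p = [0.0] * n_experts
    for tok_probs, e in zip(probs, assignments):
        f[e] += 1.0 / t
        for i in range(n_experts):
            p[i] += tok_probs[i] / t
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


def update_routing_bias(bias, tokens_per_expert, step=0.001):
    """Loss-free balancing: nudge bias up if under-used, down if over-used."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    new_bias = []
    for b, load in zip(bias, tokens_per_expert):
        if load < mean_load:
            new_bias.append(b + step)
        elif load > mean_load:
            new_bias.append(b - step)
        else:
            new_bias.append(b)
    return new_bias


def select_with_bias(scores, bias, k):
    """Choose top-k by score + bias. Gate values still come from raw scores."""
    n = len(scores)
    return sorted(range(n), key=lambda i: scores[i] + bias[i], reverse=True)[:k]


uniform = aux_balance_loss([[0.25] * 4] * 4, [0, 1, 2, 3], n_experts=4)
collapsed = aux_balance_loss([[1.0, 0.0, 0.0, 0.0]] * 4, [0, 0, 0, 0], n_experts=4)
print(uniform, collapsed)  # the collapsed router pays a larger penalty
```

The auxiliary loss acts through the gradient, so it competes with the task loss; the bias update is applied outside the backward pass and changes only which experts are picked.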

Two refinements change what an “expert” is. Fine-grained experts slice each expert into smaller ones and raise \(k\) proportionally, so the same active FLOPs buy far more routing combinations and sharper specialization. Shared experts are a few that every token always passes through, absorbing the common, domain-general computation so the routed experts can specialize instead of each re-learning the basics. DeepSeekMoE combines both, and the pair is now a common default for large open MoE models.
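The gain from fine-grained experts is combinatorial, and a quick count makes it concrete. The sketch assumes each expert is split into `split` equal pieces and `k` is raised by the same factor, so active FLOPs stay constant; the sizes are illustrative.

```python
from math import comb


def routing_combinations(n_experts, k, split=1):
    """Number of distinct expert subsets a token can be routed to."""
    return comb(n_experts * split, k * split)


coarse = routing_combinations(16, 2)           # 16 experts, top-2
fine = routing_combinations(16, 2, split=4)    # 64 quarter-size experts, top-8
print(coarse)  # 120
print(fine)    # 4426165368
```

Same parameter count and same active compute, but tens of millions of times more ways to compose a token's computation.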

One systems fact intrudes on this otherwise clean picture. Routing is dynamic, but hardware wants fixed-shape buffers, so each expert is given a capacity cap, roughly capacity_factor * (tokens / experts). A token routed to an over-full expert is either dropped, meaning it skips the expert and passes through on the residual connection alone, or rerouted to its next choice. The capacity factor is a direct knob: set it low and you save memory and communication but drop more tokens, so those tokens get no expert compute and training gets noisier; set it high and you waste memory and compute on padding for under-full experts.
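The capacity trade-off can be seen on a toy batch. This sketch assumes top-1 routing, first-come-first-served buffer filling in token order, and dropping rather than rerouting on overflow; real systems differ on all three, and the function names are hypothetical.

```python
import math


def expert_capacity(n_tokens, n_experts, capacity_factor):
    """Fixed buffer size per expert, rounded up to a whole token."""
    return math.ceil(capacity_factor * n_tokens / n_experts)


def dispatch_with_capacity(assignments, n_experts, capacity_factor):
    """Fill per-expert buffers in token order and drop on overflow.

    assignments : chosen expert index for each token (top-1)
    Returns (buffers, dropped, padding), where padding counts the
    unused buffer slots across all experts.
    """
    cap = expert_capacity(len(assignments), n_experts, capacity_factor)
    buffers = [[] for _ in range(n_experts)]
    dropped = []
    for tok, e in enumerate(assignments):
        if len(buffers[e]) < cap:
            buffers[e].append(tok)
        else:
            dropped.append(tok)  # token skips the expert, residual only
    padding = sum(cap - len(b) for b in buffers)
    return buffers, dropped, padding


# Eight tokens, four experts, a skewed router.
assignments = [0, 0, 0, 0, 0, 1, 1, 2]
for cf in (1.0, 2.0):
    _, dropped, padding = dispatch_with_capacity(assignments, 4, cf)
    print(f"capacity_factor={cf}: dropped={len(dropped)}, padding={padding}")
```

On this batch, doubling the capacity factor cuts the dropped tokens from three to one but triples the padded slots, which is the knob's two-sided cost in miniature.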

State-space models answer the other problem, attention’s quadratic. An SSM maintains a fixed-size hidden state and updates it token by token through a linear recurrence, then reads an output from it:

\[ h_t = A\, h_{t-1} + B\, x_t, \qquad y_t = C\, h_t \]

Because the state is fixed in size, the cost is linear in sequence length and there is no growing key-value cache to serve, only a constant-size state. The price is that all of a token’s past must be summarized into that fixed state, where attention keeps every past token addressable. The design question is whether a learned, input-dependent recurrence can compress the past well enough to compete. Mamba’s contribution was to make the recurrence selective: \(B\), \(C\), and the discretization step size \(\Delta\) depend on the input, and \(\Delta\) in turn makes the discretized \(A\) input-dependent, so the model can choose what to keep and what to forget, recovering much of what content-based attention does while staying linear.
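The recurrence is short enough to run by hand. The sketch below assumes a diagonal \(A\) and scalar inputs to keep the arithmetic visible, and models selectivity with a hypothetical `params_fn` that returns the recurrence parameters for each input. It is a sequential reference loop, not the hardware-efficient scan real implementations use.

```python
def ssm_scan(xs, a, b, c):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t with diagonal A.

    xs      : sequence of scalar inputs
    a, b, c : lists of length n (diagonal of A, column B, row C)
    Returns the outputs and the final state.
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys, h


def selective_scan(xs, params_fn, n_state):
    """Same recurrence, but the parameters are recomputed from each input."""
    h = [0.0] * n_state
    ys = []
    for x in xs:
        a, b, c = params_fn(x)  # input-dependent: choose what to keep
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys, h


# Fixed recurrence: an impulse decays geometrically.
print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0])[0])  # [2.0, 1.0, 0.5]


# Toy selectivity: a negative input wipes the state, others accumulate.
def reset_on_negative(x):
    return ([0.0] if x < 0 else [1.0]), [1.0], [1.0]


print(selective_scan([1.0, 2.0, -1.0, 3.0], reset_on_negative, n_state=1)[0])
```

The state `h` has the same size at step 3 as at step 3 million. That is the memory advantage, and the compression burden, in one variable.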

In practice the frontier does not choose one or the other. Hybrid stacks interleave a minority of full-attention layers with a majority of SSM or linear-recurrence layers, so a few global-mixing layers preserve exact recall while the linear layers carry the bulk of the sequence cheaply. MoE and SSM are also orthogonal: a hybrid can put its feed-forward blocks behind an MoE router and its sequence-mixing behind a state-space recurrence at the same time.
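Because the two choices are independent, a hybrid stack can be described by two periods: how often a layer uses attention for sequence mixing, and how often its feed-forward block is an MoE. The layout helper below is a hypothetical illustration; the ratios in the usage line are placeholders, not a recommendation.

```python
def hybrid_layout(n_layers, attn_every, moe_every):
    """Return a (sequence_mixer, feed_forward) pair for each layer.

    Every `attn_every`-th layer uses full attention, the rest an SSM.
    Every `moe_every`-th layer uses an MoE feed-forward, the rest dense.
    """
    layers = []
    for i in range(n_layers):
        mixer = "attention" if (i + 1) % attn_every == 0 else "ssm"
        ffn = "moe" if (i + 1) % moe_every == 0 else "dense"
        layers.append((mixer, ffn))
    return layers


for idx, layer in enumerate(hybrid_layout(8, attn_every=8, moe_every=2)):
    print(idx, layer)
```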

7.3 Evolution

Sparse experts predate the transformer era as a general idea, but the line that matters here begins with Shazeer et al. (2017), who introduced top-\(k\) gating and the auxiliary balancing loss inside an LSTM language model. GShard (Lepikhin et al., 2020) carried the layer into the transformer and into automatic sharding across devices. Switch Transformers (Fedus et al., 2021) pushed to the cheapest possible router, \(k=1\), and to trillion-parameter scale, showing top-1 routing could train stably with the right tricks. GLaM (Du et al., 2022) made the efficiency case at scale, and Mixtral (Jiang et al., 2024) brought a strong open \(k=2\) model into wide use.

Two threads then refined the layer. On routing, expert-choice (Zhou et al., 2022) inverted the question: instead of each token picking its experts, each expert picks its tokens, which makes per-expert load exact by construction and removes capacity overflow, at the cost that some tokens may be picked by no expert at all. BASE Layers (Lewis et al., 2021) framed routing as a balanced assignment problem outright. On specialization, DeepSeekMoE (Dai et al., 2024) introduced fine-grained and shared experts, and DeepSeek-V3 (2024) replaced the auxiliary loss with bias-based balancing at frontier scale, with the standalone method documented separately by Wang et al. (2024).
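The inversion in expert-choice routing is easiest to see in code. In this minimal sketch each expert takes its `capacity` highest-scoring tokens, so every buffer is exactly full and nothing overflows, but a token that no expert ranks highly is left unrouted. The function name and score layout are assumptions for illustration.

```python
def expert_choice(scores, capacity):
    """Each expert selects its top `capacity` tokens by affinity.

    scores[t][e] : affinity between token t and expert e
    Returns the token list per expert and the tokens no expert picked.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    chosen = []
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        chosen.append(sorted(ranked[:capacity]))
    covered = {t for toks in chosen for t in toks}
    unrouted = [t for t in range(n_tokens) if t not in covered]
    return chosen, unrouted


scores = [[0.9, 0.8], [0.7, 0.6], [0.1, 0.2], [0.3, 0.4]]
chosen, unrouted = expert_choice(scores, capacity=2)
print(chosen)    # [[0, 1], [0, 1]]
print(unrouted)  # [2, 3]
```

Both experts carry exactly two tokens, and tokens 2 and 3 receive no expert compute: perfect balance and uneven coverage come from the same mechanism.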

A parallel thread made MoE practical without paying for a from-scratch run. Sparse upcycling (Komatsuzaki et al., 2022) initializes each expert as a copy of a trained dense feed-forward block, adds a fresh router, and continues training; the experts diverge as routing specializes them. It buys strong quality for a fraction of from-scratch compute, with the failure mode that experts can stay near-identical if routing pressure is too weak, leaving MoE’s cost without its benefit.
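The upcycling recipe is mechanically simple. The sketch below assumes the dense feed-forward block is a dictionary of weight matrices stored as nested lists, and that the fresh router is drawn from a small Gaussian; `upcycle_ffn`, `init_scale`, and the dictionary layout are illustrative choices, not the paper's code.

```python
import copy
import random


def upcycle_ffn(dense_ffn, n_experts, d_model, init_scale=0.02, seed=0):
    """Build an MoE layer from one trained dense feed-forward block.

    Every expert starts as an independent copy of the dense weights.
    The router is new and randomly initialized, because the dense
    model has nothing to copy it from.
    """
    rng = random.Random(seed)
    experts = [copy.deepcopy(dense_ffn) for _ in range(n_experts)]
    router = [
        [rng.gauss(0.0, init_scale) for _ in range(n_experts)]
        for _ in range(d_model)
    ]
    return {"experts": experts, "router": router}


dense = {"w_in": [[0.1, 0.2], [0.3, 0.4]], "w_out": [[0.5, 0.6], [0.7, 0.8]]}
moe = upcycle_ffn(dense, n_experts=4, d_model=2)
print(all(e == dense for e in moe["experts"]))  # True: identical at step 0
```

At initialization every expert computes the same function, so the layer starts out behaving like the dense block it came from. Only the routed gradient can pull the copies apart, which is why weak routing pressure leaves them near-identical.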

The state-space line is younger as a serious attention competitor. The structured state-space model S4 (Gu et al., 2021) showed a carefully parametrized linear recurrence could handle very long sequences, and Mamba (Gu and Dao, 2023) made the recurrence selective and hardware-efficient, bringing it within reach of language modeling at scale. Mamba-2 (Dao and Gu, 2024) tied state-space models back to attention through a duality that lets the recurrence be computed largely with the matrix multiplications that attention kernels and accelerators are already optimized for. The hybrid answer arrived quickly: Jamba (Lieber et al., 2024) interleaved Mamba and attention layers and added MoE on top, demonstrating that the three ideas compose into one stack.

7.4 Trade-offs

Sparse versus dense. MoE buys more capacity per training FLOP and per active FLOP, but pays in memory, since all experts are stored and sharded; in communication, since expert parallelism needs an all-to-all (Chapter 8); in training fragility, since the router brings its own instabilities; and in a harder serving story. At small scale or under tight memory, dense wins. The crossover favors MoE as total scale grows.
Top-1 versus top-\(k\). A router with \(k=1\) is cheapest and simplest but more brittle and less expressive; \(k=2\) is the common sweet spot for coarse, full-size experts; at a fixed expert size, higher \(k\) approaches dense cost and erodes the advantage.
Capacity factor. Lower is cheaper but drops more tokens and adds noise; higher wastes padding. There is no single right value, and it is often raised at evaluation and inference, where dropping a token is unacceptable.
Attention versus state-space. Attention keeps every past token exactly addressable and excels at precise recall, at quadratic cost and a growing key-value cache. An SSM is linear in length with a constant-size state, but must compress the past, and tends to lag attention on tasks that need exact long-range lookup. Hybrids exist precisely because neither pure form dominates.
Train and serve complexity. Every choice here adds operational surface: a router to tune, balance to monitor, expert placement to lay out, and a memory and communication profile that Chapter 8 must absorb. The quality win is real but never free engineering.
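The attention-versus-state-space memory trade-off reduces to two formulas. The sketch below assumes a key-value cache of two tensors per attention layer and an SSM state of `d_inner * d_state` elements per recurrent layer, ignoring smaller buffers; all sizes are hypothetical and stored at two bytes per element.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence key-value cache: grows linearly with context length."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, d_inner, d_state, bytes_per_elem=2):
    """Per-sequence recurrent state: independent of context length."""
    return n_ssm_layers * d_inner * d_state * bytes_per_elem


# Hypothetical 32-layer models: pure attention vs. a 4-attention hybrid.
for seq_len in (4_096, 131_072):
    pure = kv_cache_bytes(32, 8, 128, seq_len)
    hybrid = kv_cache_bytes(4, 8, 128, seq_len) + ssm_state_bytes(28, 8192, 16)
    print(f"{seq_len:>7} tokens: pure={pure / 2**20:.0f} MiB, hybrid={hybrid / 2**20:.0f} MiB")
```

The hybrid's remaining attention layers still pay the linear term, just on fewer layers, so the saving scales with the fraction of layers converted rather than disappearing entirely.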

What’s contested

Whether state-space and hybrid models should displace the dense attention transformer is not settled. The case for them is the linear cost and the constant-size state, which matter most at long context and at serving time. The case against is recall: pure SSMs still trail attention on tasks that need exact retrieval from far back in the sequence, which is why every strong long-context system so far keeps some full-attention layers rather than going pure-recurrent. How few attention layers a stack can keep, and whether a future selective recurrence closes the recall gap entirely, are open questions. Treat “attention is replaceable” as a live hypothesis the frontier is testing, not a settled result.

Constraint arrow

The serving layer reaches up and sets sequence-mixing design. The key-value cache footprint and the quadratic cost of attention at long context (Chapter 16) are what make a linear-cost, constant-state recurrence worth its recall penalty in the first place. A choice that looks like pure architecture, attention versus state-space, is driven by what inference will cost over the model’s life, the same arrow that justified over-training in Chapter 3.

7.5 Implementation

The router is the part to get right, and its instabilities have a specific cause. The top-\(k\) selection is an argmax, and argmax is discontinuous: a small change in scores flips a token’s expert, making the loss landscape jagged and spike-prone, worse in low precision. The standard mitigations are narrow and worth naming. The router z-loss from ST-MoE (Zoph et al., 2022) penalizes large router logits to keep the gate numerically well-behaved; note this is distinct from the output-logit z-loss in the stability kit of Chapter 3. Keeping the router in fp32 under a bf16 or fp8 body costs little and removes a class of spikes. And early routing decisions can lock in before experts have differentiated, so a run is worth watching for that in its first phase. MoE loss curves are spikier than dense curves at the same scale; an unguarded router, with no z-loss and bf16 logits, can spike and fail to recover days into a run.
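The router z-loss is a one-line penalty on the log-partition of the router logits. A minimal sketch follows, assuming the squared log-sum-exp form averaged over tokens; the coefficient default is a placeholder to be tuned, and in a real run the logits would be kept in fp32 as the text recommends.

```python
import math


def logsumexp(xs):
    """Stable log(sum(exp(x))) for a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def router_z_loss(logits, coeff=1e-3):
    """Mean over tokens of logsumexp(router logits)^2, scaled by coeff.

    logits : list of per-token lists, one router logit per expert
    """
    return coeff * sum(logsumexp(row) ** 2 for row in logits) / len(logits)


small = router_z_loss([[0.0, 0.0, 0.0, 0.0]])
large = router_z_loss([[20.0, 0.0, 0.0, 0.0]])
print(small, large)  # large logits are penalized much more heavily
```

The penalty leaves the ranking of experts untouched and only discourages the logits from drifting to magnitudes where rounding error can flip the top-\(k\).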

The failure modes are worth naming because each bites at a different time. Routing collapse funnels most tokens to a few experts and leaves the rest dead; it is exactly what load balancing exists to prevent, and it returns if balance is mis-weighted or disabled. Load imbalance short of collapse overflows some experts’ capacity, which silently discards the expert compute for the dropped tokens and adds noise, and it makes every other device wait on the most heavily loaded expert’s device, a systems cost paid in Chapter 8. Expert under-training leaves rarely chosen experts too few tokens to specialize, the same risk upcycling runs when routing pressure is too weak. For state-space and hybrid stacks the implementation reality is younger and thinner: the recurrence needs a hardware-efficient scan to be fast, the few attention layers in a hybrid still carry a key-value cache, and the tooling around these models is less mature than the dense-transformer path, which is itself a reason the frontier has been slow to leave attention behind.
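Each MoE failure mode above leaves a signature in per-step routing counts, so a run can log a few cheap statistics. The sketch is one hypothetical way to do that; the metric names and the `dead_threshold` default are assumptions, not an established convention.

```python
def routing_health(tokens_per_expert, dropped, total_assignments, dead_threshold=0.01):
    """Summarize routing balance for one step.

    tokens_per_expert : tokens each expert actually processed
    dropped           : assignments discarded for exceeding capacity
    total_assignments : processed plus dropped assignments
    """
    n = len(tokens_per_expert)
    routed = sum(tokens_per_expert)
    fractions = [c / routed if routed else 0.0 for c in tokens_per_expert]
    uniform = 1.0 / n
    return {
        # 1.0 is perfectly balanced; n means one expert takes everything.
        "max_over_uniform": max(fractions) / uniform,
        # Experts receiving under dead_threshold of their fair share.
        "dead_experts": [i for i, f in enumerate(fractions) if f < dead_threshold * uniform],
        "drop_rate": dropped / total_assignments if total_assignments else 0.0,
    }


print(routing_health([90, 10, 0, 0], dropped=5, total_assignments=105))
```

A rising `max_over_uniform` is the early sign of collapse, a persistent `dead_experts` list marks under-training, and `drop_rate` measures how much expert compute the capacity cap is discarding.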

7.6 Further reading

Shazeer et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” 2017 (ICLR). Origin of top-k gating and the auxiliary balancing loss. arXiv:1701.06538
Lepikhin et al., “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding,” 2020. arXiv:2006.16668
Fedus et al., “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,” 2021 (JMLR 2022). Top-1 routing at trillion-parameter scale. arXiv:2101.03961
Du et al., “GLaM: Efficient Scaling of Language Models with Mixture-of-Experts,” 2022 (ICML). arXiv:2112.06905
Zhou et al., “Mixture-of-Experts with Expert Choice Routing,” 2022 (NeurIPS). Balance by construction. arXiv:2202.09368
Zoph et al., “ST-MoE: Designing Stable and Transferable Sparse Expert Models,” 2022. Router z-loss, stability. arXiv:2202.08906
Komatsuzaki et al., “Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints,” 2022 (ICLR 2023). arXiv:2212.05055
Jiang et al., “Mixtral of Experts,” 2024. arXiv:2401.04088
Dai et al., “DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models,” 2024. Fine-grained + shared experts. arXiv:2401.06066
DeepSeek-AI, “DeepSeek-V3 Technical Report,” 2024. Auxiliary-loss-free (bias-based) load balancing at scale. arXiv:2412.19437
Lewis et al., “BASE Layers: Simplifying Training of Large, Sparse Models,” 2021 (ICML). Routing as assignment, for contrast. arXiv:2103.16716
Wang et al., “Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts,” 2024. The standalone loss-free balancing method, distinct from the DeepSeek-V3 report. arXiv:2408.15664
Gu et al., “Efficiently Modeling Long Sequences with Structured State Spaces” (S4), 2021. arXiv:2111.00396
Gu and Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” 2023. arXiv:2312.00752
Dao and Gu, “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality” (Mamba-2), 2024. arXiv:2405.21060
Lieber et al., “Jamba: A Hybrid Transformer-Mamba Language Model,” 2024. arXiv:2403.19887