32  Mechanistic Interpretability

A trained transformer is a few hundred billion numbers that produce behavior no one wrote down. Mechanistic interpretability is the attempt to read those numbers as computation: to find the features a model represents and the circuits that connect them, the way one might reverse engineer a binary into source. By the end of this chapter a reader can explain why a single neuron rarely means one thing, why sparse dictionaries became the field’s main tool for pulling features apart, and why the question of whether that tool is the right one is still open.

32.1 Problem

The natural unit of a neural network is the neuron, and the natural hope is that each neuron means something: this one fires on Python code, that one on the color red. Early vision work found a few neurons like that. Language models do not cooperate. Probe a neuron in a transformer and it typically fires on a grab bag of unrelated things: a snippet of code, a passage about car wheels, and a line of French at once. This is polysemanticity, and it is the wall that stops naive interpretation. If the unit you can read does not carry a single meaning, reading it tells you little.

The problem is not that the model has no concepts. It is that the model has far more concepts than it has neurons, and it stores them anyway. A residual stream of width \(d\) has only \(d\) orthogonal directions, yet the model seems to track tens of thousands of distinct features. Something has to give. The features cannot each get a private direction, so they share. The job of interpretability is to undo that sharing: to recover the concepts the model uses from the activations that smear them together, and then to show how those concepts feed each other to produce an output. The payoff is the input to Chapter 33: oversight that can only watch behavior is blind to a model that behaves until it does not, whereas oversight that can read features has a chance of catching a plan before it executes.

32.2 Design

The core idea has two halves, a hypothesis about how models store features and a method for recovering them.

The hypothesis is superposition. A model represents more features than it has dimensions by assigning each feature a direction in activation space and letting those directions overlap. This works because features are sparse: on any given token, only a handful of the thousands of possible features are active, so the model can tolerate interference between directions that are rarely on at the same time. Elhage et al. (2022), in “Toy Models of Superposition,” built the cleanest demonstration: a tiny autoencoder forced to pack \(n\) sparse features into fewer than \(n\) dimensions does exactly this, arranging the feature directions into regular geometric structures and trading reconstruction error against interference as a function of sparsity. Polysemanticity, in this picture, is not a defect. It is the visible shadow of a model using superposition to fit more features than it has neurons.
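The tolerance for interference can be checked numerically. The sketch below is not the Elhage et al. setup, which trains a small model; it assumes random unit directions as an illustrative stand-in. It measures how much \(m\) feature directions overlap in \(d\) dimensions and how accurately a set of simultaneously active features can be read back by projection. The sizes and the function names are arbitrary choices made for this example.

```python
import numpy as np

def random_feature_directions(m, d, seed=0):
    """m random unit vectors in R^d (an assumed stand-in for learned features)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, d))
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def readout_error(W, k_active, seed=1):
    """Superpose k_active features, read all m back by dot product,
    and return the worst-case absolute readout error."""
    rng = np.random.default_rng(seed)
    m = W.shape[0]
    f = np.zeros(m)
    active = rng.choice(m, size=k_active, replace=False)
    f[active] = 1.0
    x = f @ W          # activation vector in R^d
    f_read = W @ x     # naive linear readout of every feature
    return np.max(np.abs(f_read - f))

d, m = 256, 1024                      # four times more features than dimensions
W = random_feature_directions(m, d)
overlaps = W @ W.T - np.eye(m)        # cosine similarity between distinct features
print("max |cos| between distinct features:", np.abs(overlaps).max())
for k_active in (1, 8, 64, 512):
    print(k_active, "active features -> readout error", readout_error(W, k_active))
```

The readout error should grow as more features are active at once, which is the sense in which sparsity makes superposition affordable.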

If features live along directions rather than on neurons, the right unit of analysis is a direction, and the right tool is one that finds the directions. That tool is dictionary learning, implemented as a sparse autoencoder (SAE). An SAE takes a model activation \(x \in \mathbb{R}^{d}\) and learns an overcomplete dictionary of \(m \gg d\) feature directions, encoding \(x\) as a sparse vector of feature activations and decoding it back:

\[ f = \sigma(W_{\text{enc}} x + b_{\text{enc}}), \qquad \hat{x} = W_{\text{dec}} f + b_{\text{dec}} \]

where \(\sigma\) is an elementwise nonlinearity (typically a ReLU), \(W_{\text{enc}} \in \mathbb{R}^{m \times d}\), and \(W_{\text{dec}} \in \mathbb{R}^{d \times m}\). The SAE is trained to minimize reconstruction error plus a sparsity penalty so that only a few of the \(m\) features fire on any input. The columns of \(W_{\text{dec}}\) are the recovered feature directions. Because \(m\) exceeds \(d\), the dictionary is overcomplete: it has room to give each underlying feature its own slot even though the model had to pack them into \(d\) crowded dimensions. The features the SAE recovers are far more often monosemantic than the raw neurons, which is the whole point.
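As a concrete reading of the equation, here is a minimal NumPy sketch of one forward pass and an L1-penalized loss. It assumes \(\sigma\) is a ReLU and uses a hypothetical coefficient `l1_coeff`; the weights are random, so the block shows shapes and the objective rather than a trained dictionary. It keeps the equation's convention in which the columns of `W_dec` are the feature directions.

```python
import numpy as np

def sae_forward(X, W_enc, b_enc, W_dec, b_dec):
    """X: (batch, d). W_enc: (m, d). W_dec: (d, m). Returns features and reconstruction."""
    f = np.maximum(X @ W_enc.T + b_enc, 0.0)   # ReLU assumed for sigma
    X_hat = f @ W_dec.T + b_dec                # each column of W_dec is one feature direction
    return f, X_hat

def sae_loss(X, X_hat, f, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty on the features."""
    recon = np.mean(np.sum((X_hat - X) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity, recon, sparsity

rng = np.random.default_rng(0)
d, m, batch = 16, 64, 8                        # overcomplete: m > d
X = rng.standard_normal((batch, d))
W_enc = 0.1 * rng.standard_normal((m, d))
b_enc = np.zeros(m)
W_dec = 0.1 * rng.standard_normal((d, m))
b_dec = np.zeros(d)

f, X_hat = sae_forward(X, W_enc, b_enc, W_dec, b_dec)
loss, recon, sparsity = sae_loss(X, X_hat, f, l1_coeff=1e-3)
print(f.shape, X_hat.shape, loss)
```

Raising `l1_coeff` pushes training toward fewer active features at the cost of reconstruction, which is the trade the sparsity penalty encodes.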

Features alone are nouns. To explain behavior you need verbs, the connections that carry one feature’s activation into another. That is the circuit. A circuit is a subgraph of the model’s computation, a small set of features and the weights between them that together implement a recognizable algorithm. The earliest worked example outside vision was the induction head, a two-attention-head circuit that implements the rule “if the token \(A\) was followed by \(B\) earlier in the context, and I just saw \(A\) again, predict \(B\).” Olsson et al. (2022) traced this circuit and tied its formation to a phase change in training that coincides with the onset of in-context learning. The agenda of the field is to do for arbitrary behavior what induction heads did for copying: name the features, name the wires, check the explanation by intervening.
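The rule itself is easy to state as code. The function below is a behavioral sketch only: it implements the input-output rule on a list of tokens, not the attention-head mechanism that realizes it inside a model. The name `induction_predict` is hypothetical.

```python
def induction_predict(tokens):
    """If the last token appeared earlier in the context, return the token that
    followed its most recent earlier occurrence; otherwise return None."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_predict(["A", "B", "C", "A"]))   # the earlier "A" was followed by "B"
print(induction_predict(["x", "y", "z"]))        # "z" has not been seen before
```

A real induction circuit produces a soft prediction over the vocabulary rather than a hard lookup, so this captures the algorithm's intent, not its numerical behavior.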

32.3 Evolution

The lineage runs from hand-traced circuits to learned features to causal graphs, each step forced by the limits of the last.

It began in vision. Olah et al., in the Distill article “Zoom In: An Introduction to Circuits” (2020), which opened the Circuits thread, argued that features are real and circuits connecting them are real, and showed both by hand in convolutional networks. Carrying this to transformers needed an algebra for the architecture. Elhage et al. (2021), “A Mathematical Framework for Transformer Circuits,” supplied it: the residual stream as a shared communication channel that every layer reads from and writes to, attention heads as operations that move information between positions, and the composition rules that let heads form multi-step circuits. Induction heads were the first big result that framework explained.

Then the wall: hand-tracing does not scale, and superposition means the neurons you would trace are polysemantic anyway. Elhage et al. (2022) named the obstacle. The escape was to stop reading neurons and start learning features. Cunningham et al. (2023), “Sparse Autoencoders Find Highly Interpretable Features in Language Models” (arXiv:2309.08600), and Anthropic’s “Towards Monosemanticity” (Bricken et al., 2023) landed the same idea at the same time: train an SAE on a model’s activations and the recovered dictionary is far more interpretable than the neuron basis. Bricken et al. worked on a one-layer transformer with a 512-neuron MLP and showed that dictionary features picked out crisp concepts, such as Arabic script or DNA sequences, that no single neuron cleanly held.

The open question was whether this survived scale. “Scaling Monosemanticity” (Templeton et al., 2024) carried SAEs up to Claude 3 Sonnet, a production model, training dictionaries on its middle-layer residual stream and recovering abstract, multilingual, multimodal features, including the Golden Gate Bridge feature that, when clamped on, produced the “Golden Gate Claude” that steered every conversation toward the bridge. The intervention mattered more than the demo: it showed the features were causal, not just correlated. In parallel, the engineering of SAEs tightened. Gao et al. (2024), “Scaling and Evaluating Sparse Autoencoders” (arXiv:2406.04093), replaced the soft L1 penalty with a top-\(k\) activation that fixes the number of live features directly, which removed a tuning headache, cut the population of dead latents, and produced clean scaling laws relating dictionary size and sparsity to reconstruction.

The latest step puts the nouns and verbs back together. Anthropic’s 2025 work, “Circuit Tracing” (Ameisen et al.) and “On the Biology of a Large Language Model” (Lindsey et al.), builds attribution graphs: it replaces a model’s MLPs with cross-layer transcoders, sparse interpretable features that read from one layer and write to all later ones, then traces the linear causal pathways through those features for a single prompt. The result is a graph whose nodes are features and whose edges are causal influences, a local circuit diagram for one input. Studied on Claude 3.5 Haiku, these graphs surfaced multi-step reasoning, planning ahead in rhymes, and shared machinery across languages.

32.4 Trade-offs

Every gain in this chapter is bought with a matching cost.

  • Monosemanticity versus completeness. Sparsity buys clean, single-meaning features, but a dictionary recovers only the features it was sized and trained to find. Push sparsity too hard and features split or get absorbed into each other; relax it too far and they go polysemantic again. The SAE reconstructs the activation imperfectly, and the gap, the reconstruction error, is exactly the part of the model’s computation the dictionary failed to explain.
  • Faithfulness versus interpretability. A transcoder or SAE that replaces part of the model is interpretable only to the degree it stays faithful to what the original computed. Attribution graphs make this trade explicit: they freeze attention and approximate MLPs, so the graph is a readable story about a model that is not quite the one you deployed.
  • Local versus global. An attribution graph explains one prompt. A feature dictionary describes one model at one training checkpoint. Neither yet gives a global, reusable account of what the model does on inputs you have not traced, which is what oversight ultimately wants.
  • Effort versus coverage. Reading a circuit is slow human work. Auto-interpretation, using a model to label features, scales the labor but imports the labeler’s errors, and the field has no settled metric for when a feature has been correctly understood.

Important: What’s contested

Whether the sparse autoencoder is the right interpretability primitive, or an artifact-prone detour, is actively debated. The case against has several named results. Kantamneni et al. (2025), “Are Sparse Autoencoders Useful? A Case Study in Sparse Probing” (arXiv:2502.16681), find that on downstream probing tasks SAE features rarely beat simple baselines once those baselines are made strong, which questions the practical payoff. Chanin et al. (2024) document feature absorption, where a general feature like “starts with L” goes silent on a specific token like “lion” because a more specific feature swallowed it, so the dictionary’s apparent cleanliness is partly an artifact of the objective. Others argue the L1 sparsity penalty pushes SAEs to learn frequent combinations rather than the atomic features that actually mediate computation, and that transcoders outperform SAEs on interpretability (arXiv:2501.18823). The defenders, centered on Anthropic’s interpretability team and the authors of the scaling and attribution work, counter that SAEs are improvable engineering, not a dead end, that causal interventions like Golden Gate Claude show the features are real, and that attribution graphs already produce mechanistic accounts that bare probing cannot. The disagreement is not whether superposition is real. It is whether dictionary learning is how we should be reading it.

Tip: Constraint arrow

Superposition arises in the architecture of Chapter 6, and it dictates the shape of every tool in this chapter. Because the residual stream packs more features than it has dimensions, no method that reads neurons one at a time can cleanly separate them, and the field is pushed toward overcomplete dictionary learning. A property of how the model stores information sets the entire method stack of how we read it.

32.5 Implementation

The working loop is mechanical and worth seeing in outline. Collect a large sample of activations at one site in the model, the residual stream after a chosen layer or an MLP’s hidden activations. Train the dictionary on them. With a top-\(k\) encoder the forward pass is small:

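import torch

# PyTorch assumed. x: (batch, d); W_enc: (d, m); W_dec: (m, d); m >> d
pre = x @ W_enc + b_enc                            # (batch, m)
vals, idx = torch.topk(pre, k, dim=-1)             # k largest pre-activations per row
f = torch.zeros_like(pre).scatter(-1, idx, vals)   # keep those k, zero the rest
x_hat = f @ W_dec + b_dec                          # reconstruct, shape (batch, d)
loss = torch.nn.functional.mse_loss(x_hat, x)      # no separate L1 term: k fixes sparsity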

The fixed \(k\) removes the sparsity coefficient that the L1 formulation forces you to tune, which is a main practical reason top-\(k\) SAEs gained ground on the older soft penalty. After training, each dictionary direction (a column of \(W_{\text{dec}}\) in the notation of Section 32.2, a row in the row-vector code above) is a candidate feature. You interpret it by gathering the inputs that activate it most, reading the pattern, and, crucially, testing the reading by clamping the feature on or off and watching the output move. A feature you cannot steer with is a feature you have not understood. Attribution graphs extend the same discipline to a whole computation: swap the MLPs for trained transcoders, run one prompt, and read off the causal graph among active features, then verify edges by perturbing nodes and confirming the downstream effect.
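The encode, clamp, decode loop can be sketched end to end in NumPy. This is an illustration under stated assumptions: the weights are random rather than trained, `topk_activation` and `clamp_feature` are hypothetical helper names, and the final step of a real experiment (substituting the edited activation back into the model and reading the change in its output) is omitted. The sketch keeps raw top-\(k\) values; implementations may additionally apply a ReLU.

```python
import numpy as np

def topk_activation(pre, k):
    """Keep the k largest entries in each row of pre and zero the rest."""
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]          # indices of the k largest
    f = np.zeros_like(pre)
    np.put_along_axis(f, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return f

def encode(x, W_enc, b_enc, k):
    return topk_activation(x @ W_enc + b_enc, k)              # (batch, m)

def decode(f, W_dec, b_dec):
    return f @ W_dec + b_dec                                  # (batch, d)

def clamp_feature(f, j, value):
    """Return a copy of f with feature j forced to a fixed value on every input."""
    out = f.copy()
    out[:, j] = value
    return out

rng = np.random.default_rng(0)
d, m, k, batch = 16, 64, 4, 8
x = rng.standard_normal((batch, d))
W_enc = rng.standard_normal((d, m))
b_enc = np.zeros(m)
W_dec = rng.standard_normal((m, d))      # row j is the direction of feature j
b_dec = np.zeros(d)

f = encode(x, W_enc, b_enc, k)
x_hat = decode(f, W_dec, b_dec)

j = 3                                     # arbitrary feature to steer with
x_steer = decode(clamp_feature(f, j, 5.0), W_dec, b_dec)
print("active features per input:", (f != 0).sum(axis=1))
print("shift is along feature j:", np.allclose(x_steer - x_hat, (5.0 - f[:, [j]]) * W_dec[j]))
```

Because the decoder is linear, clamping a feature moves the reconstruction along that feature's direction and nowhere else; whether the model's behavior then moves as the label predicts is the actual test of the interpretation.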

The failure modes are specific. Dead latents are dictionary features that never fire and learn nothing, wasting capacity. Over-firing latents sit at the other extreme: they activate on nearly every input and so carry little specific meaning. Together the two push you toward top-\(k\) and careful initialization. Feature splitting hands you several narrow features where the model had one, and absorption hides a general feature inside a specific one, so the dictionary looks cleaner than the model is. And the reconstruction error is not noise to be ignored: it is the unexplained residue, the part of the computation your dictionary did not capture, and a method that reports high interpretability while leaving large error has explained less than it claims.
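Some of these failure modes can be monitored with simple summary statistics. The sketch below assumes a matrix of feature activations `F` (samples by \(m\)) together with the original activations `X` and their reconstructions `X_hat`. The function names and the 0.5 threshold for an over-firing latent are illustrative assumptions, not standard values, and splitting and absorption are not detectable this way.

```python
import numpy as np

def firing_rates(F):
    """Fraction of samples on which each feature is nonzero."""
    return (F != 0).mean(axis=0)

def dead_latents(F):
    """Indices of features that never fire on this sample."""
    return np.flatnonzero(firing_rates(F) == 0.0)

def dense_latents(F, threshold=0.5):
    """Indices of features that fire on more than `threshold` of samples (assumed cutoff)."""
    return np.flatnonzero(firing_rates(F) > threshold)

def fraction_variance_unexplained(X, X_hat):
    """Residual sum of squares divided by the total variance of X around its mean."""
    resid = np.sum((X - X_hat) ** 2)
    total = np.sum((X - X.mean(axis=0)) ** 2)
    return resid / total

# Tiny hand-built example: feature 0 never fires, feature 2 fires on every sample.
F = np.array([[0.0, 1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0, 0.0],
              [0.0, 0.0, 1.0, 4.0],
              [0.0, 0.0, 5.0, 0.0]])
X = np.array([[1.0, 2.0], [3.0, 0.0], [5.0, 4.0], [7.0, 2.0]])
X_mean = np.tile(X.mean(axis=0), (X.shape[0], 1))   # a reconstruction that predicts only the mean

print("dead:", dead_latents(F), "dense:", dense_latents(F))
print("FVU, perfect reconstruction:", fraction_variance_unexplained(X, X))
print("FVU, mean-only reconstruction:", fraction_variance_unexplained(X, X_mean))
```

Reporting the fraction of variance unexplained alongside any interpretability claim makes the size of the unexplained residue visible.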

32.6 Further reading