37 The Economics of AI

The whole stack is a way to spend money, and the bill has a shape. By the end of this chapter a reader can explain where compute is bought, why training and inference are two different kinds of cost, when to build a model versus buy one through an API, and why the claim that inference dominates the lifetime bill reaches back up into the architecture and scaling choices of Chapter 3.

37.1 Problem

Every decision in the earlier chapters resolves, eventually, into a number on an invoice. A parameter count is a training bill. An attention variant is a key-value cache, which is memory rented by the hour. A sampling budget is a column on a serving ledger. The problem this chapter owns is the one a reader has been able to feel but not yet name: given that compute is the scarce input, how is it priced, how does the price split between building a model and running it, and how does that split feed back into what you should build.

Three constraints make the question hard. Compute is rented in a market with volatile prices, so the cost of a FLOP is a moving target, not a constant. Training is paid once and inference is paid forever, so the two costs grow on different clocks. And the cheapest answer is often to buy capability from someone else, which means the build decision competes against a falling market price you do not control.

37.2 Design

The organizing idea is that an AI system has two cost structures, not one, and they behave differently enough that conflating them produces wrong decisions.

Training is a capital expense. It is large, lumpy, and paid before any value is returned: you commit a cluster for weeks, spend a fixed sum, and get a set of weights. The spend is dominated by the final run plus the research and ablation runs that found the recipe. Per Epoch AI’s analysis of roughly 45 frontier models, the amortized hardware and energy cost of the final training run has grown about 2.4 times per year since 2016, with hardware accounting for the largest single share, R&D staff a comparable share, and energy a small remainder. The headline figures are dated and directional: GPT-4 around 40 million dollars, Gemini Ultra around 30 million dollars, and, on that trend, the largest runs crossing a billion dollars around 2027.
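The extrapolation to a billion-dollar run is compound-growth arithmetic and can be checked directly. The sketch below takes the roughly 40-million-dollar GPT-4 figure and the 2.4 times per year growth rate quoted above; treating 2023 as the base year is an assumption made for illustration, and `years_to_reach` is a hypothetical helper name.

```python
import math


def years_to_reach(start_cost: float, target_cost: float, growth_per_year: float) -> float:
    """Years for a cost growing by a constant annual factor to reach a target."""
    if start_cost <= 0 or target_cost <= 0:
        raise ValueError("costs must be positive")
    if growth_per_year <= 1:
        raise ValueError("growth_per_year must exceed 1 for the cost to rise")
    # Solve start * growth**t = target for t.
    return math.log(target_cost / start_cost) / math.log(growth_per_year)


# Figures quoted in the text; the 2023 base year is an illustrative assumption.
BASE_YEAR = 2023
years = years_to_reach(start_cost=40e6, target_cost=1e9, growth_per_year=2.4)
print(f"about {years:.1f} years after {BASE_YEAR}")
```

Under these assumptions the threshold is crossed a little under four years after the base year, which is consistent with the 2027 estimate.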

Inference is an operating expense. It is small per request and paid every time the model is called, so it scales with usage rather than with the training budget. A useful first-order model of its unit cost is

\[ \text{cost per token} \approx \frac{\text{accelerator } \$/\text{hour}} {\text{throughput (tokens/hour)} \times \text{utilization}} \]

which says the three levers on inference cost are the rented price of the accelerator, the throughput the serving stack extracts from it, and how fully the hardware is kept busy. Every chapter in Part VI is, read through this equation, an attack on the denominator: Chapter 17 raises utilization by batching, Chapter 18 and Chapter 19 raise throughput, and the attention variants of Chapter 7 shrink the per-token work.
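The equation can be sketched in a few lines. The function name `cost_per_token` and every input value below are hypothetical, chosen only to show how the three levers interact; real values come from a provider's price list and from measured serving throughput.

```python
def cost_per_token(price_per_hour: float, tokens_per_hour: float, utilization: float) -> float:
    """First-order unit cost: hourly price / (throughput * utilization)."""
    if price_per_hour < 0:
        raise ValueError("price_per_hour must be non-negative")
    if tokens_per_hour <= 0:
        raise ValueError("tokens_per_hour must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price_per_hour / (tokens_per_hour * utilization)


# Hypothetical inputs: a 2 dollar/hour accelerator, 3.6M tokens/hour, half busy.
base = cost_per_token(price_per_hour=2.0, tokens_per_hour=3_600_000, utilization=0.5)
# Doubling throughput at the same price and utilization halves the unit cost.
faster = cost_per_token(price_per_hour=2.0, tokens_per_hour=7_200_000, utilization=0.5)
# Doubling utilization has the same effect as doubling throughput.
busier = cost_per_token(price_per_hour=2.0, tokens_per_hour=3_600_000, utilization=1.0)

print(f"base:   {base * 1e6:.2f} dollars per million tokens")
print(f"faster: {faster * 1e6:.2f} dollars per million tokens")
print(f"busier: {busier * 1e6:.2f} dollars per million tokens")
```

Throughput and utilization enter the denominator symmetrically, so either lever buys the same reduction; the price lever is the only one set outside the serving stack.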

The compute market sets the numerator. Accelerators are rented by the hour across hyperscalers and a tier of specialist GPU clouds, with three price levels: reserved capacity at a discounted rate in exchange for a commitment, on-demand at a premium for flexibility, and spot at the deepest discount in exchange for preemption. As of 2025 and into 2026 an H100 rents on demand for low single-digit dollars per GPU-hour, and far less on spot, after large provider price cuts through 2025 loosened a supply crunch. The directional fact matters more than any single quote: the price of a FLOP falls over time, and a serving plan written against last year’s price is too pessimistic by a margin that compounds.

37.3 Evolution

The build-versus-buy decision has inverted over the life of the field, and the inversion is the most consequential economic shift in this chapter.

Early on, capability lived only inside a few labs, so building was the only way to get a frontier model and buying was not an option. As hosted APIs matured, buying became possible, and then a second curve took over: the price of a token of inference has fallen by roughly an order of magnitude per year. A capability that cost tens of dollars per million tokens at the GPT-4 launch in 2023 costs a small fraction of a dollar for comparable quality a few years later. The release of strong open-weight models, DeepSeek-V3 among them, accelerated this by giving every closed provider a low-priced reference point to justify its own prices against, and triggered repeated public price cuts across the market.

Two curves now run in opposite directions. Training cost rises about 2.4 times per year at the frontier; API inference price falls by roughly an order of magnitude per year. The scissors between them is why the default answer drifted from build to buy. Building a frontier model from scratch competes against a market price that is collapsing under you, so building only pays when something the market cannot sell you is at stake: proprietary data that cannot leave your boundary, a latency or availability floor the API cannot meet, a customization the hosted model will not accept, or an inference volume large enough that self-hosting beats the per-token markup.

The arithmetic of building has itself moved. DeepSeek-V3, a 671-billion parameter mixture-of-experts model, reports a final training run of about 2.788 million H800 GPU-hours, roughly 5.6 million dollars at an assumed two dollars per GPU-hour. The number is striking, and the report is careful to say what it excludes: the prior research, the architecture and data ablations, and everything that is not the one final run. The lesson for a build decision is that the headline training number is the visible tip; the research that found the recipe is the submerged cost, and it does not appear on the run’s invoice.
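The headline figure is a single multiplication, and it is only as firm as the assumed rental rate. The sketch below reproduces it from the reported GPU-hours; the 1 and 4 dollar rates are hypothetical alternatives added to show the sensitivity, and `final_run_cost` is an illustrative name.

```python
def final_run_cost(gpu_hours: float, price_per_gpu_hour: float) -> float:
    """Rental cost of one training run; excludes research and ablation runs."""
    return gpu_hours * price_per_gpu_hour


REPORTED_GPU_HOURS = 2_788_000  # H800 GPU-hours for the final run, as reported

# 2.0 is the rate assumed in the text; 1.0 and 4.0 are hypothetical alternatives.
for rate in (1.0, 2.0, 4.0):
    cost = final_run_cost(REPORTED_GPU_HOURS, rate)
    print(f"{rate:.2f} dollars/GPU-hour -> {cost / 1e6:.3f} million dollars")
```

The result scales linearly with the rate, so the same run priced at a different rental rate moves the headline proportionally while leaving the excluded research cost untouched.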

What’s contested

Whether inference dominates the lifetime bill is a live debate, not a settled fact. The standard case is strong: training is paid once, inference is paid per request across the model’s whole deployed life, and for a widely served model the operating expense overtakes the capital expense by a large margin. But reasoning models and test-time compute, the subject of Chapter 15, complicate the balance from the inference side rather than resolving it. A single request that spends a long internal chain of thought can cost orders of magnitude more than a direct answer, which inflates the per-request cost and pushes more of the total spend toward inference, not toward training. Whether the dominant cost is many cheap calls or fewer expensive ones is unsettled and product-specific. Treat the split as a measured property of a given deployment, not a universal ratio.
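A rough sketch shows how a chain of thought multiplies per-request cost. It assumes reasoning tokens are billed at the same per-token price as answer tokens, which is a simplification; the token counts and the price are hypothetical.

```python
def request_cost(answer_tokens: int, reasoning_tokens: int, price_per_token: float) -> float:
    """Cost of one request, assuming all generated tokens share one price."""
    if answer_tokens < 0 or reasoning_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (answer_tokens + reasoning_tokens) * price_per_token


PRICE = 10e-6  # hypothetical: 10 dollars per million generated tokens

direct = request_cost(answer_tokens=200, reasoning_tokens=0, price_per_token=PRICE)
reasoned = request_cost(answer_tokens=200, reasoning_tokens=20_000, price_per_token=PRICE)

print(f"direct answer:   {direct:.4f} dollars")
print(f"with reasoning:  {reasoned:.4f} dollars")
print(f"cost multiplier: {reasoned / direct:.0f}x")
```

With these made-up counts the reasoning request costs about a hundred times the direct one, so a deployment's bill depends on the mix of request types as much as on the number of requests.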

Constraint arrow

Chapter 3 already named the arrow that runs from serving cost up into model sizing: a high lifetime inference volume justifies over-training a smaller model. The economics close a second arrow that runs the other way. The unit cost of a served token, set by the market price of the accelerator divided by the utilized throughput of the serving stack, is what sets the break-even volume in the build-versus-buy decision. Below that volume the falling API price wins and you buy; above it the markup you pay a provider, accumulated over every token, exceeds the fixed cost of running your own capacity and you build. The inference equation, not a strategy preference, is what decides which side of the line a workload sits on.

37.4 Trade-offs

Each economic choice is a balance with a knee.

Capital versus operating cost. Spending more to train, by over-training a smaller model past Chinchilla-optimal in the sense of Chapter 3, lowers every future inference bill. The trade pays only if the lifetime token volume is large enough to amortize the extra training spend, so the right point depends on a usage forecast you make before you have the usage.
Build versus buy. Buying through an API converts a capital expense into a per-token operating expense and rides the falling market price, at the cost of control over data, latency, and customization. Building converts it back into a capital expense plus a fixed serving footprint, which wins only past a break-even volume or where a non-price constraint forces it.
Reserved versus spot capacity. Reserved compute is cheaper per hour and predictable, which fits steady production serving. Spot is cheaper still but can be preempted, which fits interruptible work: batch inference, checkpointed training, hyperparameter sweeps. Matching the workload’s interruption tolerance to the pricing tier is most of the savings.
Quality versus unit cost. A larger or reasoning-heavy model answers better and costs more per request. The economically right model is the smallest one that clears the task’s quality bar, which is why routing a workload across a ladder of model sizes often beats sending everything to the largest.
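The routing claim in the last trade-off can be made concrete with a traffic-weighted average. The sketch below assumes each request has already been sent to the smallest model that clears the quality bar; the traffic shares and per-request costs are hypothetical, and `blended_cost_per_request` is an illustrative name.

```python
import math


def blended_cost_per_request(tiers: list[tuple[float, float]]) -> float:
    """Traffic-weighted average cost per request across a ladder of models.

    Each tier is (share_of_traffic, cost_per_request); shares must sum to 1.
    """
    total_share = sum(share for share, _ in tiers)
    if not math.isclose(total_share, 1.0):
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost for share, cost in tiers)


# Hypothetical ladder: (share of traffic, dollars per request).
ladder = [
    (0.7, 0.001),  # small model handles the easy majority
    (0.2, 0.010),  # mid-size model
    (0.1, 0.050),  # largest or reasoning-heavy model for the hard tail
]
all_large = blended_cost_per_request([(1.0, 0.050)])
blended = blended_cost_per_request(ladder)

print(f"everything to the largest model: {all_large:.4f} dollars per request")
print(f"routed across the ladder:        {blended:.4f} dollars per request")
```

The saving depends entirely on how much traffic the small tiers can absorb without failing the quality bar, which is a measurement, not an input the economics can supply.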

37.5 Implementation

The economics become operational through two small calculations a team actually runs.

The first is the inference unit cost above. To turn it into a number you need the accelerator’s hourly price from the compute market, the serving stack’s measured throughput in tokens per hour, and the realized utilization. The same equation explains why the optimizations of Part VI pay for themselves directly: doubling throughput or utilization halves the cost per token, and the per-token cost is what every downstream margin is computed against.

The second is the build-versus-buy break-even. Let $p$ be the API price per token, $c$ your own marginal cost per token from the equation above, and $F$ the fixed cost of standing up and reserving your own serving capacity. Buying costs $p \times V$ for a volume $V$; building costs $F + c \times V$. Building wins once

\[ V > \frac{F}{p - c} \]

which exists as a positive threshold only when your marginal cost $c$ is below the API price $p$. The break-even moves against you over time because $p$ falls with the market, so a build that pencils out today must be re-checked against next year’s lower API price before the capital is committed.
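A minimal sketch of the break-even check, re-run against a falling API price. The symbols follow the inequality above; the fixed cost, marginal cost, and price path are hypothetical numbers chosen only to show the threshold moving.

```python
import math


def break_even_volume(fixed_cost: float, api_price: float, marginal_cost: float) -> float:
    """Token volume V above which building beats buying: V = F / (p - c).

    Returns math.inf when p <= c, where no volume makes building cheaper.
    """
    margin = api_price - marginal_cost
    if margin <= 0:
        return math.inf
    return fixed_cost / margin


F = 500_000.0  # hypothetical fixed cost of standing up serving capacity, dollars
c = 1e-6       # hypothetical marginal cost: 1 dollar per million tokens

# Hypothetical API price path, in dollars per token, falling year over year.
for p in (10e-6, 5e-6, 2e-6, 1e-6):
    v = break_even_volume(F, p, c)
    label = "never" if math.isinf(v) else f"{v / 1e9:,.0f} billion tokens"
    print(f"API at {p * 1e6:.0f} dollars per million tokens -> break-even: {label}")
```

As $p$ approaches $c$ the threshold grows without bound, which is the numerical form of the warning that a build justified at today's price has to be re-checked before the capital is committed.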

The failure modes are economic, not numerical. Building against a falling buy price is the classic one: a self-hosted plan justified at this year’s API price can be underwater a year later, and the capital is already spent. Provisioning steady reserved capacity for spiky demand wastes the trough; serving spiky demand on pure on-demand or spot pays a premium or risks preemption at the peak. And forecasting inference volume too low at training time yields a model that is larger and less over-trained than its eventual traffic warranted, leaving a permanent per-token cost on the table that no serving optimization fully recovers.
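The spiky-demand failure mode can be sketched as a small search over how much capacity to reserve, with overflow served on demand. The demand profile and both hourly prices are hypothetical, spot preemption is left out for simplicity, and `serving_cost` is an illustrative name.

```python
def serving_cost(demand: list[int], reserved_gpus: int,
                 reserved_price: float, on_demand_price: float) -> float:
    """Cost of a demand profile (GPUs needed in each hour).

    Reserved GPUs are paid for every hour whether used or not; demand above
    the reserved level is served on demand at the higher hourly price.
    """
    reserved_bill = reserved_gpus * reserved_price * len(demand)
    overflow_gpu_hours = sum(max(0, d - reserved_gpus) for d in demand)
    return reserved_bill + overflow_gpu_hours * on_demand_price


# Hypothetical profile: a steady base of 2 GPUs with a two-hour spike to 10.
demand = [2, 2, 2, 2, 10, 10, 2, 2]
RESERVED_PRICE = 1.5   # hypothetical dollars per GPU-hour, committed
ON_DEMAND_PRICE = 3.0  # hypothetical dollars per GPU-hour, flexible

costs = {
    k: serving_cost(demand, k, RESERVED_PRICE, ON_DEMAND_PRICE)
    for k in range(max(demand) + 1)
}
best = min(costs, key=costs.get)

print(f"reserve for the peak ({max(demand)} GPUs): {costs[max(demand)]:.0f} dollars")
print(f"reserve nothing:                {costs[0]:.0f} dollars")
print(f"cheapest plan reserves {best} GPUs: {costs[best]:.0f} dollars")
```

In this toy profile the cheapest plan reserves only the steady base and pays the on-demand premium for the spike; reserving for the peak pays for idle hardware through the trough.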

37.6 Further reading

Cottier et al., “How much does it cost to train frontier AI models?” (Epoch AI), 2024. epoch.ai
Sardana et al., “Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws,” 2024. arXiv:2401.00448
Hoffmann et al., “Training Compute-Optimal Large Language Models” (Chinchilla), 2022. arXiv:2203.15556
DeepSeek-AI, “DeepSeek-V3 Technical Report” (training cost and GPU-hour accounting), 2024. arXiv:2412.19437
Cottier et al., “The rising costs of training frontier AI models” (Epoch AI dataset), 2024. arXiv:2405.21015