35 The Model Landscape
The labs that train frontier models differ less in what they build than in how much of it they show. By the end of this chapter a reader can place any model on the open-to-closed spectrum, name the five things a release can disclose, read a weight license for what it permits, and explain why almost every published training practice traces to a handful of open labs rather than the largest closed ones.
This chapter is adapted from a dated snapshot that named specific competitor labs and ranked their transparency. It has been rewritten to a principle-first framing: it describes categories of openness and training practice, refers to organizations only as their already-public artifacts require, and drops per-organization compute figures and confidence-tier rankings. Re-read this chapter against the current landscape before the book is made public.
35.1 Problem
A practitioner choosing a model, or trying to learn from one, has one axis that predicts almost everything else: openness. Openness decides what you can run, what you can adapt, what you can reproduce, and what you can only guess at. Two models can score the same on a benchmark and sit at opposite ends of this axis, and that difference governs whether you can fine-tune the weights, audit the data, or merely call an endpoint.
The axis matters most where it is most lopsided. The most detailed public knowledge of how frontier models are actually trained does not come from the most capable systems. For the largest closed systems the load-bearing numbers (parameter count, token count, per-run compute, training framework) are not disclosed at all. Anyone reasoning about the closed frontier is extrapolating from the open one. That single fact makes the open labs the field’s primary teaching material, and it is why this chapter is organized around what a release shows rather than around a ranking of who is ahead.
35.2 Design
Openness is not one bit but several, and a model can be open on some and closed on others. A release can disclose, in roughly increasing order of generosity:
- Weights. The trained parameters, downloadable and runnable. This is the line most people mean by “open model.”
- Code. The training and inference code, so the architecture is legible and the run is in principle re-executable.
- Data. The pre-training corpus and mixture, so results can be reproduced and audited rather than trusted.
- Checkpoints. Intermediate states across the run, which turn a single artifact into a study of training dynamics.
- Logs. The dataloader order, hyperparameter schedule, and run telemetry, the raw material for genuine reproducibility.
These define a spectrum rather than a binary. At one end sit fully open efforts that publish all five (weights, data, code, thousands of intermediate checkpoints, and the exact data order) so that training dynamics can be studied and the run reconstructed. In the middle sit open-weights releases that ship weights and a detailed technical report but withhold the data: you can run and fine-tune the model and read how it was built, but you cannot reproduce it. At the closed end sit systems that disclose a methods direction or a systems stack but never the scale numbers, and beyond them systems that disclose effectively nothing about training at all.
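The spectrum above can be sketched as a small classifier over a release's disclosure profile. This is a minimal illustration, not a standard taxonomy: the `Release` schema, the `methods_report` flag, and the four tier names are assumptions made for the example, and the mapping is deliberately coarse (a release with weights, code, and data but no checkpoints or logs still lands in "open weights").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical schema: which artifacts a model release discloses."""
    weights: bool = False
    code: bool = False
    data: bool = False
    checkpoints: bool = False
    logs: bool = False
    methods_report: bool = False  # assumed flag: a report naming methods or systems


def tier(release: Release) -> str:
    """Map a disclosure profile to a coarse tier (tier names are illustrative)."""
    five = (release.weights, release.code, release.data,
            release.checkpoints, release.logs)
    if all(five):
        return "fully open"          # all five artifacts published
    if release.weights:
        return "open weights"        # runnable and tunable, not reproducible
    if release.methods_report:
        return "methods disclosed"   # direction or stack named, scale withheld
    return "undisclosed"             # reachable only through an endpoint


if __name__ == "__main__":
    examples = {
        "all five artifacts": Release(True, True, True, True, True),
        "weights plus report": Release(weights=True, methods_report=True),
        "report only": Release(methods_report=True),
        "nothing": Release(),
    }
    for name, release in examples.items():
        print(f"{name}: {tier(release)}")
```

Keeping the five disclosures as independent flags, rather than a single enum, mirrors the point that a model can be open on some dimensions and closed on others.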
Licensing is a second, legal layer sitting on top of the weights. Open weights under a permissive license such as Apache 2.0 or MIT can be run, modified, and redistributed commercially. Open weights under a custom or source-available license may restrict use by company size, use case, or redistribution. No-weights models can only be reached through an API. The license, not the download link, is what decides whether a weight release can anchor a commercial ecosystem of fine-tunes and derivatives.
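One way to make the licensing layer concrete is to reduce a license to the handful of terms a product team actually checks. The sketch below is a simplification of real license text: `LicenseTerms`, `can_ship_derivative`, the size threshold, and the excluded use case are all hypothetical and do not describe any specific published license.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class LicenseTerms:
    """Hypothetical, simplified summary of what a weight license permits."""
    weights_available: bool
    commercial_use: bool = False
    redistribution: bool = False
    max_company_users: Optional[int] = None        # size gate, if the license has one
    excluded_use_cases: FrozenSet[str] = frozenset()


def can_ship_derivative(terms: LicenseTerms, company_users: int,
                        use_case: str) -> Tuple[bool, str]:
    """Return (allowed, reason) for shipping a commercial fine-tune."""
    if not terms.weights_available:
        return False, "no weights: API access only"
    if not terms.commercial_use:
        return False, "commercial use not permitted"
    if not terms.redistribution:
        return False, "redistribution of derivatives not permitted"
    if (terms.max_company_users is not None
            and company_users > terms.max_company_users):
        return False, "company exceeds the license size threshold"
    if use_case in terms.excluded_use_cases:
        return False, f"use case '{use_case}' is excluded"
    return True, "permitted"


# Illustrative profiles; the numbers and use-case names are made up.
PERMISSIVE = LicenseTerms(weights_available=True, commercial_use=True,
                          redistribution=True)
CUSTOM = LicenseTerms(weights_available=True, commercial_use=True,
                      redistribution=True, max_company_users=1_000_000,
                      excluded_use_cases=frozenset({"surveillance"}))
API_ONLY = LicenseTerms(weights_available=False)

if __name__ == "__main__":
    for name, terms in [("permissive", PERMISSIVE), ("custom", CUSTOM),
                        ("api-only", API_ONLY)]:
        print(name, can_ship_derivative(terms, 5_000_000, "chatbot"))
```

The same weights can pass or fail this check depending only on who is asking and for what, which is the sense in which the license, not the download link, decides what can be built.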
```mermaid
flowchart LR
    A[Fully open] -->|weights plus data plus code plus checkpoints plus logs| B[Open weights]
    B -->|weights plus report, data withheld| C[Methods disclosed]
    C -->|systems stack named, scale withheld| D[Closed]
    D -->|API only, training undisclosed| E[API only]
```
35.3 Evolution
Two movements shaped the present landscape.
The first is the rise of the open-weights release as a deliberate strategy rather than an act of charity. Early releases of large model weights came with full technical reports, and the most complete single training papers in the field belong to this open-weights tier. Releasing weights under a permissive license seeds an ecosystem of fine-tunes, derivatives, and tooling that a closed API cannot, and it turns external developers into an extension of the lab. The open-weights tier is now where the richest method disclosures appear, including the first public demonstrations of frontier-scale techniques that the closed labs presumably use but do not document.
The second is the shift in where training happens. The earlier pattern was renting capacity from hyperscale clouds; the current pattern is building dedicated, multi-gigawatt, custom-powered campuses, often through multi-party financing vehicles. This raises the capital floor for training at the frontier and concentrates it among organizations that can fund power and silicon at industrial scale. The hardware itself runs a few generations in parallel: current-generation accelerators carry new runs while the prior generation, still the bulk of deployed capacity, carries the rest. Custom silicon now sits alongside the dominant GPU line, and the most capable closed labs spread runs across more than one accelerator family rather than betting on a single vendor. The specific capacity and hardware figures move fast and are the most perishable claims in the area, which is one reason this chapter states the trend and not the numbers.
The training method itself has a published arc, traceable through open releases: supervised fine-tuning, then preference optimization through reinforcement learning from human feedback, then its more direct successors, then reinforcement learning with verifiable rewards for reasoning. That arc is covered in Chapter 10, Chapter 11, and Chapter 14; what matters here is that each step of it entered the public record through a release that chose to document it.
35.4 Trade-offs
Openness is a balance struck differently by the lab and by the practitioner.
- For the lab: control versus ecosystem. Closing weights keeps a capability proprietary and metered behind an API, protects against misuse and competitive copying, and keeps the training recipe private. Opening weights forgoes that control in exchange for adoption, external contribution, and standard-setting, and it converts a model into a platform others build on.
- For the practitioner: run versus reproduce. Open weights with a closed dataset let you run, fine-tune, and serve a model, but not reproduce or fully audit it. You inherit whatever is in the data without being able to inspect it. Fully open models trade some frontier capability for the ability to study and verify every stage.
- Licensing: permissive versus restricted. Apache or MIT weights can anchor a commercial derivative ecosystem. Custom licenses that gate by company size or use case can foreclose exactly the commercial uses a team needs, and a model that is “open” in the press release may be unusable for a given product once the license is read. No-weights models foreclose the derivative ecosystem entirely.
Whether releasing open weights is net-beneficial is genuinely unsettled. One position holds that open weights are essential: they enable independent safety research, reproducibility, auditing, and a competitive ecosystem that prevents capability from concentrating in a few closed providers. The opposing position holds that releasing frontier weights is irreversible proliferation: once weights are public they can be fine-tuned to remove safety training and cannot be recalled, so the most capable weights should stay closed. Both positions are argued in good faith by serious researchers, and the line of which capabilities are safe to release openly moves with the frontier. This is a values-and-evidence debate, not a settled technical question.
The license, a legal artifact, dictates what the downstream ecosystem can build. A permissive open-weights license is what makes a commercial ecosystem of fine-tunes, distillations, and derivative products possible; a restrictive or no-weights license forecloses it regardless of how capable the model is. A decision that looks like pure model quality is, at the ecosystem layer, set by the text of the license attached to the weights.
35.5 Implementation
The practical payoff of reading the landscape is the catalog of transferable practices the open tier has published. Almost every training “trick” has a citable origin in a public report, and the deep treatment of each lives in an earlier chapter. The pattern to notice is that this catalog is buildable at all only because of the open-weights and fully-open tiers; the closed frontier contributes little to it.
- FP8 training at frontier scale, with sensitive operations kept in higher precision: the first public demonstration of low-precision training that holds relative loss error below 0.25%. See Chapter 19 and 2.
- Auxiliary-loss-free MoE load balancing via per-expert bias terms, with the caveat that a tiny sequence-wise balance loss is retained, so the name slightly overstates it. See Chapter 7.
- Multi-head latent attention (low-rank key-value compression), multi-token prediction (a denser training signal, reusable for speculative decoding), and grouped-query attention (near-MHA quality at near-MQA cost, uptrainable from MHA). See Chapter 6.
- Chinchilla compute-optimal scaling, and the deliberate departure of over-training a smaller model for inference economics. See Chapter 3.
- muP hyperparameter transfer and the warmup-stable-decay schedule that makes continue-then-anneal recipes composable. See Chapter 3.
- Data annealing with a high-quality final phase, whose measured gains are sharply scale-dependent, large on a smaller model and negligible on a much larger one. See
- Web-scale deduplication and quality filtering, and learned data-mixture weights. See 1.
- High-synthetic-fraction alignment data with the generation pipeline released alongside the model. See Chapter 12.
- The RLHF to DPO to RLVR/GRPO post-training arc, and reasoning that emerges from large-scale RL. See Chapter 10, Chapter 11, and Chapter 14.
- 3D parallelism, ZeRO/FSDP, selective activation recomputation, and fast checkpointing as the systems substrate of a multi-week run. See
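The compute-optimal scaling item in the list above reduces to short arithmetic. The sketch below assumes two widely used approximations rather than the fitted scaling laws themselves: training compute of about 6 FLOPs per parameter per token, and a compute-optimal ratio of roughly 20 tokens per parameter. The function names are hypothetical, and both constants are rules of thumb.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumption: C ~= 6 * N * D for dense transformers
TOKENS_PER_PARAM = 20.0      # assumption: Chinchilla-style rule of thumb


def compute_optimal(flops: float) -> tuple:
    """Return (params, tokens) that spend `flops` at ~20 tokens per parameter."""
    # C = 6 * N * D and D = 20 * N  =>  N = sqrt(C / 120)
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


def tokens_for_budget(flops: float, params: float) -> float:
    """Tokens a model of fixed size can see under the same compute budget."""
    return flops / (FLOPS_PER_PARAM_TOKEN * params)


if __name__ == "__main__":
    budget = 5.76e23  # FLOPs, chosen for illustration
    n_opt, d_opt = compute_optimal(budget)
    print(f"compute-optimal: {n_opt:.3g} params, {d_opt:.3g} tokens")

    # Deliberate over-training: spend the same budget on a smaller model.
    n_small = 8e9
    d_small = tokens_for_budget(budget, n_small)
    print(f"over-trained: {n_small:.3g} params, {d_small:.3g} tokens, "
          f"{d_small / n_small:.0f} tokens per parameter")
```

With this budget the compute-optimal split is about 69 billion parameters and 1.4 trillion tokens. Spending the same budget on an 8-billion-parameter model gives about 12 trillion tokens, far past the 20-to-1 ratio. The arithmetic says nothing about the resulting loss; it only shows how the over-training choice trades training tokens for a smaller model to serve.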
The honest conclusion is the one the closed frontier forces: the most detailed map of frontier training methods is drawn from the open and open-weights tiers, and reasoning about the closed systems means extrapolating from that map. For the economics that sit underneath these choices, see Chapter 37; for the accelerators and networks that host the runs, see Chapter 30.
35.6 Further reading
- Llama 3 (open-weights training report, 4D parallelism, public reliability data). arXiv:2407.21783
- Gemini (systems stack named, scale numbers withheld). arXiv:2403.05530
- Constitutional AI / RLAIF (methods disclosure for a closed lab). arXiv:2212.08073
- GPT-4 technical report (no architecture, size, hardware, data, or method disclosed). arXiv:2303.08774
- DeepSeek-V3 (open-weights report, frontier-scale FP8, auxiliary-loss-free MoE balancing, multi-token prediction). arXiv:2412.19437
- DeepSeek-R1 (reasoning that emerges from large-scale RL). arXiv:2501.12948
- Qwen3 (open-weights report; data withheld). arXiv:2505.09388
- Mistral 7B (GQA plus sliding-window attention). arXiv:2310.06825
- Mixtral of Experts (sparse MoE, 2-of-8 experts). arXiv:2401.04088
- Magistral (asynchronous online-RL reasoning system). arXiv:2506.10910
- Grok-1 open release (open-weight MoE, no method paper). x.ai
- OLMo 2 (fully open: training-stability recipe and two-stage curriculum). arXiv:2501.00656
- Pythia (fully open suite with public checkpoints and reconstructable dataloader order). arXiv:2304.01373
- Nemotron-4 340B (open weights, high-synthetic-fraction alignment data, released generation pipeline). arXiv:2406.11704
- Auxiliary-loss-free load balancing for MoE. arXiv:2408.15664
- DeepSeek-V2 (multi-head latent attention, low-rank KV compression). arXiv:2405.04434
- GQA (near-MHA quality at near-MQA cost, uptrainable from MHA). arXiv:2305.13245
- Chinchilla (compute-optimal scaling). arXiv:2203.15556
- Beyond Chinchilla-optimal (inference-aware over-training rationale). arXiv:2401.00448
- muP (hyperparameter transfer: tune small, transfer to large). arXiv:2203.03466
- MiniCPM (warmup-stable-decay schedule). arXiv:2404.06395
- FineWeb / FineWeb-Edu (web-scale dedup and quality filtering). arXiv:2406.17557
- InstructGPT (RLHF post-training). arXiv:2203.02155
- Megatron-LM (3D parallelism on GPU clusters). arXiv:2104.04473
- CheckFreq (frequent asynchronous checkpointing). usenix.org