Efficient and Scaled Training
All sixteen sections are in draft status. Open problems are flagged inline and consolidated in §14.
This chapter is new, with no AIMA 4e antecedent. Efficient and scaled training covers the infrastructure, algorithms, and systems engineering that make modern frontier-model training possible. By 2026, frontier-model training is industrial infrastructure: multi-hundred-thousand-GPU datacenters; training runs costing hundreds of millions to billions of dollars; specialized parallelism strategies; dedicated systems engineering for reliability and efficiency.
The chapter consolidates training-infrastructure material referenced across many other chapters: Foundation Models §3 (training compute as pillar), LLM §5 (pretraining pipelines), Reinforcement Learning §10 (RLHF infrastructure), Generative Models §8 (diffusion training), AI for Science §6 (large-scale training for science). This chapter develops the comprehensive treatment.
The chapter assumes the Deep Learning chapter, the Foundation Models chapter, and basic familiarity with neural-network training.
Scope and What This Chapter Is About
The chapter develops efficient and scaled training - the systems, algorithms, and infrastructure for training large models. We cover the hardware substrate (GPUs, TPUs, specialized accelerators), distributed training (data, tensor, pipeline, expert parallelism), memory-efficient training (mixed precision, activation checkpointing, ZeRO), efficient fine-tuning (LoRA, adapters, quantization), training-data infrastructure, training reliability at scale, energy and cost, and the broader economics of frontier-model training.
Approximate length target: 15,000–22,000 words.
§1. Motivation and Scope
Three worked instances
Three concrete instances spanning the modern frontier-training landscape.
Instance 1: Training a GPT-4-class model on a 25K-GPU cluster. A major AI lab trains a frontier model on a 25,000-GPU H100 cluster for ~100 days. The training run uses 4D parallelism (data, tensor, pipeline, expert), mixed precision (FP8/BF16), heavy activation checkpointing, and continuous engineering attention. Hardware failures occur multiple times daily; checkpoint-and-restart is routine. The total compute is on the order of 10²⁶ FLOPs (25,000 GPUs at ~1 PFLOPS peak each for ~100 days is about 2 × 10²⁶ FLOPs at peak, and less after utilization losses). The total cost (compute + engineering + data) approaches $1 billion. The resulting model substantially advances frontier capability.
Instance 2: Fine-tuning Llama-405B with LoRA on 8 H100 GPUs. A research team adapts Llama 3.1-405B (open weights, released July 2024) to a specific domain via LoRA (Low-Rank Adaptation). The full model (405B parameters × 2 bytes BF16) requires 810 GB just for weights - exceeds any single GPU. With 4-bit quantization, the model fits in ~200 GB. LoRA adapters (small rank-r matrices, perhaps 100M trainable parameters) train efficiently on 8 H100 (640 GB total). The fine-tuning takes hours; the resulting domain-adapted model performs substantially better than the base model on the target task - at a tiny fraction of full-finetuning cost.
Instance 3: Training a 7B model on a single workstation. A student trains a 7B-parameter model from scratch on 8 RTX 4090 consumer GPUs (192 GB total VRAM). Using FSDP (Fully Sharded Data Parallel) + gradient checkpointing + 8-bit Adam + Flash Attention, the training fits in memory. Training takes weeks on a much smaller dataset (~100B tokens vs the frontier’s 10-15T tokens); the resulting model is far less capable than frontier models but capable enough for the student’s research.
These three instances span frontier training (on the order of 10²⁶ FLOPs, ~$1B cost), open-weights fine-tuning (efficient adaptation of large open models), and academic-scale training (consumer GPUs, small but meaningful). They share the systems-engineering nature of modern training; they differ in scale by ~6 orders of magnitude.
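The compute figure for Instance 1 can be sanity-checked from the numbers given there. The sketch below multiplies GPU count, per-GPU peak throughput, wall-clock time, and a utilization factor. The ~1 PFLOPS peak per H100 is the approximate figure used elsewhere in this chapter; the 40% utilization is an illustrative assumption, not a number from the text.

```python
SECONDS_PER_DAY = 86_400


def training_flops(num_gpus, peak_flops_per_gpu, days, utilization):
    """Total FLOPs = GPUs x peak FLOP/s x seconds x fraction of peak achieved."""
    return num_gpus * peak_flops_per_gpu * days * SECONDS_PER_DAY * utilization


# Instance 1: 25,000 H100s for ~100 days, ~1e15 FLOP/s BF16 peak per GPU.
at_peak = training_flops(25_000, 1e15, 100, utilization=1.0)
# utilization=0.4 is an illustrative assumption, not a number from the text.
at_40_percent = training_flops(25_000, 1e15, 100, utilization=0.4)

print(f"at peak:            {at_peak:.2e} FLOPs")
print(f"at 40% utilization: {at_40_percent:.2e} FLOPs")
```

Under these assumptions the run lands just under 10²⁶ FLOPs; a different utilization figure shifts the result proportionally.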
What efficient and scaled training is
A working definition. Efficient and scaled training is the body of techniques and infrastructure for training neural networks across the full scale spectrum - from consumer-GPU research models through hyperscale frontier models.
The components:
Hardware substrate. GPUs (NVIDIA dominant; AMD, Intel competing), TPUs (Google), specialized accelerators.
Distributed training algorithms. Multiple forms of parallelism for splitting computation across devices.
Memory efficiency. Techniques to fit larger models / batches in available memory.
Numerical efficiency. Lower precision arithmetic with quality preservation.
Data pipeline. Streaming, deduplication, mixture, tokenization at scale.
Reliability engineering. Hardware failure handling, checkpointing, monitoring.
Economic optimization. Cost-quality trade-offs, hardware selection, training-data investment.
The 2026 scale spectrum:
TRAINING SCALE SPECTRUM (2026)
                 Academic/individual    Enterprise/research       Frontier industry
GPUs             ~1 to ~16              ~100 to ~10K              ~10K to ~250K+
Compute cost     $1K-$100K              $1M-$100M                 $100M-$5B+
Training time    Days to weeks          Weeks to months           Months
Model size       <10B parameters        10B-200B parameters       200B-2T+ parameters
Typical models   Custom small models    Open-weights fine-tunes   Frontier closed models
                 Research baselines     Specialized models        (GPT-4 class and beyond)
The crucial property. Different scale regimes have qualitatively different engineering challenges. Frontier training is substantively different engineering from academic training.
Why efficient and scaled training matters
Three structural reasons.
1. Training is the dominant cost. For frontier models, training is the dominant cost (compute, data, engineering). Efficiency improvements directly reduce dollar cost.
2. Scale enables capability. The scaling-laws evidence (Foundation Models §6) demonstrates substantial capability gains from scale. Efficient infrastructure enables greater scale.
3. Accessibility matters. Efficient training enables broader participation. Open-weights models and efficient fine-tuning enable academic and enterprise innovation that would otherwise require frontier-lab resources.
The combined argument. Training infrastructure is foundational - it determines what models can be built, by whom, at what cost. The 2020-2026 efficiency gains have been substantial; continued advances directly enable continued capability scaling.
The 2022-2026 training-infrastructure inflection
A specific industry transition. Pre-2022, training infrastructure was substantial but tractable - multi-GPU but rarely multi-cluster. After 2022 (and accelerating through 2026), frontier training became industrial infrastructure:
Multi-billion-dollar datacenters. Stargate, Anthropic facilities, Google TPU pods, Meta clusters.
100K+-GPU clusters. xAI Colossus (~100K H100 at launch in 2024, later expanded toward 200K), OpenAI Stargate (planned to 1M+ GPUs), Meta multiple 100K clusters.
Power infrastructure constraints. The largest training datacenters draw hundreds of megawatts, and gigawatt-scale sites are planned. Power availability is now a binding constraint on training scale.
Specialized engineering. Frontier labs employ dedicated teams (often 100s of engineers) for training infrastructure.
The growth metrics.
Compute per frontier model. ~10²³ FLOPs in 2020 (GPT-3) → ~10²⁶ FLOPs in 2024 (frontier models) - three orders of magnitude in four years.
Training cost. ~$100M+ (GPT-4 class) → ~$1B+ (frontier 2025-2026).
Cluster size. ~10K GPUs (2020) → 100K+ GPUs (2024) → 1M+ planned (2025-2026).
The 2026 state. Frontier training is substantial industrial infrastructure. Multiple companies have invested tens of billions of dollars in training infrastructure. Power, hardware, and engineering availability constrain the frontier.
Boundaries with adjacent chapters
Deep Learning (Chapter 1) covers the basic training algorithms (backpropagation, optimizers); this chapter covers scaling them.
Foundation Models §3 covers training compute as one of three pillars; this chapter develops the compute infrastructure.
Large Language Models §5 covers the LLM pretraining pipeline; this chapter covers the underlying infrastructure.
Reinforcement Learning §10 covers RLHF; this chapter covers the infrastructure for RLHF at scale.
Self-Supervised Learning §3 covers pretraining objectives; this chapter covers the training infrastructure.
Generative Models §8 covers diffusion-model training; this chapter covers the infrastructure.
AI for Science §6 covers training large models for science; this chapter covers training infrastructure.
What this chapter does not try to do
We do not provide a complete deep-learning textbook treatment.
We do not cover hardware architecture in depth (transistor-level, GPU SM architecture). We treat hardware at the system level.
We do not extensively cover GPU programming (CUDA, kernels). We focus on the systems and algorithms.
We do not develop datacenter facility engineering.
Position taken in this chapter
The chapter treats training infrastructure as industrial engineering in its own right. The 2022-2026 changes in scale and investment are large, and the open questions (efficient architectures, distributed training advances, energy, data) remain unresolved. The chapter develops both the established methodology and the active frontier.
§2. Historical Context
This section traces training infrastructure from single-GPU origins through current hyperscale.
A timeline of the inflection points:
2012 AlexNet (Krizhevsky-Sutskever-Hinton).
Trained on 2 NVIDIA GTX 580 GPUs over
~5 days. ImageNet breakthrough demonstrates
GPUs viable for deep learning.
│
▼
2014-2016 Multi-GPU and early distributed training.
Caffe, Torch, TensorFlow frameworks.
Data parallelism via parameter servers
(DistBelief 2012, TensorFlow 2015).
│
▼
2016 NVIDIA DGX-1 introduces 8-GPU box with
NVLink interconnect. Standard unit for
scaled training. P100 GPUs with HBM2.
│
▼
2017-2018 NVIDIA V100 with Tensor Cores; FP16
mixed-precision training mainstream;
Horovod (Uber 2017) ring-allreduce
enables efficient data-parallel scaling.
│
▼
2018-2019 Transformer scaling begins. BERT (340M
params, 2018) and GPT-2 (1.5B params,
2019). Memory becomes binding constraint.
│
▼
2019 Megatron-LM (NVIDIA 2019) introduces
tensor parallelism for transformers.
GPipe (Google 2019) introduces pipeline
parallelism.
│
▼
2020 ZeRO (DeepSpeed, Microsoft 2020).
Stage 1-3 optimizer/gradient/parameter
sharding. GPT-3 (175B params, OpenAI
2020) trained on ~10K V100 GPUs.
│
▼
2020-2021 NVIDIA A100 with HBM2e (40-80 GB).
TF32 and BF16 mixed precision. ML
Perf benchmarks driving competition.
│
▼
2022 Megatron-Turing NLG (530B params).
PaLM 540B (Google). Chinchilla
(DeepMind) demonstrates data scaling.
│
▼
2022-2023 H100 introduced (80 GB HBM3, FP8
support, Transformer Engine). NVLink
scaled to 256 GPUs per pod.
│
▼
2023 FSDP (PyTorch 2.0) as standard ZeRO-3
implementation. Flash Attention 2
(Dao 2023). Megablocks for MoE.
│
▼
2023-2024 100K+-GPU clusters emerge. xAI Colossus
(~100K H100 at launch, 2024). Meta multiple
100K-GPU clusters. OpenAI/Anthropic/
Google substantial infrastructure.
│
▼
2024 FP8 training demonstrated at scale
(DeepSeek-V3). NVIDIA Blackwell (B100/B200)
announced. Largest AI training sites
drawing hundreds of megawatts. Energy as
binding constraint.
│
▼
2025 Stargate announced ($500B-scale
multi-year infrastructure
investment). 1M+-GPU clusters in
planning. AMD MI300X and Intel
Gaudi 3 competitive at hardware
level.
│
▼
2025-2026 NVIDIA Blackwell deployed at scale.
Power-availability and grid-
connection constraints binding for
largest training. Multiple
geographic locations strategy.
We develop key phases below.
The pre-scaling era (2012-2017)
The 2012 inflection. AlexNet (Krizhevsky-Sutskever-Hinton 2012) was trained on 2 GTX 580 GPUs over ~5 days. It demonstrated that GPUs were viable for deep learning and established the GPU-DL connection.
The pre-Transformer scaling. 2014-2017 saw rapid growth in model size and dataset size, but training remained tractable. Models like ResNet-152 (~60M params, He et al. 2015) trained in days on 8-GPU machines.
The infrastructure of 2015-2017. Multi-GPU via parameter servers (DistBelief, TensorFlow 1.x) and synchronous SGD. Substantial overhead from network communication; scaling to 100+ GPUs was difficult.
The Transformer-scaling era (2017-2020)
The Transformer (Vaswani et al. 2017) and BERT (Devlin et al. 2018) drove rapid model-size growth. BERT-large at 340M parameters required substantial infrastructure; GPT-2 at 1.5B parameters strained 2019 hardware.
Memory became the binding constraint. A 1.5B-parameter model in FP32 requires 6 GB just for weights; activations and optimizer state multiply this several-fold. Single GPUs in 2019 had 16-32 GB memory.
The infrastructure responses:
Mixed precision training (Micikevicius et al. 2017). FP16 weights and activations; FP32 master copy. ~2x memory and compute efficiency.
Gradient accumulation. Simulate larger batches by accumulating gradients across micro-batches.
Activation checkpointing (Chen et al. 2016). Trade compute for memory by recomputing activations during backward pass.
Distributed data parallelism. Multiple replicas of model on multiple machines.
Distributed parallelism for Transformers (2019-2020)
The key 2019-2020 innovations.
Tensor parallelism (Megatron-LM, NVIDIA 2019). Split individual layers across GPUs. Substantial communication required but enables much larger models per GPU group.
Pipeline parallelism (GPipe, Google 2019). Split layers across GPUs in sequence. Substantial pipeline-bubble overhead requires careful micro-batching.
ZeRO (DeepSpeed, Microsoft 2020). Stage 1 (optimizer state sharding), Stage 2 (gradient sharding), Stage 3 (parameter sharding). Reduces per-GPU memory substantially. Foundation for modern data parallelism.
The integration. GPT-3 (Brown et al., OpenAI 2020) at 175B parameters combined data parallelism with model parallelism both within layers (tensor) and across layers (pipeline) - the “3D parallelism” pattern that became standard.
The hyperscale era (2022-2026)
The 2022-2023 transitions.
H100 introduction (NVIDIA, 2022-2023). 80 GB HBM3; FP8 support; Transformer Engine. Substantial improvement over A100 for Transformer training.
FSDP (Fully Sharded Data Parallel, PyTorch). A native ZeRO-3-style implementation, introduced in PyTorch 1.11 (2022) and widely adopted in the PyTorch 2.x era. It simplified large-model training considerably.
Mixture-of-Experts (MoE) scaling. Megablocks (Gale et al. 2023). Expert parallelism enables substantially larger total parameter counts at same active compute.
Flash Attention 2 (Dao 2023). Substantial speedup for attention computation; enabled longer contexts.
The 100K+-GPU cluster era. 2024 saw multiple frontier labs deploy 100K+-GPU clusters:
xAI Colossus (Sept 2024). Launched with ~100K NVIDIA H100 GPUs in Memphis, Tennessee, and subsequently expanded toward 200K. Among the largest known clusters.
Meta. Multiple 100K+-GPU clusters for Llama 4 and beyond.
Google. TPU v5p pods scaling to substantial counts.
Anthropic. Substantial training infrastructure (precise numbers not public).
OpenAI. Substantial Azure infrastructure; Stargate planned at much larger scale.
The 2024-2026 power constraint. The largest frontier training datacenters now draw hundreds of megawatts, with gigawatt-scale sites planned. The 2025 announcement of Stargate (potentially $500B over several years) reflects this scale-up. Power grid availability, transmission infrastructure, and political/regulatory issues are binding constraints.
The 2026 state. Frontier training is substantial industrial infrastructure. The economic, engineering, and infrastructure scale is unprecedented for software development.
Where this leaves us in 2026
Training infrastructure is substantial industrial engineering. The 2022-2026 scale-up has been extraordinary. The frontier is constrained by power, hardware, and engineering availability - not just by software.
§3. Hardware Substrate
GPUs: NVIDIA dominant
NVIDIA GPUs dominate frontier training in 2026. The progression:
V100 (Volta, 2017). 16-32 GB HBM2; Tensor Cores for FP16; ~125 TFLOPS FP16. Standard for 2018-2020 training.
A100 (Ampere, 2020). 40-80 GB HBM2e; ~312 TFLOPS BF16 (dense); ~624 TOPS INT8 (dense). Standard 2021-2023; many systems still operational.
H100 (Hopper, 2022-2023). 80 GB HBM3; Transformer Engine with FP8 support; ~990 TFLOPS BF16; ~1979 TFLOPS FP8. Standard 2023-2025; backbone of frontier training.
H200 (Hopper refresh, 2024). 141 GB HBM3e. Memory-bound improvement over H100.
B100/B200/GB200 (Blackwell, 2024-2025). 192 GB HBM3e; ~5 PFLOPS FP4 sparse; GB200 NVLink-connected pods. Significant capability over H100. Mainstream by mid-2025.
The trajectory. Each NVIDIA generation provides substantial improvement (~2-3x for the relevant precision). Memory capacity, bandwidth, and lower-precision support are key dimensions.
TPUs: Google’s custom silicon
Google’s Tensor Processing Units. Custom chips designed for neural-network training and inference.
TPU v4 (2022). ~275 TFLOPS BF16 per chip; 4096-chip pods. Substantial for Google internal training.
TPU v5e and v5p (2023-2024). Significant improvements. Available externally via Google Cloud.
TPU v6 / Trillium (2024). Substantial efficiency improvements.
The strategic role. TPUs enable Google to train at substantial scale without NVIDIA dependency. Architecturally optimized for matrix operations. PaLM, Gemini, and other Google frontier models trained on TPUs.
AMD MI series
AMD’s GPU competitors. MI250 (2022), MI300X (2023), MI325X (2024). Memory-rich (192 GB HBM3 on MI300X; 256 GB HBM3e on MI325X) with strong FP16 performance. Adopted by Meta, Microsoft, and others for infrastructure diversity.
The 2026 status. AMD MI series is competitive at hardware level. The software ecosystem (ROCm vs CUDA) has historically lagged but is improving. Meta has publicly reported serving Llama 3.1 405B inference on MI300X; Llama 3 training itself was reported on NVIDIA H100 clusters.
Intel Gaudi
Intel’s accelerators (acquired from Habana Labs). Gaudi 2 (2022) and Gaudi 3 (2024). Competitive performance with software ecosystem challenges. Limited adoption.
Specialized accelerators
A category of custom AI chips.
Cerebras Wafer-Scale Engine. Entire silicon wafer as one chip. 850K+ cores; substantial on-chip memory. Used for specific workloads where extreme on-chip memory matters.
Groq LPU. Inference-focused; not for training.
Trainium (AWS), Maia (Microsoft). Hyperscaler custom AI chips. Used in their own infrastructure.
SambaNova, Tenstorrent. Other custom AI silicon companies.
The pattern. Specialized accelerators capture niches; NVIDIA dominates the broad market.
Networking and interconnects
Critical for distributed training.
NVLink and NVSwitch (NVIDIA). GPU-to-GPU high-bandwidth interconnect. NVLink 4.0 provides 900 GB/s per H100. NVSwitch enables full-bandwidth all-to-all within a node and within an NVLink-connected pod.
InfiniBand. High-bandwidth, low-latency networking. 400 Gb/s (NDR) standard for H100 clusters. 800 Gb/s (XDR) emerging.
Ethernet (RoCE). RDMA over Converged Ethernet. Alternative to InfiniBand. Used in some hyperscaler infrastructures.
Optical interconnects. Emerging for inter-rack communication. Cost and reliability improvements ongoing.
The 2026 pattern. NVIDIA NVLink + InfiniBand is the standard frontier-training stack. Substantial bandwidth (TB/s aggregate) needed for tensor and pipeline parallelism.
Memory and storage
HBM (High-Bandwidth Memory). Stacked DRAM directly on GPU package. 3-4 TB/s bandwidth on H100. Capacity (80-192 GB) is binding constraint for many models.
System memory (CPU DRAM). TB-scale per node. Used for offloading (ZeRO-Offload) and data preprocessing.
Storage. Multiple PB of NVMe storage for datasets. High-bandwidth parallel filesystems (Lustre, GPFS) for data loading.
Networked storage. Distributed object storage for checkpoints and datasets. S3 and equivalents.
Datacenters and power
The 2024-2026 binding constraint. Modern frontier-training datacenters draw:
A 100K-GPU H100 cluster. ~150 MW at full utilization. Comparable to a small city.
Frontier-scale 1M-GPU planned facilities. Multi-gigawatt.
The implications.
Site selection. Datacenter locations chosen for power availability.
Power-purchase agreements. Multi-year contracts with utilities.
Renewable energy and nuclear power. Both being negotiated for AI infrastructure.
Grid upgrades. Some sites require new transmission infrastructure.
The 2026 state. Power infrastructure is a substantial constraint on AI training scale. The Stargate-scale projects involve substantial energy infrastructure planning.
§4. Distributed Training Fundamentals
Why distribute
A single GPU has limits:
Compute throughput. ~1 PFLOPS per H100. Training costs roughly 6 FLOPs per parameter per token, so a 100B-parameter model needs about 6 × 10¹¹ FLOPs per token; training on trillions of tokens therefore requires 10²⁴ FLOPs or more in aggregate.
Memory capacity. 80-192 GB. A 500B-parameter model in BF16 requires 1 TB just for weights - exceeds any single GPU.
Memory bandwidth. ~3 TB/s. Limits throughput on memory-bound operations.
For frontier models (100B+ parameters, trillions of tokens), single-GPU training is infeasible. Distribution across multiple GPUs is essential.
The basic forms of parallelism
Four primary patterns.
Data parallelism (DP). Replicate model on multiple devices; each processes different batch slice; gradients averaged.
Tensor parallelism (TP). Split individual layers across devices; each device computes a portion of every layer.
Pipeline parallelism (PP). Assign different layers to different devices; activations flow through pipeline.
Expert parallelism (EP). For MoE models, distribute experts across devices.
Modern frontier training combines all four: 4D parallelism.
Data parallelism (DP)
The simplest form. Each device holds a complete copy of the model. Different devices process different micro-batches.
The procedure:
DATA PARALLELISM (CLASSIC)
Forward pass:
    Device_0: process batch[0:B/N]
    Device_1: process batch[B/N:2B/N]
    ...
    Device_{N-1}: process batch[(N-1)B/N:B]
Backward pass:
    Each device computes gradients on its batch portion.
Gradient synchronization:
    All-reduce gradients across devices (average them).
Optimizer step:
    Each device applies same gradient update; replicas stay in sync.
The scaling. DP scales well in throughput - N GPUs process N times more data per step. But each GPU still needs memory for the full model.
The communication cost. All-reduce of full gradients per step. For large models, this is substantial bandwidth.
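The gradient synchronization step can be illustrated with a simulated ring all-reduce, the pattern popularized for data-parallel training by Horovod. The sketch below is a single-process simulation in plain Python, assuming equal-length gradient vectors that split evenly into one chunk per worker; the function name `ring_all_reduce_mean` is hypothetical, and real systems delegate this to a communication library.

```python
def ring_all_reduce_mean(grads):
    """Average equal-length gradient vectors via a simulated ring all-reduce.

    grads[w] is worker w's local gradient. Returns one averaged copy per worker.
    """
    n = len(grads)
    length = len(grads[0])
    assert length % n == 0, "sketch assumes the vector splits into n equal chunks"
    size = length // n
    # bufs[w][c] is worker w's current copy of chunk c.
    bufs = [[list(g[c * size:(c + 1) * size]) for c in range(n)] for g in grads]

    # Phase 1, reduce-scatter: at step s, worker w sends chunk (w - s) mod n
    # to its ring neighbour w + 1, which adds it into its own copy.
    for s in range(n - 1):
        sends = [((w + 1) % n, (w - s) % n, list(bufs[w][(w - s) % n]))
                 for w in range(n)]
        for dst, c, data in sends:
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], data)]
    # Worker w now holds the complete sum for chunk (w + 1) mod n.

    # Phase 2, all-gather: completed chunks travel once more around the ring.
    for s in range(n - 1):
        sends = [((w + 1) % n, (w + 1 - s) % n, list(bufs[w][(w + 1 - s) % n]))
                 for w in range(n)]
        for dst, c, data in sends:
            bufs[dst][c] = data

    return [[x / n for c in range(n) for x in bufs[w][c]] for w in range(n)]


local_grads = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
]
synced = ring_all_reduce_mean(local_grads)
expected = [sum(col) / 3 for col in zip(*local_grads)]
```

In this sketch each worker sends 2(n - 1) chunks of size length/n, so per-worker traffic stays near twice the gradient size however many workers join the ring.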
Tensor parallelism (TP)
Split within layers across devices. Standard pattern: split matrix multiplications.
For a linear layer $Y = XW$ with weight matrix $W$ of shape $d_{\text{in}} \times d_{\text{out}}$:
Column-parallel. Split $W$ along the output dimension. Each device computes its portion of $Y$. Concatenate outputs.
Row-parallel. Split $W$ along the input dimension (and $X$ correspondingly). Each device computes a partial product. All-reduce to combine.
The Megatron-LM pattern for Transformer:
MEGATRON TENSOR PARALLELISM
Attention:
    QKV projection: column-parallel (split heads across devices)
    Attention: each device computes its head subset
    Output projection: row-parallel (combine heads, all-reduce)
MLP:
    Up projection: column-parallel
    GeLU
    Down projection: row-parallel (all-reduce)
The scaling. TP enables much larger models per device group. Typical: 8-way TP within a single 8-GPU node (NVLink-connected). Beyond 8-way TP, communication costs grow substantially.
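The column-parallel and row-parallel splits can be checked on a toy matrix. The sketch below uses nested Python lists in place of device tensors and assumes the split dimension divides evenly by the device count; the function names are illustrative, not taken from Megatron-LM.

```python
def matmul(x, w):
    """Multiply nested-list matrices: (rows x inner) @ (inner x cols)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def column_parallel(x, w, n_dev):
    """Split W by output columns; each device computes a column slice of Y."""
    step = len(w[0]) // n_dev
    outputs = []
    for d in range(n_dev):
        w_shard = [row[d * step:(d + 1) * step] for row in w]
        outputs.append(matmul(x, w_shard))          # would run on device d
    # Concatenate the per-device column slices row by row.
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]


def row_parallel(x, w, n_dev):
    """Split W by input rows and X by matching columns; sum the partial products."""
    step = len(w) // n_dev
    partials = []
    for d in range(n_dev):
        x_shard = [row[d * step:(d + 1) * step] for row in x]
        w_shard = w[d * step:(d + 1) * step]
        partials.append(matmul(x_shard, w_shard))   # would run on device d
    # The element-wise sum stands in for the all-reduce.
    return [[sum(p[i][j] for p in partials) for j in range(len(w[0]))]
            for i in range(len(x))]


X = [[1, 2, 3, 4], [5, 6, 7, 8]]
W = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1], [1, 3, 1, 0]]
reference = matmul(X, W)
```

Both splits reproduce `reference` exactly here. Pairing a column-parallel layer with a row-parallel one, as in the MLP pattern above, means the column-sharded output can feed the next layer without being gathered first.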
Pipeline parallelism (PP)
Split across layers. Different devices hold different layers; activations flow through pipeline.
The basic pipeline:
PIPELINE PARALLELISM (NAIVE)
Devices: [D0] -> [D1] -> [D2] -> [D3]
Layers:  [L0-9] [L10-19] [L20-29] [L30-39]
Forward:
    batch -> D0 -> activations -> D1 -> activations -> ...
Backward:
    gradients flow back through pipeline.
Problem: pipeline bubbles. While D3 is computing,
D0 is idle waiting for next batch.
The pipeline-bubble problem. Naive PP has substantial idle time. Mitigations:
Micro-batching. Split global batch into many micro-batches; pipeline them through. Reduces bubble fraction.
1F1B scheduling. Interleave forward and backward passes to keep all devices busy.
Interleaved pipelining (Megatron). Each device holds multiple layer groups; finer-grained pipelining.
The scaling. PP enables training models with more layers than fit in TP group. Typical: 8-16 pipeline stages.
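The effect of micro-batching on the bubble can be quantified under a simplifying assumption: every stage takes the same time per micro-batch and the pipeline is filled and drained once per step. With p stages and m micro-batches, the idle fraction is then (p - 1) / (m + p - 1). The sketch below computes that closed form and cross-checks it by counting idle slots in a forward-only schedule; it is an idealized model, not a measurement of any real scheduler.

```python
def bubble_fraction(stages, micro_batches):
    """Idle fraction of a fill-and-drain pipeline with equal stage times."""
    return (stages - 1) / (micro_batches + stages - 1)


def simulated_idle_fraction(stages, micro_batches):
    """Count idle slots when stage s runs micro-batch j at time slot s + j."""
    total_slots = micro_batches + stages - 1
    busy = [[False] * total_slots for _ in range(stages)]
    for s in range(stages):
        for j in range(micro_batches):
            busy[s][s + j] = True
    idle = sum(row.count(False) for row in busy)
    return idle / (stages * total_slots)


for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: bubble = {bubble_fraction(4, m):.3f}")
```

With 4 stages, a single batch leaves the pipeline 75% idle, while 64 micro-batches bring the bubble below 5% in this model.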
Expert parallelism (EP)
For Mixture-of-Experts (MoE) models. Cross-reference LLM §4 MoE.
The pattern. Distribute experts across devices. Each token routed to specific expert(s); only relevant device(s) process each token.
The all-to-all communication. Tokens must be routed to their experts; outputs must be routed back. Substantial all-to-all bandwidth required.
The scaling. EP enables much larger total parameter counts at same active compute. DeepSeek-V3 (671B total, 37B active) uses substantial EP.
4D parallelism
Frontier training combines all four:
4D PARALLELISM PATTERN (FRONTIER)
For a 25,000-GPU cluster:
    - TP = 8 (within node, NVLink)
    - PP = 16 (across nodes, layer pipeline)
    - EP = 16 (for MoE experts)
    - DP = 25,000 / (8 * 16 * 16) = ~12 replicas
Each "model replica" uses 8*16*16 = 2048 GPUs.
12 replicas process different batch slices in parallel.
Total: 12 * 2048 = 24,576 of the 25,000 GPUs collaborating on one model.
The complexity. 4D parallelism is demanding engineering: numerous edge cases, numerous performance bottlenecks, and extensive tuning.
The frameworks. Megatron-DeepSpeed, PyTorch FSDP, JAX-based frameworks (T5x, Pax for Google). These frameworks abstract some of this complexity, but considerable expertise is still required.
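The layout arithmetic in the 4D example can be written as a small helper. The sketch follows the example's convention of treating TP, PP, and EP as independent multiplicative axes (some frameworks instead carve expert parallelism out of the data-parallel dimension); the function name `parallel_layout` and its return keys are hypothetical.

```python
def parallel_layout(world_size, tp, pp, ep):
    """Derive the data-parallel degree and GPU usage from the other three axes."""
    gpus_per_replica = tp * pp * ep
    dp = world_size // gpus_per_replica          # whole replicas only
    gpus_used = dp * gpus_per_replica
    return {
        "gpus_per_replica": gpus_per_replica,
        "dp": dp,
        "gpus_used": gpus_used,
        "gpus_unused": world_size - gpus_used,
    }


layout = parallel_layout(world_size=25_000, tp=8, pp=16, ep=16)
print(layout)
```

For the 25,000-GPU example this gives 12 replicas of 2,048 GPUs each, with 424 GPUs left over; the sketch does not model what a real deployment does with that remainder.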
Communication patterns
The aggregate bandwidth picture. For a frontier training run:
Intra-node (NVLink). TB/s aggregate.
Inter-node (InfiniBand). 400-800 Gb/s per link.
All-reduce across cluster. Substantial network utilization for DP gradient sync.
In some parallelism configurations, communication time exceeds compute time. Reducing and overlapping communication is a large part of the engineering work.
§5. Memory-Efficient Training
Memory budget per device
A typical breakdown for transformer training:
PER-DEVICE MEMORY BREAKDOWN (ROUGH)
1. Model parameters (weights)
2. Gradients
3. Optimizer state (Adam: 2x parameters)
4. Activations (forward pass)
5. Temporary buffers (matmul workspace)
6. CUDA framework overhead
For BF16 training of N-parameter model with Adam:
    - Weights: 2N bytes
    - Gradients: 2N bytes
    - FP32 master weights: 4N bytes
    - Adam state (m, v in FP32): 8N bytes
    - Activations: depends on batch size and sequence length
Total: roughly 16-20N bytes minimum + activations
For a 70B model: ~1.1-1.4 TB just for weights+gradients+optimizer state. This exceeds the combined memory of an 8-GPU node with 80 GB GPUs (640 GB), so the state must be sharded across more devices.
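The per-parameter accounting can be turned into a quick calculator. The sketch below assumes the 16-bytes-per-parameter case: BF16 weights and gradients, Adam moments in FP32, plus an FP32 master copy of the weights (the master copy is the usual mixed-precision convention and is an assumption about how the lower bound of the total is reached). Activations and buffers are excluded.

```python
import math

# Bytes per parameter for mixed-precision Adam training (assumed breakdown).
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "adam_m_and_v_fp32": 8,
}


def model_state_bytes(num_params):
    """Memory for weights, gradients, and optimizer state; activations excluded."""
    return num_params * sum(BYTES_PER_PARAM.values())


params_70b = 70_000_000_000
state = model_state_bytes(params_70b)
gpus_for_state = math.ceil(state / 80e9)     # 80 GB GPUs, state only

print(f"model state: {state / 1e12:.2f} TB")
print(f"80 GB GPUs needed for state alone: {gpus_for_state}")
```

A 70B model needs about 1.12 TB under this accounting, or at least 14 GPUs of 80 GB each before a single activation is stored.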
Mixed precision training
The basic technique. Use lower-precision (FP16, BF16, FP8) for most computation; keep master copy in higher precision.
FP16 (Micikevicius et al. 2017). Standard. Forward and backward pass in FP16; master weights in FP32. Loss scaling to handle small gradients.
BF16. Same exponent range as FP32 (no loss scaling needed). Used widely for Transformer training. Mainstream by 2022.
FP8 (H100 Transformer Engine, 2022-2023). Two formats: E4M3 (more precision, less range) and E5M2 (more range, less precision). DeepSeek-V3 used FP8 training at scale (2024).
Sub-FP8 (FP4, etc.). Active research. Blackwell supports FP4.
The 2026 standard. BF16 for most training; FP8 for largest models where memory and bandwidth savings substantial. FP4 emerging.
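Loss scaling for FP16 can be demonstrated with the standard library alone, since Python's `struct` module can round a float to IEEE half precision (format code `e`). The sketch below shows a small gradient underflowing to zero in FP16 and surviving when multiplied by a loss scale first. The gradient value and the scale of 1024 are illustrative choices.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8          # illustrative gradient magnitude
loss_scale = 1024.0       # illustrative loss scale

unscaled = to_fp16(tiny_grad)                  # below FP16's smallest subnormal
scaled = to_fp16(tiny_grad * loss_scale)       # representable after scaling
recovered = scaled / loss_scale                # unscale in higher precision

print(f"FP16 without scaling: {unscaled}")
print(f"FP16 with scaling, then unscaled: {recovered:.4e}")
```

BF16 avoids this step because it keeps FP32's exponent range, at the cost of fewer mantissa bits.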
Activation checkpointing
The technique. Instead of storing all forward-pass activations for backward pass, recompute them.
The trade-off. Save memory (don’t store activations); spend compute (recompute during backward).
The pattern.
ACTIVATION CHECKPOINTING
Without:
    Forward: compute and store all activations.
    Backward: use stored activations for gradients.
    Memory: O(layers * activations_per_layer)
    Compute: 1 forward + 1 backward
With:
    Forward: compute, only store at checkpoints.
    Backward: recompute non-checkpointed activations.
    Memory: O(sqrt(layers) * activations_per_layer)
    Compute: 1 forward + 1 forward (recomp) + 1 backward
~33% more compute (one extra forward, with backward costing roughly twice a forward), in exchange for activation memory that grows with sqrt(layers) rather than layers.
The 2026 status. Standard practice. Most frontier training uses some level of checkpointing.
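The recompute-in-backward pattern can be shown on a toy network: a chain of scalar layers x -> tanh(w * x), with the loss taken to be the final activation. The sketch below compares a backward pass that stores every activation with one that stores only every `segment`-th activation and recomputes the rest. The layer form, weights, and segment size are illustrative assumptions.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def backward_full(ws, x0):
    """Store every activation. Returns (dL/dw list, activations held)."""
    acts = [x0]
    for w in ws:
        acts.append(layer(w, acts[-1]))
    grads, g = [0.0] * len(ws), 1.0              # loss = final activation
    for i in reversed(range(len(ws))):
        local = 1.0 - acts[i + 1] ** 2           # derivative of tanh at w * x
        grads[i] = g * local * acts[i]
        g = g * local * ws[i]
    return grads, len(acts)


def backward_checkpointed(ws, x0, segment):
    """Store every `segment`-th activation; recompute the rest per segment."""
    n = len(ws)
    checkpoints = {0: x0}
    x = x0
    for i, w in enumerate(ws):
        x = layer(w, x)
        if (i + 1) % segment == 0 and i + 1 < n:
            checkpoints[i + 1] = x
    grads, g, peak = [0.0] * n, 1.0, 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, n)
        acts = [checkpoints[start]]              # recompute this segment only
        for i in range(start, end):
            acts.append(layer(ws[i], acts[-1]))
        # acts[0] is itself a checkpoint, so do not count it twice.
        peak = max(peak, len(checkpoints) + len(acts) - 1)
        for i in reversed(range(start, end)):
            local = 1.0 - acts[i - start + 1] ** 2
            grads[i] = g * local * acts[i - start]
            g = g * local * ws[i]
    return grads, peak


ws = [0.5 + 0.1 * i for i in range(16)]
grads_full, stored_full = backward_full(ws, 0.3)
grads_ckpt, stored_ckpt = backward_checkpointed(ws, 0.3, segment=4)
```

With 16 layers and a segment of 4 (the square root of the depth), the checkpointed pass holds 8 activations at its peak instead of 17 and produces the same gradients.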
ZeRO and FSDP
ZeRO (Rajbhandari et al., Microsoft 2020). Eliminate memory redundancy across data-parallel replicas.
The three stages.
ZeRO-1. Shard optimizer state across DP replicas. Up to ~4x reduction in model-state memory with Adam, approached at large DP degree.
ZeRO-2. Additionally shard gradients. Up to ~8x.
ZeRO-3. Additionally shard parameters. Each device holds only its parameter shard. Substantial memory savings.
The trade-off. More communication (parameters must be all-gathered for forward/backward). But memory savings enable training larger models on fewer devices.
FSDP (Fully Sharded Data Parallel) in PyTorch. PyTorch's native ZeRO-3-style implementation. Now standard for many large-model training scenarios.
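The per-GPU savings of each ZeRO stage follow from which pieces of model state are divided by the data-parallel degree. The sketch below uses the accounting from the ZeRO paper for mixed-precision Adam: 2 bytes of weights, 2 bytes of gradients, and 12 bytes of optimizer state per parameter (FP32 master weights, momentum, and variance). The 7.5B-parameter, 64-GPU configuration is that paper's running example; activations and buffers are not modeled.

```python
def zero_bytes_per_gpu(num_params, dp_degree, stage, opt_bytes_per_param=12):
    """Model-state bytes per GPU under ZeRO stage 0 (plain DP) through 3."""
    weights = 2 * num_params
    grads = 2 * num_params
    opt = opt_bytes_per_param * num_params
    if stage >= 1:
        opt /= dp_degree          # ZeRO-1: shard optimizer state
    if stage >= 2:
        grads /= dp_degree        # ZeRO-2: also shard gradients
    if stage >= 3:
        weights /= dp_degree      # ZeRO-3: also shard parameters
    return weights + grads + opt


per_stage_gb = {
    stage: zero_bytes_per_gpu(7.5e9, dp_degree=64, stage=stage) / 1e9
    for stage in range(4)
}
for stage, gb in per_stage_gb.items():
    print(f"stage {stage}: {gb:6.1f} GB per GPU")
```

The output falls from 120 GB with plain data parallelism to about 31, 17, and 2 GB for stages 1 through 3, which is where the approximate 4x and 8x figures come from.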
Offload techniques
For models too large for GPU memory.
ZeRO-Offload. Offload optimizer state and gradients to CPU memory.
ZeRO-Infinity. Offload to CPU and NVMe storage.
The trade-off. CPU/NVMe is much slower than GPU memory. Substantial overhead. Used when no other option.
Gradient accumulation
Trade time for memory. Process micro-batches sequentially, accumulating gradients. Apply optimizer step after accumulating across full effective batch.
The pattern:
GRADIENT ACCUMULATION
for step in 1..num_steps:
    optimizer.zero_grad()
    for micro_batch in micro_batches(batch, num_micro=K):
        loss = forward(model, micro_batch) / K   # scale so the K gradients average
        loss.backward()                          # gradients accumulate
    optimizer.step()
Enables larger effective batch sizes than fit in memory. Standard practice for large-batch training.
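That accumulation reproduces the full-batch gradient can be verified on a one-parameter model. The sketch below uses a scalar linear model with mean-squared-error loss and equal-sized micro-batches; the data values are arbitrary, and the 1/K scaling per micro-batch is what makes the accumulated sum equal the full-batch mean.

```python
def grad_mse(w, xs, ts):
    """d/dw of mean((w * x - t)^2) over the given examples."""
    return sum(2.0 * (w * x - t) * x for x, t in zip(xs, ts)) / len(xs)


def accumulated_grad(w, xs, ts, num_micro):
    """Sum per-micro-batch gradients, each scaled by 1/num_micro."""
    size = len(xs) // num_micro
    total = 0.0
    for i in range(num_micro):
        chunk = slice(i * size, (i + 1) * size)
        total += grad_mse(w, xs[chunk], ts[chunk]) / num_micro
    return total


xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 0.25, -2.0]
ts = [1.0, -2.5, 3.5, 3.0, -1.0, 6.5, 0.5, -4.0]
w = 0.7
full_batch = grad_mse(w, xs, ts)
accumulated = accumulated_grad(w, xs, ts, num_micro=4)
```

The two values agree up to floating-point rounding. Without the 1/K factor the accumulated gradient would be K times too large, which acts like an unintended learning-rate increase.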
Efficient attention
The attention bottleneck. Vanilla attention is O(L²) memory and compute in sequence length L. For long contexts, this is binding.
Flash Attention (Dao et al. 2022; Flash Attention 2, 2023). Restructured attention computation to avoid materializing the L×L attention matrix. Substantial speedup and memory savings.
The result. Long-context training (128K, 1M tokens) becomes feasible. Standard in modern training stacks.
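The restructuring that avoids materializing the full score matrix rests on a streaming ("online") softmax: keys and values are visited block by block while a running maximum, denominator, and weighted sum are kept and rescaled as needed. The sketch below shows only that numerical idea for a single query in plain Python. It is not the Flash Attention kernel, which additionally tiles queries and manages on-chip memory, and the block size and inputs are arbitrary.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_naive(q, keys, values):
    """Single-query softmax attention that materializes every score."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def attention_streaming(q, keys, values, block):
    """Same result, visiting keys and values one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [dot(q, k) * scale for k in keys[start:start + block]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)   # 0.0 on the first block
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            p = math.exp(s - new_max)
            denom += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [0.2, -0.1, 0.4]
keys = [[0.1 * i, 0.05 * (i % 3), -0.02 * i] for i in range(10)]
values = [[float(i), float(i * i)] for i in range(10)]
full = attention_naive(q, keys, values)
streamed = attention_streaming(q, keys, values, block=4)
```

The streaming version holds only one block of scores at a time, so its working memory depends on the block size rather than the sequence length.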
Sliding window attention. Limit attention to local window. Reduces compute/memory at expense of full context.
Sparse attention. Various patterns (BigBird, Longformer). Active research.
Optimizer memory
Adam’s 8 bytes of state per parameter (8N bytes for an N-parameter model) is a large share of the memory budget. Alternatives:
8-bit Adam (bitsandbytes). Quantize optimizer state to 8 bits. ~4x smaller optimizer state than FP32 Adam (8N → ~2N bytes) with quality preservation.
Adafactor. Approximation that reduces state. Used in T5 and other Google training.
Lion (Chen et al. 2023). Sign-based optimizer with smaller state. Discovered via search.
SGD with momentum. Simpler. Sometimes competitive for very large batches.
The combined effect
Modern frontier training uses all of these techniques in combination:
Mixed precision (BF16/FP8)
Activation checkpointing
FSDP (ZeRO-3)
4D parallelism
Flash Attention
Memory-efficient optimizer
Combined, these enable training models that would be infeasible without them.
§6. Efficient Fine-Tuning
The fine-tuning context
For pretrained models, full fine-tuning requires updating all parameters. This is:
Memory-expensive. Need gradients for all parameters.
Storage-expensive. Each fine-tuned variant is a full model copy.
Compute-expensive. Forward+backward through all parameters.
For large models (70B+), full fine-tuning is infeasible for most organizations. Efficient fine-tuning methods address this.
LoRA: Low-Rank Adaptation
LoRA (Hu et al., Microsoft 2021). The key insight: weight updates during fine-tuning often have low intrinsic rank.
The technique. Instead of updating $W$ directly, learn a low-rank update:
$$W' = W + BA$$
where $B$ is $d \times r$ and $A$ is $r \times k$, with small rank $r$ (typically 4-64).
The training. Only $A$ and $B$ are trained. $W$ stays frozen.
The procedure:
LORA TRAINING
Inputs:
    - Pretrained model with weights W_1, W_2, ..., W_L
    - Fine-tuning data
    - Rank r (e.g., 16)
Initialization:
    For each target layer i:
        A_i = random Gaussian (small)
        B_i = zero
    So initially BA = 0 and model unchanged.
Training loop:
    1. Forward pass: y = (W + BA) x
       where W is frozen, only A, B trainable.
    2. Backward pass: compute gradients only for A, B.
    3. Optimizer step: update only A, B.
Outputs:
    LoRA adapters (A, B for each target layer).
    Original model weights unchanged.
The memory savings. For a layer of size $d \times k$ with rank-$r$ LoRA, you train $r(d + k)$ parameters instead of $dk$. For $d = k = 4096$, $r = 16$: ~130K parameters vs ~17M. ~100x fewer trainable parameters.
The inference. Can either:
Keep $BA$ separate. Adds slight inference latency.
Merge: compute $W' = W + BA$ once and replace $W$. Zero inference overhead.
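As a rough illustration of the procedure and the two inference options above, the following pure-Python sketch applies a LoRA update to one small layer. The helper names (`lora_init`, `lora_forward`, `merge`, `trainable_params`) and the toy dimensions are assumptions made for this example, not part of any library; a real implementation would use a tensor framework and an optimizer rather than hand-set values.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]

def lora_init(d, k, r, seed=0):
    """A is r x k with small Gaussian entries; B is d x r zeros, so BA = 0."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(k)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d)]
    return A, B

def lora_forward(W, A, B, x):
    """y = W x + B (A x); the d x k product BA is never materialized."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]

def merge(W, A, B):
    """Fold the adapter into the base weights: W' = W + BA."""
    BA = matmul(B, A)
    return [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]

def trainable_params(d, k, r):
    """Number of parameters in A and B together."""
    return r * (d + k)

# Toy layer (assumed sizes): d = 4 outputs, k = 3 inputs, rank r = 2.
rng = random.Random(1)
W = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
A, B = lora_init(d=4, k=3, r=2)
x = [1.0, -2.0, 0.5]

y_init = lora_forward(W, A, B, x)      # equals W x because B starts at zero
B[0][0], B[2][1] = 0.3, -0.7           # stand-in for a few optimizer steps on B
y_adapter = lora_forward(W, A, B, x)   # option 1: adapter kept separate
y_merged = matvec(merge(W, A, B), x)   # option 2: adapter merged into W

print(trainable_params(4096, 4096, 16), 4096 * 4096)  # 131072 16777216
```

The separate and merged paths give the same output up to floating-point rounding; the separate path costs two extra small matrix-vector products per call, which is the latency trade-off noted above.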
LoRA variants
QLoRA (Dettmers et al. 2023). LoRA + 4-bit quantization of base model. Enables fine-tuning very large models on modest hardware. Standard for academic and enterprise fine-tuning.
DoRA (Liu et al. 2024). Decompose weight into magnitude and direction; LoRA only on direction. Some empirical improvements.
LoRA+ (Hayou et al. 2024). Different learning rates for $A$ and $B$. Some improvements.
Multi-LoRA serving. Serve many LoRA-adapted variants of one base model while sharing the base weights. Inference platforms (vLLM, others) support this.
Other PEFT methods
Adapters (Houlsby et al. 2019). Insert small trainable modules between layers. Older approach; mostly superseded by LoRA.
Prefix tuning (Li and Liang 2021). Train continuous prefix prompts. Some applications.
P-Tuning v2 (Liu et al. 2021). Refined prefix tuning.
(IA)³ (Liu et al. 2022). Multiplicative adapters.
BitFit (Ben Zaken et al. 2021). Only train bias parameters.
The 2026 picture. LoRA (and QLoRA) is dominant. Other PEFT methods used in specialized cases.
Quantization
Reduce model precision for inference (and sometimes training).
Post-training quantization. Quantize trained model. Various techniques (GPTQ, AWQ, SmoothQuant).
INT8 (8-bit). Standard. Substantial memory savings; minimal quality loss for many models.
INT4 (4-bit). More aggressive. Standard via GPTQ, AWQ. ~4x memory savings vs FP16.
Sub-4-bit. Active research.
Quantization-aware training. Train model with quantization in the loop. Better quality at low precision.
The 2026 status. INT4 quantization is standard for inference. INT8 is used in some training scenarios. Quantization is a substantial enabler for accessible large-model deployment.
Full fine-tuning when feasible
For smaller models or when LoRA is insufficient, full fine-tuning still happens. Standard techniques:
DeepSpeed ZeRO or FSDP for distributed training.
Mixed precision (BF16).
Activation checkpointing.
Smaller batch sizes than pretraining.
Lower learning rates (typically 10-100x lower than pretraining).
For 7B-70B models, full fine-tuning is feasible on 8-64 H100 GPUs. For 100B+, LoRA-style approaches dominate due to cost.
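To make the memory argument concrete, a back-of-envelope sketch follows. It assumes the common mixed-precision Adam accounting of about 16 bytes per parameter (BF16 weights and gradients, FP32 master weights, two FP32 Adam moments) and 80 GB of memory per GPU. Activations, communication buffers, and fragmentation are ignored, so real requirements are higher. The function names are hypothetical.

```python
import math

# Assumed accounting: 2 (BF16 weights) + 2 (BF16 grads) + 4 (FP32 master) + 8 (Adam moments).
BYTES_PER_PARAM_FULL_FT = 16

def model_state_gb(n_params, bytes_per_param=BYTES_PER_PARAM_FULL_FT):
    """Memory for weights, gradients, and optimizer state, in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

def min_gpus_for_state(n_params, gpu_mem_gb=80):
    """Lower bound on GPU count if the state is perfectly sharded (e.g. ZeRO-3/FSDP)."""
    return math.ceil(model_state_gb(n_params) / gpu_mem_gb)

for n in (7_000_000_000, 70_000_000_000):
    print(f"{n / 1e9:.0f}B params: {model_state_gb(n):.0f} GB of state, "
          f">= {min_gpus_for_state(n)} GPUs before activations")
```

Under these assumptions a 70B model needs about 1.1 TB for model state alone, which is why the practical GPU counts quoted above sit well above the bare minimum once activations and throughput are considered.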
§7. Training Data Infrastructure
Data scale
Modern frontier training uses trillions of tokens. Examples:
GPT-3 (2020). ~300B tokens.
Chinchilla (2022). Argued ~20 tokens per parameter optimal (so ~1.4T for 70B model).
Llama 2 (2023). 2T tokens.
Llama 3 (2024). 15T tokens.
Frontier 2025-2026. 15-30T+ tokens.
The data storage. 15T tokens at ~4 bytes per token (post-tokenization, packed) is ~60 TB. Pre-tokenization (raw text), substantially more.
Data sources
The major sources of pretraining data.
Common Crawl. Web crawl. Substantial corpus (hundreds of TB raw per crawl snapshot; petabytes in aggregate). Heavily filtered for quality.
Curated web corpora. RefinedWeb, FineWeb, FineWeb-Edu. Filtered/cleaned versions of Common Crawl.
Books. Books3 (controversial copyright), library corpora.
Code repositories. GitHub, package repositories.
Reference sources. Wikipedia, Stack Exchange, ArXiv.
Synthetic data. Generated by other models. Increasingly important.
The legal and ethical questions. Cross-reference Generative Models §12 critique on copyright. The data-source landscape is contested.
Data pipeline
A typical modern pipeline:
PRETRAINING DATA PIPELINE
1. Raw data acquisition
Web crawls, books, code, etc.
2. Quality filtering
- Language identification
- Quality classifiers (often LLM-based)
- Heuristic filters (length, format)
3. Deduplication
- Exact dedup (hashes)
- Fuzzy dedup (MinHash + LSH)
- Substring dedup (suffix arrays)
4. PII filtering
Remove personally-identifiable information.
5. Toxicity / safety filtering
Remove harmful content.
6. Mixture composition
Combine sources with chosen weights.
7. Tokenization
Convert text to tokens (cross-reference LLM §3).
8. Packing
Pack tokens into fixed-length sequences.
9. Shuffling
Random shuffle for training.
10. Streaming to training
Distribute to training workers.
Each step is substantial engineering. Data quality has substantial impact on model quality.
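The packing step (step 8) can be illustrated with a small sketch. It assumes documents are already tokenized into integer IDs and that a separator token (here the hypothetical `eos_id = 0`) marks document boundaries. The function name `pack_sequences` is made up for this example, and dropping the trailing partial sequence is only one of several possible policies.

```python
def pack_sequences(docs, seq_len, eos_id):
    """Concatenate tokenized documents, separated by eos_id, into fixed-length rows."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)              # mark the document boundary
    n_full = len(stream) // seq_len        # the trailing partial row is dropped
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # three toy tokenized documents
packed = pack_sequences(docs, seq_len=4, eos_id=0)
print(packed)   # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that the second row mixes tokens from two documents; whether attention is allowed to cross that boundary is a separate design choice not shown here.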
Deduplication
A critical step. Duplicate training data hurts model quality and increases memorization.
Exact dedup. Hash documents; remove exact duplicates.
Fuzzy dedup. MinHash + Locality-Sensitive Hashing. Detect near-duplicates.
Substring dedup. Find long shared substrings across documents.
The scale. Frontier datasets have billions of documents. Dedup requires substantial compute and engineering. Often takes weeks for largest corpora.
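A minimal sketch of fuzzy deduplication with MinHash and LSH banding follows. It is a toy, standard-library version meant to show the idea, not a production pipeline: the shingle size, the 64 hash functions, the 16 bands, and all function names are assumptions chosen for this example, and seeded BLAKE2b hashes stand in for random permutations.

```python
import hashlib

def shingles(text, n=3):
    """Set of word n-grams for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def _hash(token, seed):
    """64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_perm=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_candidate_pairs(signatures, bands=16):
    """Documents sharing at least one identical band become candidate near-duplicates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bank",
    "c": "completely different words appear in this other unrelated sentence",
}
sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
candidates = lsh_candidate_pairs(sigs)
```

Banding is what avoids comparing every pair of documents: only documents that collide in some band are compared in full, which matters at billions of documents.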
Data mixing
Weights for different sources matter. Examples:
More code. Models gain coding capability.
More math. Models gain math capability.
More multilingual. Models gain multilingual capability.
Modern training carefully tunes mixture weights. Empirical search; sometimes per-stage (different mixtures for early vs late pretraining).
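A per-stage mixture can be sketched as weighted sampling over sources. The weights below are purely illustrative values invented for this example, as are the stage names and the `sample_sources` helper; they do not describe any particular lab's recipe.

```python
import random
from collections import Counter

# Hypothetical mixture weights per training stage (illustrative values only).
STAGE_WEIGHTS = {
    "early": {"web": 0.80, "code": 0.10, "math": 0.05, "multilingual": 0.05},
    "late": {"web": 0.50, "code": 0.25, "math": 0.15, "multilingual": 0.10},
}

def sample_sources(stage, n, seed=0):
    """Draw the source of each of n training documents from the stage's mixture."""
    weights = STAGE_WEIGHTS[stage]
    names = list(weights)
    rng = random.Random(seed)
    return rng.choices(names, weights=[weights[s] for s in names], k=n)

counts = Counter(sample_sources("late", 10_000))
for source, count in sorted(counts.items()):
    print(f"{source}: {count / 10_000:.2%}")
```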
Curriculum and data ordering
The classic view: shuffle data. Recent evidence: ordering may matter.
Curriculum learning. Easy → hard progression. Some empirical evidence of benefit.
Domain mixing schedules. Vary mixture over training. Some labs use different mixtures for different training stages.
The 2026 picture. Active research area; not yet standard practice. Most training still uses shuffled data.
Synthetic data
A substantial 2024-2026 development. Use models to generate training data for other models.
Reasoning-RL synthetic data. Reasoning models generate reasoning chains for SFT (cross-reference Reasoning Models §9).
Instruction-tuning synthetic data. GPT-generated instructions/responses. Standard in Alpaca-style fine-tuning.
Code synthetic data. Generate code problems and solutions. Substantial for Code Llama and similar.
Math synthetic data. Generate math problems with verified solutions.
The concerns.
Model collapse. If models train on outputs of models, quality may degrade over generations. Cross-reference Generative Models §10.
Bias amplification. Generator biases propagate.
Lower diversity. Synthetic data may be less diverse than natural.
The 2026 picture. Synthetic data is a substantial component of modern training data. Concerns are mitigated by mixing with natural data, filtering, and human verification.
Data infrastructure
The systems aspects.
Distributed storage. PB-scale storage for raw and processed data. S3, parallel filesystems.
Streaming. Don’t load full dataset; stream to training workers.
Tokenization pipelines. Convert raw text to tokens at scale. Often distributed.
Data loaders. PyTorch DataLoader, Hugging Face datasets, custom frameworks.
The 2026 picture. Data infrastructure is a substantial part of training engineering. Often underappreciated relative to model engineering but equally important.
§8. Training Reliability at Scale
The reliability challenge
A 25,000-GPU H100 cluster has many failure sources:
GPU failures. Each GPU has non-zero failure rate. With 25K GPUs, multiple failures per day expected.
Network issues. Cables, switches, link errors.
Power events. Brownouts, partial outages.
Software bugs. CUDA, framework, application.
Hardware degradation. Slow GPUs, intermittent errors.
A frontier training run of 100 days will encounter many failures.
Checkpoint and restart
The basic strategy. Save model state periodically; restart from checkpoint after failures.
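The checkpoint cost. For a 500B-parameter model, the BF16 weights alone are ~1 TB; adding FP32 master weights and Adam moments (about 12 more bytes per parameter) brings a full checkpoint to roughly 7 TB. Writing to storage takes minutes; restarting takes substantial time.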
The frequency trade-off. More frequent checkpoints = less lost work but more overhead. Typical: every hour or every N training steps.
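One common first-order way to reason about this trade-off is Young's approximation, which puts the interval that minimizes expected lost time near $\sqrt{2CM}$ for checkpoint cost $C$ and mean time between failures $M$. The sketch below applies it. The per-GPU failure rate and the checkpoint cost are hypothetical inputs, and the formula assumes independent, memoryless failures that each force a restart.

```python
import math

def cluster_mtbf_hours(n_gpus, per_gpu_mtbf_hours):
    """With independent failures, cluster-level MTBF shrinks linearly in GPU count."""
    return per_gpu_mtbf_hours / n_gpus

def young_interval_hours(checkpoint_cost_hours, mtbf_hours):
    """Young's approximation for the checkpoint interval: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)

# Hypothetical inputs: 50,000-hour per-GPU MTBF, 5 minutes of blocking time per checkpoint.
mtbf = cluster_mtbf_hours(n_gpus=25_000, per_gpu_mtbf_hours=50_000)
interval = young_interval_hours(checkpoint_cost_hours=5 / 60, mtbf_hours=mtbf)
print(f"cluster MTBF: {mtbf:.1f} h, failures/day: {24 / mtbf:.0f}, "
      f"suggested interval: {interval * 60:.0f} min")
```

With these assumed inputs the cluster fails about 12 times a day and the suggested interval is about 35 minutes; cheaper checkpoints shorten the interval further.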
Asynchronous checkpointing
A technique. Save checkpoint to memory; asynchronously flush to storage. Reduces blocking overhead.
The pattern. Training continues while checkpoint writes. Substantial savings for long runs.
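The pattern can be sketched with the standard library alone. This is a toy, single-process illustration: the `async_checkpoint` helper, the dictionary standing in for model state, and the pickle file format are all assumptions for the example. Real systems snapshot GPU tensors to host memory and write sharded files.

```python
import copy
import os
import pickle
import tempfile
import threading

def async_checkpoint(state, path):
    """Snapshot state in memory, then write it to disk on a background thread."""
    snapshot = copy.deepcopy(state)          # brief blocking step

    def _write():
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)           # atomic rename: no partial checkpoints

    writer = threading.Thread(target=_write)
    writer.start()
    return writer

def demo():
    state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
    with tempfile.TemporaryDirectory() as ckpt_dir:
        path = os.path.join(ckpt_dir, "ckpt.pkl")
        writer = async_checkpoint(state, path)
        state["step"] += 1                   # training continues while the write runs
        state["weights"][0] = 9.9
        writer.join()                        # only needed before reading or exiting
        with open(path, "rb") as f:
            return pickle.load(f), state

saved, live = demo()
print(saved["step"], live["step"])           # 100 101
```

The in-memory copy is what makes this safe: the writer thread sees the state as of step 100 even though training has already moved on.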
Hardware monitoring
Continuous monitoring of:
GPU temperatures. Detect overheating before failure.
GPU utilization. Detect stragglers (slow GPUs).
Network metrics. Bandwidth, packet loss, link errors.
Memory errors. ECC errors, OOM.
Power consumption. Detect anomalies.
Automated detection and mitigation. Replace failing GPUs; route around bad networks; etc.
Straggler mitigation
A specific challenge. One slow GPU in synchronous training holds back the entire cluster.
The mitigations.
Skip stragglers. Continue training with reduced batch size; resync later.
Replace stragglers. Hot-swap (where possible).
Asynchronous methods. Some training (especially RL) uses asynchronous updates.
The frontier-training pattern. Substantial automation for detecting and mitigating stragglers.
Failure recovery automation
Modern training systems have substantial automation:
Auto-detect. Detect hardware failures, software hangs, network issues.
Auto-restart. Restart from latest checkpoint.
Auto-debugging. Capture state at failure; preserve for postmortem.
The investment. Frontier labs employ teams dedicated to training reliability. The economic value of preventing one day of lost training (potentially $1M+) justifies substantial engineering investment.
Determinism and reproducibility
A subtle but important issue. Distributed training is often non-deterministic - same code with same data may produce different results across runs.
Sources of non-determinism.
Floating-point order of operations.
Communication ordering.
Hardware variation.
Implications. Reproducing results is hard. Debugging is hard. Some labs invest in deterministic training infrastructure.
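The first source, floating-point order of operations, is easy to demonstrate: floating-point addition is not associative, so a reduction that sums the same gradient contributions in a different order can return a different value. The values below are chosen for illustration, and the explicit loop is used so that the result does not depend on any compensated-summation behavior of a built-in.

```python
def reduce_in_order(values):
    """Plain left-to-right floating-point sum, as a stand-in for one reduction order."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three addends, grouped differently.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)                              # False

# Same contributions arriving in two different orders.
order_a = reduce_in_order([1e16, 1.0, -1e16])     # the 1.0 is absorbed, then cancelled
order_b = reduce_in_order([1e16, -1e16, 1.0])     # large terms cancel first
print(order_a, order_b)                           # 0.0 1.0
```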
Continuous improvement
Frontier training runs often include continuous monitoring and adjustment:
Loss curves monitored continuously. Spikes investigated.
Generated samples evaluated periodically. Quality regressions detected.
Hyperparameters sometimes adjusted mid-run. Learning rate, batch size.
The hands-on approach. Training is not “set and forget” - active engineering attention is substantial during training runs.
§9. Inference Optimization
Why inference matters
Training cost is substantial but one-time. Inference cost is recurring - every query costs compute.
For successful products, aggregate inference cost can dwarf training cost. ChatGPT serves billions of queries; inference infrastructure is substantial.
Inference optimization has substantial economic value.
KV cache management
The KV cache (cross-reference LLM §6) stores attention keys and values for previously-generated tokens. For long contexts, KV cache is substantial memory.
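KV cache size. For batch $B$, sequence length $L$, layers $N$, heads $H$, head dimension $D$, and $b$ bytes per element:
$$\text{KV cache bytes} = 2 \cdot B \cdot L \cdot N \cdot H \cdot D \cdot b$$
The factor of 2 counts keys and values separately. For a 100B-parameter model serving long contexts, KV cache per request can be GBs.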
Paged attention (vLLM). Manage KV cache like virtual memory; allocate in fixed-size blocks. Enables higher batch density.
KV cache quantization. Quantize cache to lower precision. Memory savings.
Sliding-window cache. Drop old cache entries.
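The cache-size arithmetic can be checked with a few lines. The configuration below (80 layers, 64 heads of dimension 128, a 32K-token context, FP16 cache entries) is a hypothetical dense model chosen for round numbers, and `kv_cache_bytes` is a name made up for this sketch.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for keys and values across all layers (the factor 2 is K plus V)."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

GIB = 2 ** 30

# Hypothetical configuration: every attention head keeps its own keys and values.
full = kv_cache_bytes(batch=1, seq_len=32_768, n_layers=80, n_kv_heads=64, head_dim=128)
# Same model if only 8 key/value heads are stored and shared across query heads.
shared = kv_cache_bytes(batch=1, seq_len=32_768, n_layers=80, n_kv_heads=8, head_dim=128)
# Same model with a 1-byte-per-element quantized cache.
quantized = kv_cache_bytes(1, 32_768, 80, 64, 128, bytes_per_elem=1)

print(full / GIB, shared / GIB, quantized / GIB)   # 80.0 10.0 40.0
```

Under these assumptions a single 32K-token request would need 80 GiB of cache, more than one 80 GB GPU offers after weights, which is why cache-reduction techniques matter for long contexts.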
Speculative decoding
The technique. Use a small “draft” model to predict multiple tokens; verify with large model.
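The procedure. Draft model generates K candidate tokens. Large model scores all K positions in one parallel pass. Accept the longest prefix of draft tokens that the large model agrees with; discard the rest and continue from the large model's own token at the first divergence.
The benefit. Each large-model pass can emit several tokens when the draft and large model agree, which lowers latency. Substantial speedup for many workloads.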
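A minimal sketch of the greedy variant of this procedure follows. The two "models" are toy next-token functions invented for the example (the draft is deliberately wrong after token 4), and the verification loop stands in for what would be a single batched forward pass of the large model. Sampling-based speculative decoding uses a probabilistic acceptance rule that is not shown here.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round: draft proposes k tokens, target verifies, return tokens to append."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):                      # cheap autoregressive drafting
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in proposal:                    # in practice: one parallel target pass
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)       # first divergence: take the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))       # all k accepted: target adds a bonus token
    return accepted

def generate(prefix, draft_next, target_next, k, n_tokens):
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + n_tokens]

def target_next(ctx):                       # toy "large model": count modulo 10
    return (ctx[-1] + 1) % 10

def draft_next(ctx):                        # toy "draft model": wrong after a 4
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

def greedy_reference(prefix, n_tokens):
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out

result = generate([0], draft_next, target_next, k=3, n_tokens=12)
print(result == greedy_reference([0], 12))  # True
```

The output always matches plain greedy decoding with the large model; the draft only changes how many large-model passes are needed.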
Quantization for inference
INT8 and INT4 inference is standard. Substantial memory and latency savings.
Methods.
GPTQ (Frantar et al. 2022). Layer-by-layer quantization.
AWQ (Lin et al. 2023). Activation-aware quantization.
SmoothQuant. Smooth outliers before quantization.
FP8 inference. Native on H100; mainstream by 2024-2025.
Continuous batching
For LLM serving, requests arrive asynchronously. Continuous batching:
Batch active requests dynamically.
Add new requests as old ones complete.
Don’t wait for entire batch to finish.
Substantial throughput improvement vs naive batch serving.
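The difference can be seen in a toy simulation that counts decoding steps. It assumes all requests are already queued at time zero, that each step generates one token for every active request, and that the batch has a fixed number of slots; the function names and request lengths are invented for the example.

```python
from collections import deque

def static_batch_steps(lengths, capacity):
    """Naive serving: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), capacity):
        steps += max(lengths[i:i + capacity])
    return steps

def continuous_batch_steps(lengths, capacity):
    """Continuous batching: freed slots are refilled before every step."""
    queue = deque(lengths)
    active = []                              # remaining tokens per active request
    steps = 0
    while queue or active:
        while queue and len(active) < capacity:
            active.append(queue.popleft())   # admit new requests into free slots
        steps += 1                           # one decoding step for the whole batch
        active = [r - 1 for r in active if r - 1 > 0]
    return steps

lengths = [8, 2, 2, 2, 8, 2, 2, 2]           # output lengths in tokens
print(static_batch_steps(lengths, capacity=4))      # 16
print(continuous_batch_steps(lengths, capacity=4))  # 10
```

In the static case the short requests leave their slots idle while the long request finishes; continuous batching fills those slots immediately.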
Inference platforms
The frameworks. vLLM, TensorRT-LLM (NVIDIA), SGLang, TGI (Hugging Face), LMDeploy. Production inference systems implement the techniques above.
Cloud providers (AWS, GCP, Azure) and AI labs (OpenAI, Anthropic, Google) operate substantial inference infrastructure. The 2026 picture: dedicated inference engineering teams.
Specialized inference hardware
NVIDIA H100, H200, B100 are standard. Other options:
Groq LPU. Inference-specific accelerator. Low latency.
Cerebras inference. Wafer-scale inference (~1000 tokens/sec for 70B model demonstrated 2024).
AWS Inferentia. Inference-specific.
Custom inference chips in some hyperscalers.
The pattern. Inference-specific hardware can be substantially more cost-effective than general training GPUs for high-volume inference.
§10. Architectural Efficiency
Mixture-of-Experts
Cross-reference LLM §4 MoE. From a training-efficiency perspective:
The argument. MoE enables larger total parameter count at same active compute. Better quality per training FLOP.
The cost. Memory (must store all expert parameters). Communication (all-to-all routing). Engineering complexity.
The 2026 picture. MoE is standard for many frontier models (DeepSeek-V3, Mixtral, GPT-4 reportedly). Active vs total parameters distinguishes capability from compute cost.
Grouped-query and multi-query attention
Cross-reference LLM §4. Reduce KV cache size by sharing keys/values across query heads.
MQA (Multi-Query Attention). Single K, V across all query heads. Substantial KV cache reduction; some quality loss.
GQA (Grouped-Query Attention). Groups of query heads share K, V. Balance between MHA quality and MQA efficiency.
The 2026 picture. GQA standard in most modern Transformers.
Sliding window attention
Limit attention to fixed-size window. Reduces compute and memory at expense of full context.
Used in Mistral, Gemma, others. Standard for some models.
Long-context attention
Beyond standard self-attention.
Linear attention (Performer, Linformer, etc.). Approximate attention with linear complexity. Various trade-offs.
State-space models (Mamba). Alternative architecture. Linear in sequence length. Some empirical results competitive with Transformers.
Hybrid Transformer-SSM. Combine. Active research.
The 2026 picture. Transformers still dominant; alternatives compete in specific niches.
Speculative architectures
Active research areas.
Medusa decoding. Predict multiple future tokens with auxiliary heads.
EAGLE decoding. More sophisticated speculative methods.
Lookahead decoding. Various speedup approaches.
Efficient layer designs
SwiGLU. Replaces ReLU/GeLU. Standard in modern Transformers.
RMSNorm. Replaces LayerNorm. Slightly more efficient.
Pre-normalization vs post-normalization. Pre-norm standard in modern.
Rotary position embeddings (RoPE). Replace learned absolute positions. Standard.
The pattern. Modern Transformer designs incorporate many small efficiency improvements. Combined effect substantial.
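As one example of these small savings, RMSNorm drops LayerNorm's mean subtraction and bias term and rescales by the root mean square only. A minimal pure-Python sketch over a single vector, with function names chosen for this example:

```python
import math

def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the standard deviation, scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square and scale; no mean, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

x = [1.0, -1.0, 2.0, -2.0]                   # zero-mean input
ones, zeros = [1.0] * 4, [0.0] * 4
print(rms_norm(x, ones))
print(layer_norm(x, ones, zeros))
```

For zero-mean input the two agree, which shows that the only difference is the centering step (one fewer reduction over the hidden dimension) and the bias.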
§11. Energy and Cost
Training energy
Modern frontier training consumes substantial energy.
Per-training estimates.
GPT-3 (2020). ~1.3 GWh (~$200K electricity).
GPT-4-class (2023-2024). Estimated multi-GWh to 10s of GWh.
Frontier 2025-2026. Estimated 50-200+ GWh per training run.
Per-query estimates. Frontier model inference: ~0.1-1 Wh per typical query.
Aggregate. All AI training and inference in 2025-2026 estimated to consume gigawatts of continuous power globally.
The trajectory. Continued growth. Power consumption is increasingly material to industry and policy discussions.
Cost economics
Frontier training costs.
Compute cost. Dominant. Total training FLOPs multiplied by the price per FLOP on rented infrastructure comes to $100M+.
Engineering cost. $10s of millions for the team.
Data cost. Data acquisition, licensing, processing.
Failures. Failed training runs (debugging, restarts).
Frontier 2025-2026 training: hundreds of millions of dollars to $5B+ per major training run.
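The compute-cost line can be sanity-checked with simple arithmetic. The sketch below reuses the 25,000-GPU, 100-day run from Instance 1; the rental price of $2 per GPU-hour, the rounded peak of 10^15 FLOP/s per GPU, and the 40% utilization are assumptions chosen for illustration, not quoted figures.

```python
def rental_cost_usd(n_gpus, days, usd_per_gpu_hour):
    """Cost of renting n_gpus continuously for the given number of days."""
    return n_gpus * days * 24 * usd_per_gpu_hour

def training_flops(n_gpus, days, peak_flops_per_gpu, utilization):
    """Useful FLOPs delivered at a given fraction of peak throughput."""
    return n_gpus * days * 86_400 * peak_flops_per_gpu * utilization

# Assumed inputs: $2/GPU-hour, 1e15 FLOP/s peak per GPU, 40% utilization.
cost = rental_cost_usd(n_gpus=25_000, days=100, usd_per_gpu_hour=2.0)
flops = training_flops(n_gpus=25_000, days=100, peak_flops_per_gpu=1e15, utilization=0.4)
print(f"${cost / 1e6:.0f}M, {flops:.2e} FLOPs, ${cost / flops:.1e} per FLOP")
```

Under these assumptions the run costs about $120M in rented compute for roughly 8.6e25 FLOPs, which is in line with the $100M+ order of magnitude above.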
Hardware vs cloud economics
The trade-off.
Own hardware. Capex; multi-year amortization; substantial operational overhead. Cost-effective at large scale.
Cloud rental. Opex; immediate availability; flexibility. Cost-effective at smaller scale.
The 2026 pattern. Largest labs own substantial hardware; supplement with cloud. Academic and smaller enterprises primarily use cloud.
The economic structure
A substantial consideration. AI training has high fixed costs (training the model) and moderate marginal costs (inference per query). This creates substantial economies of scale.
The implications.
Concentration. Few organizations can afford frontier training.
Open vs closed. Open-weights models (Llama, R1, etc.) provide alternative path.
Capital intensity. AI is increasingly capital-intensive industry.
The environmental conversation
A substantial 2024-2026 issue. Critics raise environmental concerns about AI training.
Carbon emissions. Depends on electricity source. Some training on renewable energy; some on fossil. Total emissions estimated in millions of tons CO2 per year for frontier-training industry.
Water consumption. Datacenter cooling uses substantial water.
Local impacts. Datacenters affect local power grids, water supplies, communities.
The mitigations. Renewable energy investments. More efficient hardware. More efficient algorithms.
The pattern. Environmental concerns are substantive and under-addressed. Active research and policy attention.
Cost-quality optimization
A specific application of training-cost analysis. For a fixed budget, what’s the optimal allocation?
Chinchilla (Hoffmann et al., DeepMind 2022) provided framework for compute-optimal training (cross-reference Foundation Models §6).
Reasoning models (cross-reference Reasoning Models §7) added inference-time-compute as additional axis.
The 2026 picture. Optimal allocation is active research. Frontier labs invest substantially in optimization studies.
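A common back-of-envelope version of the Chinchilla framework combines the approximation $C \approx 6ND$ (training FLOPs for $N$ parameters and $D$ tokens) with the rule of thumb of roughly 20 tokens per parameter. Both are approximations, and the function name below is hypothetical; the sketch only shows how a fixed compute budget splits between model size and data under those two assumptions.

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget equal to training a 70B-parameter model on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
n, d = chinchilla_allocation(budget)
print(f"{budget:.2e} FLOPs -> {n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")

# Ten times the budget grows both parameters and tokens by sqrt(10), about 3.16x.
n10, d10 = chinchilla_allocation(10 * budget)
print(f"{n10 / n:.2f}x params, {d10 / d:.2f}x tokens")
```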
§12. Open vs Closed and the Ecosystem
The open-weights landscape
Major open-weights models in 2024-2026:
Llama series (Meta). Llama 2 (2023), Llama 3 (2024), Llama 3.1-405B (July 2024), Llama 4 (2025). Substantial frontier capability.
DeepSeek series. DeepSeek-V2 (2024), V3 (Dec 2024), R1 (Jan 2025). Substantial capability with substantially lower training cost claims.
Qwen series (Alibaba). Qwen2, Qwen2.5, Qwen3 (2025). Substantial capability.
Mistral. Various models including Mixtral MoE.
Falcon (TII). Earlier open-weights work.
Google Gemma series. Smaller open-weights from Google.
The pattern. Open-weights models have become substantially competitive with frontier closed models.
Closed-weights frontier
Major closed-weights models in 2026:
OpenAI. GPT-4, GPT-4o, o1, o3, o4 series.
Anthropic. Claude 3, 3.5, 4, 5 series.
Google. Gemini 1.5, 2.0, 2.5 Pro and variants.
xAI. Grok series.
The competitive landscape. Frontier closed models retain some capability lead but margins have narrowed. Open-weights models substantially competitive on many benchmarks.
The closed-source advantages
Why some labs keep weights closed.
Competitive advantage. Capability advantage maintained longer.
Safety concerns. Easier to control deployment of closed models.
Revenue model. API access can be monetized; open weights commoditize the model.
Compliance. Some regulations and policies easier with closed models.
The open-weights advantages
Why some labs release open weights.
Ecosystem development. Open weights enable substantial third-party development.
Reputation and recruiting. Open contributions attract talent.
Safety transparency. Open weights enable third-party safety research.
Strategic positioning. Counter dominance of closed-source competitors.
Lower marginal control. Lower competitive pressure in some areas.
The R1 inflection
January 2025. DeepSeek-R1 (open-weights reasoning model with substantial capability and disclosed methodology) substantially shifted the open vs closed dynamic. Demonstrated:
Frontier reasoning capability achievable with open methodology.
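Substantially lower training costs possible (DeepSeek reported ~$5.6M in GPU rental cost for the final V3 training run, excluding prior research and ablation experiments).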
Reproducibility of frontier-lab methodology.
The impact. Substantial discussion about competitive dynamics, US-China dynamics, and the future of open vs closed.
The infrastructure platform layer
A specific 2024-2026 development. Infrastructure platforms for fine-tuning, inference, and deployment of (especially open-weights) models.
Hugging Face. Model hub; transformers library; inference; fine-tuning.
Together AI, Anyscale, Modal. Compute platforms.
vLLM, TensorRT-LLM, SGLang. Inference frameworks.
Replicate, Fireworks, Groq. Inference-as-a-service.
The pattern. Substantial platform layer enables broader access to frontier capability.
Compute access economics
A substantial 2026 reality. Open weights enable broader access - but training and inference still require substantial compute.
The cost structure. Open weights eliminate model-licensing cost but compute cost remains. For frontier-scale fine-tuning or training, compute remains a substantial barrier.
The democratization is partial. Open weights + cheap inference (small model or distilled) enables substantial democratization. Open weights + expensive inference (large frontier model) is less democratizing.
§13. Connections to Other Chapters
This chapter sits at the intersection of multiple chapters:
Deep Learning (Chapter 1) provides the foundational training algorithms (backpropagation, optimizers); this chapter scales them.
Foundation Models (§3, §6) covers training compute as one of three pillars and scaling laws; this chapter develops the infrastructure.
Large Language Models (§4, §5) covers LLM architectures (MoE, attention variants) and pretraining; this chapter develops the underlying systems.
Reinforcement Learning (§10, §11, §12) covers RLHF and reasoning RL; this chapter covers the infrastructure for these.
Self-Supervised Learning (§3) covers pretraining objectives; this chapter covers the training infrastructure.
Generative Models (§8) covers diffusion training; this chapter covers the infrastructure.
AI for Science (§6) covers training for science; this chapter covers training infrastructure.
Reasoning Models (§9) covers distillation; this chapter covers efficient fine-tuning generally.
Retrieval-Augmented Generation (§7) covers retrieval-related infrastructure; this chapter covers training infrastructure.
AI Agents (§9) covers agent frameworks; this chapter covers the underlying compute infrastructure.
Alignment (§4) covers RLHF; this chapter covers RLHF infrastructure.
Evaluation covers training-quality evaluation; this chapter covers training-cost dimensions.
The pattern. Training infrastructure underlies essentially all modern AI development. This chapter is foundational to many other chapters’ subjects.
§14. Critiques, Limitations, and Open Problems
The scaled-training paradigm is substantial but raises substantial critiques.
Critique 1: Compute and capital concentration
The position. Frontier training requires substantial capital. This concentrates AI development in a few well-resourced organizations.
The evidence. Frontier training costs run from hundreds of millions of dollars to $5B+. Only a handful of organizations globally can afford this. Academic AI labs are essentially excluded from frontier training.
The response. Open weights (Llama, R1) provide alternative access path. Efficient fine-tuning (LoRA, QLoRA) enables broader adaptation. Distillation enables broader capability access. Compute costs decreasing over time.
The honest reading. Frontier training is substantively concentrated. The democratization story is partial. Capital intensity is a substantive concern for the AI field’s structure.
Critique 2: Environmental cost
The position. Frontier training consumes substantial energy and water. The environmental cost is substantial and growing.
The evidence. Frontier-training infrastructure draws gigawatts. Aggregate AI industry consumption estimated at substantial fraction of total IT industry power.
The response. Renewable energy investments. Hardware efficiency improvements. Algorithmic efficiency improvements. Environmental cost per useful capability decreasing.
The honest reading. Environmental cost is substantive. Mitigations help but don’t eliminate. Active policy and engineering attention warranted.
Critique 3: Diminishing returns to scale
The position. Pre-training scaling has hit diminishing returns. Doubling training compute no longer doubles capability.
The evidence. GPT-4 → GPT-4o → o1/o3 trajectory suggests inference-time scaling produces more gain than additional pretraining scale.
The response. Pre-training scaling continues to produce gains. Inference-time scaling is additive not substitute. New architectures (MoE, alternative attention) provide additional capability per FLOP.
The honest reading. The marginal benefit of additional pretraining scale is less than 2020-2022 trajectory suggested. Inference-time scaling is genuinely additive. The future likely involves both pretraining and inference scaling.
Critique 4: Data limits
The position. Frontier models have substantially consumed the available high-quality text data on the internet. Continued data scaling may be limited.
The evidence. Estimates suggest frontier 2025-2026 training has used substantial fraction of available high-quality data. Synthetic data has quality limits.
The response. Multimodal data (video, audio) is substantially less consumed. Synthetic data quality improving. Continued data quality improvements via filtering.
The honest reading. Text-data limits are substantive. Multimodal data and synthetic data are partial answers but not complete substitutes.
Critique 5: Reliability and engineering brittleness
The position. Frontier training is substantially brittle. Failures common; debugging hard; reproducibility limited.
The evidence. Public discussions from frontier labs reveal substantial reliability engineering investment. Multiple major training-run failures documented.
The response. Reliability engineering improving substantially. Automation, monitoring, checkpointing all improving.
The honest reading. Training reliability remains a substantive engineering challenge. Substantial investment continues; substantial work remains.
Open problems
OP-EST-1. Efficient pretraining algorithms. Current pretraining is substantially inefficient compared to theoretical bounds. How to substantially improve sample efficiency, compute efficiency? Active research; substantial gains possible.
OP-EST-2. Power-constrained training. As power becomes binding, how to maximize capability per gigawatt? Algorithmic efficiency, hardware efficiency, data efficiency.
OP-EST-3. Distributed training communication. Communication often dominates compute. Reducing communication is substantial efficiency frontier. Quantized gradients, asynchronous methods, better parallelism.
OP-EST-4. Beyond Transformers. Are there fundamentally more efficient architectures than Transformer for foundation-model scale? Mamba, RWKV, state-space models all compete. Open question.
OP-EST-5. Continual learning at scale. Frontier models are trained, then deployed. Updating them with new data is mostly retraining-from-checkpoint. Substantially more efficient continual learning would be valuable. Active research.
OP-EST-6. Federated training of frontier models. Can frontier capability be trained collaboratively across multiple organizations without sharing data or model weights? Privacy and competitive considerations. Active research.
OP-EST-7. Training-data licensing and copyright. Substantial unresolved legal and ethical questions about training-data sources. Multiple lawsuits pending. Regulatory developments. Substantial open issues.
OP-EST-8. Specialized hardware design. What hardware would be optimal for foundation-model training? GPUs are general-purpose; substantial inefficiencies. Custom AI chips (TPUs, Trainium) capture some efficiency. Substantial design-space exploration.
OP-EST-9. Training-cost transparency. Industry practice keeps training costs largely opaque. Better transparency would aid policy, research, and competitive analysis. Open issue.
OP-EST-10. Sustainable AI infrastructure. Long-term sustainability of current trajectory. Energy, hardware supply, capital, environmental impact. Substantial open questions about future trajectory.
§15. Further Reading
Below is an annotated reading list. Selection emphasizes high-impact, accessible-yet-deep references.
Foundational distributed training
Dean et al. (2012). “Large Scale Distributed Deep Networks.” DistBelief; foundational distributed training.
Goyal et al. (2017). “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.” Large-batch training.
Sergeev and Del Balso (2018). “Horovod.” Ring all-reduce.
Tensor and pipeline parallelism
Shoeybi et al. (2019). “Megatron-LM.” Tensor parallelism.
Huang et al. (2019). “GPipe.” Pipeline parallelism.
Narayanan et al. (2021). “Efficient Large-Scale Language Model Training on GPU Clusters.” Combined parallelism.
ZeRO and FSDP
Rajbhandari et al. (2020). “ZeRO.” Memory optimization.
Rajbhandari et al. (2021). “ZeRO-Infinity.” Extreme-scale.
Zhao et al. (2023). “PyTorch FSDP.”
Mixed precision
Micikevicius et al. (2017). “Mixed Precision Training.”
NVIDIA Transformer Engine documentation. FP8 training.
DeepSeek-V3 paper (2024). FP8 training at scale.
Attention efficiency
Dao et al. (2022). “FlashAttention.”
Dao (2023). “FlashAttention-2.”
Pope et al. (2023). “Efficiently Scaling Transformer Inference.”
MoE
Shazeer et al. (2017). “Outrageously Large Neural Networks.” Sparsely-Gated MoE.
Fedus et al. (2022). “Switch Transformers.”
Lepikhin et al. (2020). “GShard.” MoE at scale.
Gale et al. (2023). “MegaBlocks.” Efficient MoE.
Efficient fine-tuning
Houlsby et al. (2019). “Parameter-Efficient Transfer Learning for NLP.” Adapters.
Hu et al. (2021). “LoRA.” Low-Rank Adaptation.
Dettmers et al. (2023). “QLoRA.” 4-bit + LoRA.
Liu et al. (2024). “DoRA.”
Quantization
Dettmers et al. (2022). “LLM.int8().”
Frantar et al. (2022). “GPTQ.”
Lin et al. (2023). “AWQ.”
Xiao et al. (2022). “SmoothQuant.”
Scaling laws
Kaplan et al. (2020). “Scaling Laws for Neural Language Models.”
Hoffmann et al. (2022). “Training Compute-Optimal Large Language Models.” Chinchilla.
Reliability
Jiang et al. (2024). “MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs.” Reliability engineering.
Inference optimization
Kwon et al. (2023). “vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention.”
Leviathan et al. (2022). “Fast Inference from Transformers via Speculative Decoding.”
Energy and economics
Patterson et al. (2021). “Carbon Emissions and Large Neural Network Training.”
Patterson et al. (2022). “The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink.”
Frontier-training case studies
DeepSeek-V3 paper (2024). Frontier training methodology disclosure.
DeepSeek-R1 paper (2025). Reasoning RL methodology.
Llama 3 paper (Meta, 2024). Training methodology.
OPT paper (Meta, 2022). Open-pretrained transformer; first major open-weights frontier model.
Surveys
Hagemann et al. (2024). “Efficient Parallelization Layouts for Large-Scale Distributed Model Training.”
Various efficiency surveys.
Reading order
For a structured study path:
Start with parallelism basics. Megatron, GPipe, ZeRO papers.
Then mixed precision. Micikevicius 2017.
Then attention efficiency. Flash Attention.
Then MoE. Switch Transformers, GShard.
Then efficient fine-tuning. LoRA, QLoRA.
Then reliability. MegaScale.
Then frontier case studies. DeepSeek-V3 and R1 papers.
Then inference. vLLM and speculative decoding.
§16. Exercises and Experiments
The following exercises are research-style.
E1. Single-GPU baseline. Train a small Transformer (10M-100M params) on a small dataset (TinyStories, WikiText-2). Measure throughput and memory usage.
E2. Data parallelism scaling. Scale single-GPU training to 4-8 GPUs with DP (DDP in PyTorch). Measure throughput scaling efficiency. Identify communication bottlenecks.
E3. ZeRO/FSDP scaling. Train a model that doesn’t fit on a single GPU using FSDP. Measure memory savings and throughput.
E4. Mixed precision comparison. Train same model in FP32, BF16, FP8 (if hardware supports). Compare throughput, memory, final quality.
E5. Activation checkpointing analysis. With and without activation checkpointing, measure memory and compute trade-off.
E6. LoRA fine-tuning. Fine-tune a 7B-70B open-weights model with LoRA on a domain task. Measure quality vs full fine-tuning.
E7. QLoRA. QLoRA fine-tune an open-weights model on consumer GPUs. Document hardware requirements.
E8. Quantization study. Quantize an LLM to INT8, INT4. Measure quality degradation across benchmarks.
E9. Inference optimization. Set up vLLM or TensorRT-LLM. Measure throughput vs naive PyTorch generation.
E10. Cost analysis. For a specific model-training task, calculate cost across cloud providers and hardware options. Identify Pareto-optimal options.