Training large neural networks often runs into a hard constraint: GPU memory. Even when parameter weights fit, the activations saved for backpropagation can dominate memory usage—especially for long sequence lengths and large batch sizes. Gradient checkpointing (also called activation checkpointing) reduces this activation memory by storing only a subset of intermediate activations during the forward pass and recomputing the missing activations during the backward pass. This page’s calculator turns that qualitative “memory vs time” trade-off into a simple quantitative estimate.
The key input is the checkpoint interval: activations are kept only every I layers (a “segment” length). A smaller I saves more memory but increases recomputation. The calculator separates memory into (a) parameter memory and (b) activation memory. Parameter memory is straightforward: parameter count × bytes per parameter.
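As a minimal sketch of that calculation (counting only raw weights, not gradients or optimizer state; the helper name is illustrative, not the calculator's code):

```python
def param_memory_bytes(p_billion: float, bytes_per_param: int = 2) -> float:
    """Raw weight memory: parameter count x bytes per parameter.

    p_billion: parameter count in billions.
    bytes_per_param: 2 for bf16/fp16 weights, 4 for fp32.
    Gradients and optimizer state are deliberately excluded.
    """
    return p_billion * 1e9 * bytes_per_param

# Example: 7B parameters in bf16 -> 14e9 bytes, roughly 13 GiB of weights.
```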
Activation memory is modeled assuming each layer holds a hidden-state tensor of shape roughly [B_s, S, H]. A simple baseline estimate (no checkpointing) is:
M_a = 2 × H × S × L × B_s × b
The factor of 2 is a crude way to account for storing forward activations and backward-related buffers. Different frameworks and kernels can make this factor meaningfully different, so treat it as an approximation.
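For concreteness, here is the same estimate as a small Python helper (a direct transcription of the formula above, carrying the same caveat about the factor of 2):

```python
def baseline_activation_memory(H: int, S: int, L: int, B_s: int, b: int) -> int:
    """M_a = 2 * H * S * L * B_s * b, in bytes (no checkpointing).

    The leading 2 is the crude forward/backward-buffer allowance described
    in the text, not a framework-exact constant.
    """
    return 2 * H * S * L * B_s * b
```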
With checkpointing, you store only the boundary activations for each segment and recompute the interior activations during backprop. In this simplified model, activation memory scales with the segment length I instead of total layers L:
M_c = 2 × H × S × I × B_s × b
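In code, the only change from the baseline sketch is swapping the total layer count L for the segment length I:

```python
def checkpointed_activation_memory(H: int, S: int, I: int, B_s: int, b: int) -> int:
    """M_c = 2 * H * S * I * B_s * b, in bytes: in this simplified model,
    activation memory scales with the segment length I instead of all L layers."""
    return 2 * H * S * I * B_s * b
```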
Memory saved:
S_m = M_a − M_c

% saved = (S_m / M_a) × 100%

The basic intuition behind the time estimate: checkpointing adds extra forward computation during the backward pass, because the missing activations must be recomputed before gradients can flow through them. A common back-of-the-envelope model is:
T_c = T_b × (1 + L / (2I))
Here, L/I is the number of segments, and 1/2 assumes a forward pass is about half the cost of a full step (forward+backward). Real models can deviate depending on attention implementation, activation recompute efficiency, kernel fusion, and communication overlap.
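The same timing model as a helper, with the caveat that real overhead can land above or below it:

```python
def checkpointed_step_time(T_b: float, L: int, I: int) -> float:
    """T_c = T_b * (1 + L / (2 * I)): one extra forward-equivalent per segment,
    with a forward pass assumed to cost about half of a full training step."""
    return T_b * (1 + L / (2 * I))
```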
In practice, checkpointing lets you trade time for memory: you can fit a larger S or B_s at the expense of slower steps, and the activation-memory estimate returns to the no-checkpointing baseline as I → L.

Suppose you train a 7B-parameter transformer with:
- P = 7 (billions of parameters)
- H = 4096, L = 32
- S = 1024, B_s = 2
- b = 2 bytes (bf16/fp16 activations)
- I = 4 layers per segment
- T_b = 1.5 s

Baseline activation memory:
M_a = 2 × 4096 × 1024 × 32 × 2 × 2 = 1,073,741,824 bytes ≈ 1.00 GiB
Checkpointed activation memory:
M_c = 2 × 4096 × 1024 × 4 × 2 × 2 = 134,217,728 bytes ≈ 0.125 GiB
Saved:
S_m ≈ 0.875 GiB

% saved ≈ 87.5%

Time overhead:
T_c = 1.5 × (1 + 32/(2×4)) = 1.5 × (1 + 4) = 7.5 s
This is a deliberately simple model, but it shows the core trade: a large activation-memory reduction can come with a large recomputation penalty when I is small.
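The worked example can be reproduced with the sketch helpers defined above (their names are illustrative, not the calculator's):

```python
H, S, L, B_s, b = 4096, 1024, 32, 2, 2
I, T_b = 4, 1.5

M_a = baseline_activation_memory(H, S, L, B_s, b)      # 1,073,741,824 bytes
M_c = checkpointed_activation_memory(H, S, I, B_s, b)  # 134,217,728 bytes

print(M_a / 2**30)                        # 1.0   GiB
print(M_c / 2**30)                        # 0.125 GiB
print(100 * (M_a - M_c) / M_a)            # 87.5  (% saved)
print(checkpointed_step_time(T_b, L, I))  # 7.5   (seconds)
```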
| Checkpoint interval I (layers) | Activation memory scaling | Estimated time multiplier | Typical use-case |
|---|---|---|---|
| 1 | ~1/L of baseline | ~1 + L/2 | Extreme memory pressure; expect large slowdown |
| 4 | ~4/L of baseline | ~1 + L/8 | Common compromise for many transformer stacks |
| 8 | ~8/L of baseline | ~1 + L/16 | Moderate savings with milder overhead |
| L (no checkpointing) | Baseline | ~1× | When you have enough memory or want max speed |
A few caveats and practical guidelines:

- Activation memory relies on the simple per-layer [B_s, S, H] hidden-state approximation.
- The bytes-per-element value b is likewise an approximation; the buffers actually saved vary by kernel and framework.
- B_s is the per-device microbatch size. With gradient accumulation, the effective global batch can be larger without changing per-step activation memory.
- The time multiplier (1 + L/(2I)) assumes recompute cost is one extra forward per segment and that a forward pass is about half a step. Kernel fusion, activation recompute efficiency, and communication overlap can push real overhead above or below this estimate.
- Parallelism (tensor, pipeline, or data) changes how L, H, and B_s map to per-device memory and time; consider this a single-device approximation unless you adapt the inputs accordingly.
- When memory pressure is moderate, prefer larger intervals (e.g., I = 8 or I = 16) to reduce overhead.
- When you are constrained by a large S (long context), smaller intervals (e.g., I = 1–4) can help; budget for slower steps and validate with a profiler.
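If you want to validate these estimates against a real run, the checkpoint interval I maps onto PyTorch's `torch.utils.checkpoint.checkpoint_sequential`, which takes a number of segments (roughly L // I). A minimal sketch, assuming a recent PyTorch and using deliberately tiny dimensions so it runs anywhere; the layer stack and sizes are illustrative, not the 7B configuration above:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Tiny stand-in stack of L transformer blocks (small sizes so the demo runs anywhere).
L, H, I = 8, 256, 4
blocks = nn.Sequential(
    *[nn.TransformerEncoderLayer(d_model=H, nhead=8, batch_first=True) for _ in range(L)]
)

segments = L // I                               # checkpoint interval I -> segment count
x = torch.randn(2, 128, H, requires_grad=True)  # [B_s, S, H]

out = checkpoint_sequential(blocks, segments, x, use_reentrant=False)
out.sum().backward()                            # interior activations are recomputed here
```

Comparing peak memory (e.g., `torch.cuda.max_memory_allocated()`) and step time with and without the checkpointed forward is the profiler-style validation suggested above.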