Empirical scaling laws summarize how training loss tends to improve as you increase training resources. This calculator focuses on the common “data scaling” relationship where loss decreases as a power law in the number of training tokens.
What this calculator estimates
Given a baseline training run with token count N0 and observed training loss L0, plus a scaling exponent α and an irreducible loss floor B, the calculator:
- Fits the constant A implied by the baseline point.
- Projects the expected loss at a new token count N1.
- Optionally estimates how many tokens would be required to reach a target loss.
Definitions and units
- Tokens (N): total training tokens consumed (often “tokens seen,” including repeats across epochs). Use the same definition for N0 and N1.
- Loss (L): typically training cross-entropy / negative log-likelihood (NLL) averaged per token. It is not accuracy.
- Irreducible loss (B): a floor the power law approaches as N grows; it captures limits from data quality, architecture, optimization, label noise, and evaluation setup.
- Scaling exponent (α): positive number controlling diminishing returns. Larger α means faster improvement with added tokens.
The scaling-law formula
The calculator uses the common form:
L(N) = A × N^(−α) + B
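For readers who prefer code, here is a minimal Python sketch of this form; the function name loss_at and its argument names are illustrative, not part of the calculator:

```python
def loss_at(n_tokens: float, A: float, alpha: float, B: float) -> float:
    """Data-scaling power law: L(N) = A * N^(-alpha) + B."""
    return A * n_tokens ** (-alpha) + B
```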
Solving for A from the baseline
Using the baseline observation (N0, L0):
A = (L0 − B) × N0^α
This requires L0 > B. If L0 is less than or equal to B, the fitted A is non-positive and the model no longer represents a diminishing-loss curve.
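A matching sketch of this fitting step, with the L0 > B requirement made explicit (fit_A is again an illustrative name):

```python
def fit_A(n0: float, l0: float, alpha: float, B: float) -> float:
    """Fit A from one baseline point (N0, L0): A = (L0 - B) * N0^alpha."""
    if l0 <= B:
        raise ValueError("Baseline loss L0 must exceed the irreducible floor B.")
    return (l0 - B) * n0 ** alpha
```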
Projecting loss at N1
Once A is known, the projected loss at N1 is:
L(N1) = A × N1^(−α) + B
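Chaining the two sketches above, projection is just evaluating the fitted curve; here with the calculator's default inputs (the same numbers as the worked example below):

```python
A = fit_A(n0=1_000_000, l0=2.5, alpha=0.1, B=1.0)   # baseline fit
print(loss_at(5_000_000, A, alpha=0.1, B=1.0))      # ≈ 2.28
```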
Solving for tokens needed to reach a target loss
If you provide a target loss Ltarget (must satisfy Ltarget > B), then:
Ntarget = (A / (Ltarget − B))^(1/α)
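In the same illustrative style, the inversion can be sketched as:

```python
def tokens_for_target(A: float, l_target: float, alpha: float, B: float) -> float:
    """Invert the curve: N_target = (A / (L_target - B))^(1/alpha)."""
    if l_target <= B:
        raise ValueError("Target loss must exceed the irreducible floor B.")
    return (A / (l_target - B)) ** (1.0 / alpha)
```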
How to interpret the results
- Projected loss is an estimate of training loss after convergence under a similar training recipe. It does not automatically translate to downstream task performance.
- Marginal gains shrink as N increases because N^(−α) flattens; the curve asymptotically approaches B.
- Sensitivity to α is high: small changes in α can cause large changes in Ntarget, especially when you aim close to B (see the sketch after this list).
- Extrapolation risk rises the further N1 is from N0. Scaling laws often hold over ranges, not indefinitely.
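To make the sensitivity point concrete, the sketch below varies α slightly around 0.1 while holding the baseline and target fixed (the numbers match the worked example that follows; they are illustrative, not output of the calculator):

```python
# Vary alpha around 0.1 with N0 = 1e6, L0 = 2.5, B = 1.0, target loss 1.5.
N0, L0, B, L_target = 1_000_000, 2.5, 1.0, 1.5

for alpha in (0.09, 0.10, 0.11):
    A = (L0 - B) * N0 ** alpha
    n_target = (A / (L_target - B)) ** (1 / alpha)
    print(f"alpha={alpha:.2f}  N_target ≈ {n_target:.2e}")

# N_target swings from roughly 2e11 (alpha = 0.09) down to roughly 2e10 (alpha = 0.11):
# a ±10% change in alpha shifts the token requirement by roughly 3x in each direction.
```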
Worked example (matches the default inputs)
Suppose you observed:
- N0 = 1,000,000 tokens
- L0 = 2.5
- α = 0.1
- B = 1.0
- N1 = 5,000,000 tokens
First compute A:
- N0^α = (1,000,000)^0.1 = 10^0.6 ≈ 3.981
- A = (2.5 − 1.0) × 3.981 ≈ 5.972
Now project loss at N1:
- N1^(−α) = (5,000,000)^(−0.1) ≈ 0.214
- L(N1) ≈ 5.972 × 0.214 + 1.0 ≈ 2.28
If you also set Ltarget = 1.5:
- Ntarget = (5.972 / (1.5 − 1.0))^(1/0.1) = (11.944)^10 ≈ 6.0 × 10^10 tokens
This illustrates a common takeaway: pushing training loss close to the floor B can require enormous increases in tokens.
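The full example can be reproduced with a few lines of standalone Python (a sketch using the default inputs above):

```python
N0, L0, alpha, B = 1_000_000, 2.5, 0.1, 1.0

A = (L0 - B) * N0 ** alpha                   # ≈ 5.972
loss_at_n1 = A * 5_000_000 ** (-alpha) + B   # ≈ 2.28
n_target = (A / (1.5 - B)) ** (1 / alpha)    # ≈ 6e10 tokens for Ltarget = 1.5

print(A, loss_at_n1, n_target)
```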
Quick comparison: what changes when you scale data?
| Change | What happens to L(N)? | Practical implication |
| --- | --- | --- |
| Increase N (more tokens) | L decreases roughly as N^(−α) until it nears B | Diminishing returns; the biggest gains come earliest |
| Increase α | Curve falls faster with N | Fewer extra tokens needed for the same loss drop |
| Increase B | Loss floor rises; all projections shift upward | Data, architecture, or label noise may be limiting |
| Increase L0 with the same N0 | Implied A increases | A worse baseline implies higher losses at all N unless you change B or α |
Assumptions & limitations (read before acting on projections)
- Single-factor scaling: This page only scales with token count. If you also change model size, context length, optimizer, batch size, training schedule, data mixture, or augmentation, the fitted A, α, and B may change.
- Training vs validation: The formula is most often reported for training loss (or validation loss under fixed evaluation). If your L0 is validation loss, keep that consistent for all comparisons, and expect different parameters than training loss.
- Domain and data quality dependence: Scaling behavior depends strongly on dataset distribution and cleanliness. Adding low-quality tokens can reduce effective gains; in extreme cases it can worsen outcomes even if N increases.
- Requirement that L > B: You must choose B smaller than your observed and target losses. If B is set too high, the algebra implies negative/undefined values.
- Extrapolation limits: Power-law fits are empirical. Predictions far beyond the baseline (orders of magnitude) can be unreliable—use them as planning heuristics, not guarantees.
- Not a downstream KPI forecast: A lower cross-entropy does not map linearly to metrics like accuracy, win rate, or task-specific scores; improvements may saturate differently.
- Token accounting ambiguity: “Tokens” may mean unique tokens, total tokens seen, or effective tokens after deduplication/filtering. Mixing conventions will distort results.
Practical tips
- If you have multiple runs, fit α and B (or at least α) from data rather than borrowing a generic value.
- Use N1/N0 as a sanity check: if you only scale tokens by 2× and α is small (e.g., 0.05–0.1), expect modest loss changes (see the quick check after this list).
- Treat a target loss close to B (and the enormous Ntarget it implies) as a warning sign: improving data quality or architecture may be more cost-effective than brute-force scaling.
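As a quick check of the 2× tip above: doubling N multiplies the reducible gap (L − B) by 2^(−α), so for small α the gap barely moves.

```python
# Effect of doubling tokens on the reducible gap (L - B) for small alpha.
for alpha in (0.05, 0.10):
    retained = 2 ** (-alpha)   # multiplier on (L - B) when N doubles
    print(f"alpha={alpha:.2f}: gap multiplied by {retained:.3f} "
          f"(~{(1 - retained) * 100:.1f}% reduction)")
```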
Enter baseline metrics and scaling parameters to project performance.