Empirical scaling laws summarize how training loss tends to improve as you increase training resources. This calculator focuses on the common “data scaling” relationship where loss decreases as a power law in the number of training tokens.
What this calculator estimates
Given a baseline training run with token count N0 and observed training loss L0, plus a scaling exponent α and an irreducible loss floor B, the calculator:
- Fits the constant A implied by the baseline point.
- Projects the expected loss at a new token count N1.
- Optionally estimates how many tokens would be required to reach a target loss.
Definitions and units
- Tokens (N): total training tokens consumed (often “tokens seen,” including repeats across epochs). Use the same definition for N0 and N1.
- Loss (L): typically training cross-entropy / negative log-likelihood (NLL) averaged per token. It is not accuracy.
- Irreducible loss (B): a floor the power law approaches as N grows; it captures limits from data quality, architecture, optimization, label noise, and evaluation setup.
- Scaling exponent (α): positive number controlling diminishing returns. Larger α means faster improvement with added tokens.
The scaling-law formula
The calculator uses the common form:
L(N) = A × N^(−α) + B
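For readers who prefer code, here is a minimal Python sketch of this form; the function name loss_at and its argument names are illustrative, not part of the calculator:

```python
def loss_at(n_tokens: float, A: float, alpha: float, B: float) -> float:
    """Data-scaling power law: L(N) = A * N^(-alpha) + B."""
    return A * n_tokens ** (-alpha) + B
```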
Solving for A from the baseline
Using the baseline observation (N0, L0):
A = (L0 − B) × N0^α
This requires L0 > B. If L0 is less than or equal to B, the fitted A is non-positive and the model no longer represents a diminishing-loss curve.
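A matching sketch of this fitting step, with the L0 > B requirement made explicit (fit_A is again an illustrative name):

```python
def fit_A(n0: float, l0: float, alpha: float, B: float) -> float:
    """Fit A from one baseline point (N0, L0): A = (L0 - B) * N0^alpha."""
    if l0 <= B:
        raise ValueError("Baseline loss L0 must exceed the irreducible floor B.")
    return (l0 - B) * n0 ** alpha
```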
Projecting loss at N1
Once A is known, the projected loss at N1 is:
L(N1) = A × N1^(−α) + B
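Chaining the two sketches above, projection is just evaluating the fitted curve; here with the calculator's default inputs (the same numbers as the worked example below):

```python
A = fit_A(n0=1_000_000, l0=2.5, alpha=0.1, B=1.0)   # baseline fit
print(loss_at(5_000_000, A, alpha=0.1, B=1.0))      # ≈ 2.28
```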
Solving for tokens needed to reach a target loss
If you provide a target loss Ltarget (must satisfy Ltarget > B), then:
Ntarget = (A / (Ltarget − B))^(1/α)
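In the same illustrative style, the inversion can be sketched as:

```python
def tokens_for_target(A: float, l_target: float, alpha: float, B: float) -> float:
    """Invert the curve: N_target = (A / (L_target - B))^(1/alpha)."""
    if l_target <= B:
        raise ValueError("Target loss must exceed the irreducible floor B.")
    return (A / (l_target - B)) ** (1.0 / alpha)
```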
How to interpret the results
- Projected loss is an estimate of training loss after convergence under a similar training recipe. It does not automatically translate to downstream task performance.
- Marginal gains shrink as N increases because N^(−α) flattens; the curve asymptotically approaches B.
- Sensitivity to α is high: small changes in α can cause large changes in Ntarget, especially when you aim close to B (see the sketch after this list).
- Extrapolation risk rises the further N1 is from N0. Scaling laws often hold over ranges, not indefinitely.
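To make the sensitivity point concrete, the sketch below varies α slightly around 0.1 while holding the baseline and target fixed (the numbers match the worked example that follows; they are illustrative, not output of the calculator):

```python
# Vary alpha around 0.1 with N0 = 1e6, L0 = 2.5, B = 1.0, target loss 1.5.
N0, L0, B, L_target = 1_000_000, 2.5, 1.0, 1.5

for alpha in (0.09, 0.10, 0.11):
    A = (L0 - B) * N0 ** alpha
    n_target = (A / (L_target - B)) ** (1 / alpha)
    print(f"alpha={alpha:.2f}  N_target ≈ {n_target:.2e}")

# N_target swings from roughly 2e11 (alpha = 0.09) down to roughly 2e10 (alpha = 0.11):
# a ±10% change in alpha shifts the token requirement by roughly 3x in each direction.
```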
Worked example (matches the default inputs)
Suppose you observed:
- N0 = 1,000,000 tokens
- L0 = 2.5
- α = 0.1
- B = 1.0
- N1 = 5,000,000 tokens
First compute A:
- N0^α = (1,000,000)^0.1 = 10^0.6 ≈ 3.981
- A = (2.5 − 1.0) × 3.981 ≈ 5.972
Now project loss at N1:
- N1^(−α) = (5,000,000)^(−0.1) ≈ 0.214
- L(N1) ≈ 5.972 × 0.214 + 1.0 ≈ 2.28
If you also set Ltarget = 1.5:
- Ntarget = (5.972 / (1.5 − 1.0))^(1/0.1) = (11.944)^10 ≈ 6.0 × 10^10 tokens
This illustrates a common takeaway: pushing training loss close to the floor B can require enormous increases in tokens.
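The full example can be reproduced with a few lines of standalone Python (a sketch using the default inputs above):

```python
N0, L0, alpha, B = 1_000_000, 2.5, 0.1, 1.0

A = (L0 - B) * N0 ** alpha                   # ≈ 5.972
loss_at_n1 = A * 5_000_000 ** (-alpha) + B   # ≈ 2.28
n_target = (A / (1.5 - B)) ** (1 / alpha)    # ≈ 6e10 tokens for Ltarget = 1.5

print(A, loss_at_n1, n_target)
```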
Quick comparison: what changes when you scale data?
| Change | What happens to L(N)? | Practical implication |
| --- | --- | --- |
| Increase N (more tokens) | L decreases roughly as N^(−α) until it nears B | Diminishing returns; the biggest gains come earliest |
| Increase α | Curve falls faster with N | Fewer extra tokens needed for the same loss drop |
| Increase B | Loss floor rises; all projections shift upward | Data, architecture, or label noise may be limiting |
| Increase L0 with the same N0 | Implied A increases | A worse baseline implies higher losses at all N unless you change B or α |
Assumptions & limitations (read before acting on projections)
- Single-factor scaling: This page only scales with token count. If you also change model size, context length, optimizer, batch size, training schedule, data mixture, or augmentation, the fitted A, α, and B may change.
- Training vs validation: The formula is most often reported for training loss (or validation loss under fixed evaluation). If your L0 is validation loss, keep that consistent for all comparisons, and expect different parameters than training loss.
- Domain and data quality dependence: Scaling behavior depends strongly on dataset distribution and cleanliness. Adding low-quality tokens can reduce effective gains; in extreme cases it can worsen outcomes even if N increases.
- Requirement that L > B: You must choose B smaller than your observed and target losses. If B is set too high, the algebra implies negative/undefined values.
- Extrapolation limits: Power-law fits are empirical. Predictions far beyond the baseline (orders of magnitude) can be unreliable—use them as planning heuristics, not guarantees.
- Not a downstream KPI forecast: A lower cross-entropy does not map linearly to metrics like accuracy, win rate, or task-specific scores; improvements may saturate differently.
- Token accounting ambiguity: “Tokens” may mean unique tokens, total tokens seen, or effective tokens after deduplication/filtering. Mixing conventions will distort results.
Practical tips
- If you have multiple runs, fit α and B (or at least α) from data rather than borrowing a generic value.
- Use N1/N0 as a sanity check: if you only scale tokens by 2× and α is small (e.g., 0.05–0.1), expect modest loss changes (see the quick check after this list).
- Treat a target loss close to B (and the enormous Ntarget it implies) as a warning sign: improving data quality or architecture may be more cost-effective than brute-force scaling.
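As a quick check of the 2× tip above: doubling N multiplies the reducible gap (L − B) by 2^(−α), so for small α the gap barely moves.

```python
# Effect of doubling tokens on the reducible gap (L - B) for small alpha.
for alpha in (0.05, 0.10):
    retained = 2 ** (-alpha)   # multiplier on (L - B) when N doubles
    print(f"alpha={alpha:.2f}: gap multiplied by {retained:.3f} "
          f"(~{(1 - retained) * 100:.1f}% reduction)")
```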
Enter baseline metrics and scaling parameters to project performance.