Why prompt caching matters
LLM applications often see repeated or near-duplicate requests: customer support macros, recurring code scaffolds, standardized tutoring questions, and product Q&A. If every request is executed from scratch, you repeatedly pay for the same prompt and completion tokens and repeatedly wait for the model to regenerate the same output. Prompt/response caching avoids this by storing a previous response (and/or intermediate model state, depending on the system) and reusing it when the same prompt arrives again.
This calculator provides a simple, transparent estimate of (1) token cost savings and (2) latency savings when repeats are served from cache instead of the model.
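To make the mechanism concrete, here is a minimal sketch of an exact-match prompt/response cache in Python. The SHA-256 key and the call_model callable are illustrative assumptions, not a specific provider API.

```python
# Minimal exact-match prompt/response cache (illustrative sketch).
# `call_model` is a placeholder for whatever function actually invokes the LLM.
import hashlib
from typing import Callable, Dict

_cache: Dict[str, str] = {}

def cached_completion(prompt: str, call_model: Callable[[str], str]) -> str:
    """Serve repeats of an identical prompt from the cache; call the model only on a miss."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]          # hit: no new tokens billed, no generation latency
    response = call_model(prompt)   # miss: pay for Tp prompt tokens + Tc completion tokens
    _cache[key] = response
    return response
```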
What the inputs mean
- Total requests (N): the total number of calls you expect to serve in the period you’re analyzing.
- Unique prompts (U): how many of those requests are distinct prompts that must be executed at least once. In this simplified model, each unique prompt is executed once and all remaining requests are cache hits.
- Average prompt tokens (Tp): average tokens in the input/prompt (including system + developer + user messages if applicable).
- Average completion tokens (Tc): average tokens generated by the model for a miss (i.e., when the model actually runs).
- Cost per 1K tokens ($): a blended price per 1,000 tokens for prompt + completion. If your provider charges different rates for input vs output, approximate with a weighted blend or use a separate calculator.
- Latency per token (ms): average end-to-end incremental latency per processed token. This is a simplification; real systems also have fixed overheads.
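If it helps to see the inputs as code, here is a minimal sketch of them as a data structure; the class and field names are illustrative labels of my own, with the corresponding symbols in comments.

```python
from dataclasses import dataclass

@dataclass
class CalculatorInputs:
    total_requests: int            # N
    unique_prompts: int            # U
    avg_prompt_tokens: float       # Tp
    avg_completion_tokens: float   # Tc
    cost_per_1k_tokens: float      # C1k, blended $ per 1,000 tokens (prompt + completion)
    latency_ms_per_token: float    # Lt, ms per processed token
```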
Formulas used
First compute the tokens processed per model execution:
T = Tp + Tc
Baseline (no caching):
- Token volume: Vraw = N × T
- Cost: Craw = (Vraw / 1000) × C1k
- Latency: Hraw = Vraw × Lt
With caching (each unique prompt executed once):
- Token volume: Vcache = U × T
- Cost: Ccache = (Vcache / 1000) × C1k
- Latency: Hcache = Vcache × Lt
Savings:
- Cost savings: SC = Craw − Ccache
- Latency savings: SH = Hraw − Hcache
Cache hit rate estimate (based purely on unique prompts vs total requests):
r = 1 − (U / N)
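These formulas translate directly into a short function. The sketch below is illustrative (the name estimate_caching_savings and the returned keys are my own, not part of the calculator):

```python
def estimate_caching_savings(n: int, u: int, tp: float, tc: float,
                             c1k: float, lt_ms: float) -> dict:
    """Apply the formulas above: N, U, Tp, Tc, C1k (blended $ / 1K tokens), Lt (ms/token)."""
    t = tp + tc                      # T = Tp + Tc, tokens per model execution
    v_raw = n * t                    # Vraw = N × T
    v_cache = u * t                  # Vcache = U × T
    c_raw = v_raw / 1000 * c1k       # Craw
    c_cache = v_cache / 1000 * c1k   # Ccache
    h_raw = v_raw * lt_ms            # Hraw (ms)
    h_cache = v_cache * lt_ms        # Hcache (ms)
    return {
        "hit_rate": 1 - u / n,                  # r = 1 − U/N
        "cost_raw": c_raw,
        "cost_cached": c_cache,
        "cost_savings": c_raw - c_cache,        # SC
        "latency_raw_ms": h_raw,
        "latency_cached_ms": h_cache,
        "latency_savings_ms": h_raw - h_cache,  # SH
    }
```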
Interpreting the results
- Raw vs cached cost: how much you’d spend if every request hits the model versus only unique prompts hitting the model.
- Raw vs cached latency: a rough “token-processing time” comparison. Treat this as directional unless you’ve validated the ms/token figure for your stack.
- Hit rate: higher hit rate generally means bigger savings. If U is close to N, caching won’t help much.
Worked example
Suppose:
- N = 10,000 total requests
- U = 2,000 unique prompts
- Tp = 150 prompt tokens
- Tc = 150 completion tokens
- C1k = $0.002 per 1K tokens (blended)
- Lt = 5 ms/token
Then T = 300 tokens. Baseline token volume: Vraw = 10,000 × 300 = 3,000,000 tokens. Baseline cost: (3,000,000 / 1000) × 0.002 = $6.00. Cached token volume: Vcache = 2,000 × 300 = 600,000 tokens. Cached cost: (600,000 / 1000) × 0.002 = $1.20. Estimated cost savings: $4.80.
Latency: Hraw = 3,000,000 × 5 ms = 15,000,000 ms (15,000 s). Hcache = 600,000 × 5 ms = 3,000,000 ms (3,000 s). Estimated latency savings: 12,000 s. Hit rate: r = 1 − 2,000 / 10,000 = 0.8 (80%).
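Plugging these numbers into the estimate_caching_savings sketch above reproduces the same figures (output shown in comments, with my own rounding):

```python
result = estimate_caching_savings(n=10_000, u=2_000, tp=150, tc=150,
                                  c1k=0.002, lt_ms=5)
print(f"cost: ${result['cost_raw']:.2f} -> ${result['cost_cached']:.2f} "
      f"(save ${result['cost_savings']:.2f})")
print(f"latency: {result['latency_raw_ms'] / 1000:,.0f} s -> "
      f"{result['latency_cached_ms'] / 1000:,.0f} s "
      f"(save {result['latency_savings_ms'] / 1000:,.0f} s)")
print(f"hit rate: {result['hit_rate']:.0%}")
# cost: $6.00 -> $1.20 (save $4.80)
# latency: 15,000 s -> 3,000 s (save 12,000 s)
# hit rate: 80%
```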
Baseline vs caching comparison
| Metric | No caching | With caching (unique prompts only) |
| --- | --- | --- |
| Model executions | N | U |
| Token volume | Vraw = N × (Tp + Tc) | Vcache = U × (Tp + Tc) |
| Token cost | (Vraw / 1000) × C1k | (Vcache / 1000) × C1k |
| Latency (token-based estimate) | Vraw × Lt | Vcache × Lt |
Assumptions and limitations
- Exact-match caching model: assumes every unique prompt is executed once and all repeats are served from cache. It does not model partial matches, semantic caching, or template-level caching.
- Constant tokens: uses average prompt and completion tokens; real traffic has a distribution and tokenization varies by model.
- Blended pricing: assumes a single cost per 1K tokens. Many providers price input and output differently, and may have additional charges (tools, storage, batch, etc.).
- Latency per token is simplified: real latency includes fixed overhead (network, queueing, orchestration) and may not be linear with tokens. Streaming responses also change perceived latency. See the sketch after this list for one way to fold a fixed overhead into the estimate.
- Cache overhead not included: lookup time, serialization, storage, replication, and invalidation costs are excluded; in low-latency systems these can be non-trivial.
- Freshness and personalization: caching can be inappropriate for prompts that depend on user-specific context, time-sensitive data, or evolving policies.
- Privacy/compliance: storing prompts and completions may require redaction, encryption, access controls, or an opt-out strategy.
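As a sketch of the fixed-overhead point above, one way to make the per-request latency estimate slightly more realistic is to add a constant term; the 200 ms default below is a placeholder assumption, not a measured value.

```python
def request_latency_ms(tokens: float, lt_ms: float, fixed_overhead_ms: float = 200.0) -> float:
    """Per-request latency estimate: fixed overhead (network, queueing) plus per-token time."""
    return fixed_overhead_ms + tokens * lt_ms

# Example: a 300-token miss at 5 ms/token ≈ 200 + 1,500 = 1,700 ms end to end.
```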