Why prompt caching matters
LLM applications often see repeated or near-duplicate requests: customer support macros, recurring code scaffolds, standardized tutoring questions, and product Q&A. If every request is executed from scratch, you repeatedly pay for the same prompt and completion tokens and repeatedly wait for the model to regenerate the same output. Prompt/response caching avoids this by storing a previous response (and/or intermediate model state, depending on the system) and reusing it when the same prompt arrives again.
This calculator provides a simple, transparent estimate of (1) token cost savings and (2) latency savings when repeats are served from cache instead of the model.
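To make the mechanism concrete, here is a minimal sketch of an exact-match prompt/response cache in Python. The SHA-256 key and the call_model callable are illustrative assumptions, not a specific provider API.

```python
# Minimal exact-match prompt/response cache (illustrative sketch).
# `call_model` is a placeholder for whatever function actually invokes the LLM.
import hashlib
from typing import Callable, Dict

_cache: Dict[str, str] = {}

def cached_completion(prompt: str, call_model: Callable[[str], str]) -> str:
    """Serve repeats of an identical prompt from the cache; call the model only on a miss."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]          # hit: no new tokens billed, no generation latency
    response = call_model(prompt)   # miss: pay for Tp prompt tokens + Tc completion tokens
    _cache[key] = response
    return response
```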
What the inputs mean
- Total requests (N): the total number of calls you expect to serve in the period you’re analyzing.
- Unique prompts (U): how many of those requests are distinct prompts that must be executed at least once. In this simplified model, each unique prompt is executed once and all remaining requests are cache hits.
- Average prompt tokens (Tp): average tokens in the input/prompt (including system + developer + user messages if applicable).
- Average completion tokens (Tc): average tokens generated by the model for a miss (i.e., when the model actually runs).
- Cost per 1K tokens ($): a blended price per 1,000 tokens for prompt + completion. If your provider charges different rates for input vs output, approximate with a weighted blend or use a separate calculator.
- Latency per token (ms): average end-to-end incremental latency per processed token. This is a simplification; real systems also have fixed overheads.
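If it helps to see the inputs as code, here is a minimal sketch of them as a data structure; the class and field names are illustrative labels of my own, with the corresponding symbols in comments.

```python
from dataclasses import dataclass

@dataclass
class CalculatorInputs:
    total_requests: int            # N
    unique_prompts: int            # U
    avg_prompt_tokens: float       # Tp
    avg_completion_tokens: float   # Tc
    cost_per_1k_tokens: float      # C1k, blended $ per 1,000 tokens (prompt + completion)
    latency_ms_per_token: float    # Lt, ms per processed token
```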
Formulas used
First compute the tokens processed per model execution:
T = Tp + Tc
Baseline (no caching):
- Token volume: Vraw = N × T
- Cost: Craw = (Vraw / 1000) × C1k
- Latency: Hraw = Vraw × Lt
With caching (each unique prompt executed once):
- Token volume: Vcache = U × T
- Cost: Ccache = (Vcache / 1000) × C1k
- Latency: Hcache = Vcache × Lt
Savings:
- Cost savings: SC = Craw − Ccache
- Latency savings: SH = Hraw − Hcache
Cache hit rate estimate (based purely on unique prompts vs total requests):
r = 1 − (U / N)
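These formulas translate directly into a short function. The sketch below is illustrative (the name estimate_caching_savings and the returned keys are my own, not part of the calculator):

```python
def estimate_caching_savings(n: int, u: int, tp: float, tc: float,
                             c1k: float, lt_ms: float) -> dict:
    """Apply the formulas above: N, U, Tp, Tc, C1k (blended $ / 1K tokens), Lt (ms/token)."""
    t = tp + tc                      # T = Tp + Tc, tokens per model execution
    v_raw = n * t                    # Vraw = N × T
    v_cache = u * t                  # Vcache = U × T
    c_raw = v_raw / 1000 * c1k       # Craw
    c_cache = v_cache / 1000 * c1k   # Ccache
    h_raw = v_raw * lt_ms            # Hraw (ms)
    h_cache = v_cache * lt_ms        # Hcache (ms)
    return {
        "hit_rate": 1 - u / n,                  # r = 1 − U/N
        "cost_raw": c_raw,
        "cost_cached": c_cache,
        "cost_savings": c_raw - c_cache,        # SC
        "latency_raw_ms": h_raw,
        "latency_cached_ms": h_cache,
        "latency_savings_ms": h_raw - h_cache,  # SH
    }
```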
Interpreting the results
- Raw vs cached cost: how much you’d spend if every request hits the model versus only unique prompts hitting the model.
- Raw vs cached latency: a rough “token-processing time” comparison. Treat this as directional unless you’ve validated the ms/token figure for your stack.
- Hit rate: higher hit rate generally means bigger savings. If U is close to N, caching won’t help much.
Worked example
Suppose:
- N = 10,000 total requests
- U = 2,000 unique prompts
- Tp = 150 prompt tokens
- Tc = 150 completion tokens
- C1k = $0.002 per 1K tokens (blended)
- Lt = 5 ms/token
Then T = 300 tokens. Baseline token volume: Vraw = 10,000 × 300 = 3,000,000 tokens. Baseline cost: (3,000,000 / 1000) × 0.002 = $6.00. Cached token volume: Vcache = 2,000 × 300 = 600,000 tokens. Cached cost: (600,000 / 1000) × 0.002 = $1.20. Estimated cost savings: $4.80.
Latency: Hraw = 3,000,000 × 5 ms = 15,000,000 ms (15,000 s). Hcache = 600,000 × 5 ms = 3,000,000 ms (3,000 s). Estimated latency savings: 12,000 s. Hit rate: r = 1 − 2,000 / 10,000 = 0.8 (80%).
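Plugging these numbers into the estimate_caching_savings sketch above reproduces the same figures (output shown in comments, with my own rounding):

```python
result = estimate_caching_savings(n=10_000, u=2_000, tp=150, tc=150,
                                  c1k=0.002, lt_ms=5)
print(f"cost: ${result['cost_raw']:.2f} -> ${result['cost_cached']:.2f} "
      f"(save ${result['cost_savings']:.2f})")
print(f"latency: {result['latency_raw_ms'] / 1000:,.0f} s -> "
      f"{result['latency_cached_ms'] / 1000:,.0f} s "
      f"(save {result['latency_savings_ms'] / 1000:,.0f} s)")
print(f"hit rate: {result['hit_rate']:.0%}")
# cost: $6.00 -> $1.20 (save $4.80)
# latency: 15,000 s -> 3,000 s (save 12,000 s)
# hit rate: 80%
```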
Baseline vs caching comparison
| Metric | No caching | With caching (unique prompts only) |
| --- | --- | --- |
| Model executions | N | U |
| Token volume | Vraw = N × (Tp + Tc) | Vcache = U × (Tp + Tc) |
| Token cost | (Vraw / 1000) × C1k | (Vcache / 1000) × C1k |
| Latency (token-based estimate) | Vraw × Lt | Vcache × Lt |
Assumptions and limitations
- Exact-match caching model: assumes every unique prompt is executed once and all repeats are served from cache. It does not model partial matches, semantic caching, or template-level caching.
- Constant tokens: uses average prompt and completion tokens; real traffic has a distribution and tokenization varies by model.
- Blended pricing: assumes a single cost per 1K tokens. Many providers price input and output differently, and may have additional charges (tools, storage, batch, etc.).
- Latency per token is simplified: real latency includes fixed overhead (network, queueing, orchestration) and may not be linear with tokens. Streaming responses also change perceived latency. See the sketch after this list for one way to fold a fixed overhead into the estimate.
- Cache overhead not included: lookup time, serialization, storage, replication, and invalidation costs are excluded; in low-latency systems these can be non-trivial.
- Freshness and personalization: caching can be inappropriate for prompts that depend on user-specific context, time-sensitive data, or evolving policies.
- Privacy/compliance: storing prompts and completions may require redaction, encryption, access controls, or an opt-out strategy.
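As a sketch of the fixed-overhead point above, one way to make the per-request latency estimate slightly more realistic is to add a constant term; the 200 ms default below is a placeholder assumption, not a measured value.

```python
def request_latency_ms(tokens: float, lt_ms: float, fixed_overhead_ms: float = 200.0) -> float:
    """Per-request latency estimate: fixed overhead (network, queueing) plus per-token time."""
    return fixed_overhead_ms + tokens * lt_ms

# Example: a 300-token miss at 5 ms/token ≈ 200 + 1,500 = 1,700 ms end to end.
```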