Retrieval-augmented generation (RAG) turns a single user request into a small pipeline: (1) retrieve relevant documents/chunks from a vector database, (2) build an LLM prompt that includes your system instructions, chat history, and retrieved context, and (3) generate the final answer with the model. This calculator estimates cost and latency for that pipeline using the inputs you provide, and compares outcomes across caching scenarios.
It is designed for planning and “what-if” analysis: budgeting monthly spend, estimating the ROI of improving cache hit rate, choosing between models or vector stores, and understanding which knob (tokens, fan-out, cache hit rate, retrieval latency) is driving your p50-style response time and your unit economics.
- Monthly queries (Q): total queries served per month.
- Prompt tokens per query (P): tokens sent to the model per query (instructions + history + retrieved context). If your provider charges different prompt/completion rates, this calculator assumes a blended rate (see limitations).
- Completion tokens per query (R): tokens generated by the model per query.
- Blended model price (M): price per 1,000 tokens. The calculator treats P + R as billable tokens at this single rate.
- Retrieval fan-out (F): how many documents/chunks you fetch per query. Each “document” here means one vector read operation (or equivalent) for pricing purposes.
- Cache hit rate (h): fraction of queries served from cache. Enter as a decimal: 0.35 = 35%.
- Retrieval latency (L_r): average time to perform one retrieval. Used linearly with fan-out as F × L_r.
- LLM latency (L_llm): average model latency excluding retrieval (think: prompt processing + generation time under typical load).
- Vector read price (V): cost per 1,000 retrieval operations.
- Embedding price (p): price per 1,000 tokens to embed documents during refresh/reindex.
- Corpus size (E): total documents/chunks stored.
- Tokens per document (T): approximate tokens per document/chunk that you embed.
- Monthly refresh rate (D%): what fraction of the index is re-embedded per month (0–100). In formulas this becomes D = D% / 100.

The total monthly cost is modeled as the sum of: (a) LLM inference (prompt + completion), (b) vector reads for retrieval, and (c) embedding refresh cost amortized across your monthly query volume. The calculator then reports per-query values by dividing by Q where appropriate.
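For reference, here is a minimal sketch of these inputs as a Python dataclass. The class and field names are illustrative (not the calculator's own), and the defaults shown are the ones used in the worked example later on this page.

```python
from dataclasses import dataclass

@dataclass
class RagCalcInputs:
    """Inputs to the cost/latency model; names are illustrative, not the calculator's own."""
    Q: int        # monthly queries
    P: int        # prompt tokens per query (instructions + history + retrieved context)
    R: int        # completion tokens per query
    M: float      # blended LLM price per 1,000 tokens ($)
    F: int        # retrieval fan-out (vector reads per query)
    h: float      # cache hit rate as a decimal (0.35 = 35%)
    L_r: float    # latency per retrieval (ms)
    L_llm: float  # base LLM latency excluding retrieval (ms)
    V: float      # vector read price per 1,000 retrievals ($)
    p: float      # embedding price per 1,000 tokens ($)
    E: int        # documents/chunks stored
    T: int        # average tokens embedded per document/chunk
    D_pct: float  # share of the index re-embedded per month (0-100)

# Defaults from the worked example below:
DEFAULTS = RagCalcInputs(Q=100_000, P=800, R=600, M=0.003, F=8, h=0.35,
                         L_r=45, L_llm=700, V=0.15, p=0.0001,
                         E=250_000, T=750, D_pct=20)
```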
Billable tokens per query are approximated as P + R. With a single blended price per 1K tokens (M):
LLM cost per query:
C_llm = ((P + R) / 1000) × M
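A small code sketch of this formula (function name is illustrative), checked against the worked example below:

```python
def llm_cost_per_query(P: int, R: int, M: float) -> float:
    """C_llm: billable tokens (P + R) at a single blended price M per 1,000 tokens."""
    return ((P + R) / 1000) * M

# Worked-example defaults: 1,400 billable tokens at $0.003/1K -> $0.0042 per query
assert abs(llm_cost_per_query(800, 600, 0.003) - 0.0042) < 1e-9
```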
Each query triggers F retrievals. If your vector store charges V dollars per 1,000 retrievals:
Vector cost per query:
C_vector = (F × V) / 1000
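The same formula as a code sketch (illustrative naming):

```python
def vector_cost_per_query(F: int, V: float) -> float:
    """C_vector: F vector reads per query at V dollars per 1,000 reads."""
    return (F * V) / 1000

# Worked-example defaults: 8 reads at $0.15/1K reads -> $0.0012 per query
assert abs(vector_cost_per_query(8, 0.15) - 0.0012) < 1e-9
```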
RAG systems often pay a recurring “silent” cost to keep the index fresh: re-embedding new/changed content. If you have E documents, each averaging T tokens to embed, and you refresh a fraction D of them per month, then:
Tokens re-embedded per month = E × T × D
Monthly embedding cost = (E × T × D × p) / 1000

To compare embedding upkeep to serving, the calculator amortizes the monthly embedding cost over monthly queries Q:
Embedding refresh cost per query:
C_embed = (E × T × D × p) / (1000 × Q)
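As a sketch, the amortization step in code (names illustrative):

```python
def embed_cost_per_query(E: int, T: int, D_pct: float, p: float, Q: int) -> float:
    """C_embed: monthly re-embedding cost (E × T × D tokens at p per 1K) amortized over Q queries."""
    D = D_pct / 100
    return (E * T * D * p) / (1000 * Q)

# Worked-example defaults: $3.75/month of refresh spread over 100,000 queries
assert abs(embed_cost_per_query(250_000, 750, 20, 0.0001, 100_000) - 0.0000375) < 1e-9
```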
This calculator treats a cache hit as avoiding both retrieval and LLM inference for the cached portion of queries (i.e., a hit returns a stored answer). Under that simplifying assumption:
C_cold = C_llm + C_vector + C_embed
C_warm = (1 − h) × (C_llm + C_vector) + C_embed

Note that embedding refresh is typically paid regardless of cache performance, so it is not reduced by h in this model.
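In code, the cold/warm split reduces to one function with the hit rate as a parameter (a sketch under the same simplifying assumption that a hit skips both retrieval and inference):

```python
def cost_per_query(c_llm: float, c_vector: float, c_embed: float, h: float = 0.0) -> float:
    """h = 0 gives C_cold; a nonzero hit rate gives C_warm.
    Embedding refresh (c_embed) is paid regardless of caching in this model."""
    return (1 - h) * (c_llm + c_vector) + c_embed

# Worked-example components: C_llm = $0.0042, C_vector = $0.0012, C_embed = $0.0000375
print(cost_per_query(0.0042, 0.0012, 0.0000375))          # cold: ~$0.0054375
print(cost_per_query(0.0042, 0.0012, 0.0000375, h=0.35))  # warm: ~$0.0035475
```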
Latency is modeled as retrieval time plus base model time. Retrieval is approximated as fan-out times per-retrieval latency (linear fan-out):
L_cold = L_llm + (F × L_r)
L_warm = (1 − h) × L_cold (cache hits are treated as ~0 ms of incremental compute in this simplified view)

In practice, cached responses still incur some overhead (routing, cache lookup, network). If you want that reflected, subtract less than the full h benefit by using a lower effective hit rate.
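The latency side looks the same in sketch form, with the caveat from the note above that real cache hits are not literally free:

```python
def latency_ms(L_llm: float, F: int, L_r: float, h: float = 0.0) -> float:
    """L_cold = L_llm + F × L_r; L_warm scales it by (1 − h), treating hits as ~0 ms."""
    cold = L_llm + F * L_r
    return (1 - h) * cold

print(latency_ms(700, 8, 45))          # cold: 1060.0 ms
print(latency_ms(700, 8, 45, h=0.35))  # warm: ~689 ms
```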
Defaults: Q=100,000, P=800, R=600, M=$0.003, F=8, h=0.35, L_r=45ms, L_llm=700ms, V=$0.15, p=$0.0001, E=250,000, T=750, D%=20 (so D=0.20).
- LLM cost per query: ((800 + 600) / 1000) × 0.003 = 1.4 × 0.003 = $0.0042
- Vector cost per query: (8 × 0.15) / 1000 = $0.0012
- Tokens re-embedded per month: 250,000 × 750 × 0.20 = 37,500,000; monthly embedding cost = 37,500,000 / 1000 × 0.0001 = $3.75
- Embedding cost per query: $3.75 / 100,000 = $0.0000375
- Cold cost per query ≈ 0.0042 + 0.0012 + 0.0000375 = $0.0054375 (~$0.0054), which implies monthly cold serving ≈ $543.75.
Warm cost/query (35% hits) ≈ (1−0.35)×(0.0042+0.0012) + 0.0000375 ≈ 0.65×0.0054 + 0.0000375 ≈ $0.0035475 (~$0.00355). Monthly warm serving ≈ $354.75.
Cold latency ≈ 700 + 8×45 = 1060ms. Warm latency under the simplified hit model ≈ 0.65×1060 = 689ms. Interpretation: with these parameters, latency is dominated by base LLM time, while cost is split between the model and retrieval reads; embeddings are small per query but can matter if query volume is low or refresh rates are high.
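Putting the pieces together, this short self-contained script (a sketch of the model, not the calculator's actual source) reproduces the numbers above:

```python
# Worked-example defaults
Q, P, R, M = 100_000, 800, 600, 0.003
F, h, L_r, L_llm = 8, 0.35, 45, 700
V, p, E, T, D = 0.15, 0.0001, 250_000, 750, 0.20

c_llm = ((P + R) / 1000) * M               # $0.0042
c_vec = (F * V) / 1000                     # $0.0012
c_emb = (E * T * D * p) / (1000 * Q)       # $0.0000375

cold = c_llm + c_vec + c_emb               # ~$0.0054375 per query
warm = (1 - h) * (c_llm + c_vec) + c_emb   # ~$0.0035475 per query

print(f"cold: ${cold:.7f}/query  ${cold * Q:,.2f}/month")   # ~$543.75/month
print(f"warm: ${warm:.7f}/query  ${warm * Q:,.2f}/month")   # ~$354.75/month
print(f"latency: cold {L_llm + F * L_r:.0f} ms, warm {(1 - h) * (L_llm + F * L_r):.0f} ms")
```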
- Token volume (P and R) is usually the biggest cost lever: large retrieved context or long completions usually dominate.
- To keep retrieval cost down, reduce F, improve filtering, or add a reranker to keep fan-out small while preserving quality.
- Embedding refresh matters most when you have (a) low Q (so fixed refresh is spread over few queries) or (b) a high refresh percentage and a large corpus.
- For latency, determine whether you are L_llm-dominated (model choice, output length, throughput limits) or retrieval-dominated (F and L_r).

| Lever | What you change | Primary impact | Secondary effects / notes |
|---|---|---|---|
| Increase cache hit rate (h) | Better cache keys, longer TTLs, semantic caching | Reduces serving cost and latency on cached queries | In this model, embeddings are not reduced by caching |
| Reduce prompt tokens (P) | Smaller context, better chunking, tighter instructions | Strong cost reduction; often latency reduction | May impact answer quality if context becomes insufficient |
| Reduce fan-out (F) | Tighter filtering, reranking, better embeddings | Lower vector cost and retrieval latency | Also can reduce prompt tokens if fewer passages are included |
| Lower LLM latency (L_llm) | Faster model, lower max tokens, higher throughput | Direct latency improvement | May change price per 1K tokens (M) and output quality |
| Lower refresh rate (D%) | Refresh only changed docs, incremental pipelines | Lowers embedding overhead | Risk of stale answers if freshness requirements are strict |
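To illustrate the first lever, this sketch sweeps the cache hit rate h with the worked-example defaults while holding everything else fixed (same assumptions as above: hits skip retrieval and inference; embedding refresh is unaffected):

```python
# Per-query components and cold latency from the worked example
c_llm, c_vec, c_emb = 0.0042, 0.0012, 0.0000375
cold_latency_ms = 700 + 8 * 45  # L_llm + F × L_r = 1060 ms

for h in (0.0, 0.2, 0.35, 0.5, 0.8):
    cost = (1 - h) * (c_llm + c_vec) + c_emb
    latency = (1 - h) * cold_latency_ms
    print(f"h={h:.2f}  cost/query=${cost:.6f}  latency~{latency:.0f} ms")
```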
- The calculator assumes a single blended token price (M) for both prompt and completion tokens.
- Retrieval latency is modeled linearly as F × L_r. Some systems parallelize retrievals, making latency closer to max(L_r) plus overhead.

Use your provider’s current published pricing for M and p, your vector database pricing for V, and measured production timings for L_r and L_llm. The most reliable way to calibrate is to sample real traces and plug in p50 and p95 as separate runs.
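One way to apply that advice is to run the latency formula once with p50 timings and once with p95 timings; the values below are placeholders to replace with your own trace measurements:

```python
F = 8  # fan-out
scenarios = {
    "p50": {"L_r": 45, "L_llm": 700},    # placeholder: median timings from your traces
    "p95": {"L_r": 120, "L_llm": 1800},  # placeholder: tail timings from your traces
}
for name, t in scenarios.items():
    print(f"{name}: {t['L_llm'] + F * t['L_r']} ms (no cache hit)")
```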