Retrieval-augmented generation (RAG) turns a single user request into a small pipeline: (1) retrieve relevant documents/chunks from a vector database, (2) build an LLM prompt that includes your system instructions, chat history, and retrieved context, and (3) generate the final answer with the model. This calculator estimates cost and latency for that pipeline using the inputs you provide, and compares outcomes across caching scenarios.
It is designed for planning and “what-if” analysis: budgeting monthly spend, estimating the ROI of improving cache hit rate, choosing between models or vector stores, and understanding which knob (tokens, fan-out, cache hit rate, retrieval latency) is driving your p50-style response time and your unit economics.
- Monthly queries (Q): total queries served per month.
- Prompt tokens per query (P): tokens sent to the model per query (instructions + history + retrieved context). If your provider charges different prompt/completion rates, this calculator assumes a blended rate (see limitations).
- Completion tokens per query (R): tokens generated by the model per query.
- Blended model price (M): price per 1,000 tokens. The calculator treats P + R as billable tokens at this single rate.
- Retrieval fan-out (F): how many documents/chunks you fetch per query. Each “document” here means one vector read operation (or equivalent) for pricing purposes.
- Cache hit rate (h): fraction of queries served from cache. Enter as a decimal: 0.35 = 35%.
- Retrieval latency (L_r): average time to perform one retrieval. Used linearly with fan-out as F × L_r.
- LLM latency (L_llm): average model latency excluding retrieval (think: prompt processing + generation time under typical load).
- Vector read price (V): cost per 1,000 retrieval operations.
- Embedding price (p): price per 1,000 tokens to embed documents during refresh/reindex.
- Corpus size (E): total documents/chunks stored.
- Tokens per document (T): approximate tokens per document/chunk that you embed.
- Monthly refresh rate (D%): what fraction of the index is re-embedded per month (0–100). In formulas this becomes D = D% / 100.

The total monthly cost is modeled as the sum of: (a) LLM inference (prompt + completion), (b) vector reads for retrieval, and (c) embedding refresh cost amortized across your monthly query volume. The calculator then reports per-query values by dividing by Q where appropriate.
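For reference, here is a minimal sketch of these inputs as a Python dataclass. The class and field names are illustrative (not the calculator's own), and the defaults shown are the ones used in the worked example later on this page.

```python
from dataclasses import dataclass

@dataclass
class RagCalcInputs:
    """Inputs to the cost/latency model; names are illustrative, not the calculator's own."""
    Q: int        # monthly queries
    P: int        # prompt tokens per query (instructions + history + retrieved context)
    R: int        # completion tokens per query
    M: float      # blended LLM price per 1,000 tokens ($)
    F: int        # retrieval fan-out (vector reads per query)
    h: float      # cache hit rate as a decimal (0.35 = 35%)
    L_r: float    # latency per retrieval (ms)
    L_llm: float  # base LLM latency excluding retrieval (ms)
    V: float      # vector read price per 1,000 retrievals ($)
    p: float      # embedding price per 1,000 tokens ($)
    E: int        # documents/chunks stored
    T: int        # average tokens embedded per document/chunk
    D_pct: float  # share of the index re-embedded per month (0-100)

# Defaults from the worked example below:
DEFAULTS = RagCalcInputs(Q=100_000, P=800, R=600, M=0.003, F=8, h=0.35,
                         L_r=45, L_llm=700, V=0.15, p=0.0001,
                         E=250_000, T=750, D_pct=20)
```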
Billable tokens per query are approximated as P + R. With a single blended price per 1K tokens (M):
LLM cost per query:
C_llm = ((P + R) / 1000) × M
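A small code sketch of this formula (function name is illustrative), checked against the worked example below:

```python
def llm_cost_per_query(P: int, R: int, M: float) -> float:
    """C_llm: billable tokens (P + R) at a single blended price M per 1,000 tokens."""
    return ((P + R) / 1000) * M

# Worked-example defaults: 1,400 billable tokens at $0.003/1K -> $0.0042 per query
assert abs(llm_cost_per_query(800, 600, 0.003) - 0.0042) < 1e-9
```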
Each query triggers F retrievals. If your vector store charges V dollars per 1,000 retrievals:
Vector cost per query:
C_vector = (F × V) / 1000
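The same formula as a code sketch (illustrative naming):

```python
def vector_cost_per_query(F: int, V: float) -> float:
    """C_vector: F vector reads per query at V dollars per 1,000 reads."""
    return (F * V) / 1000

# Worked-example defaults: 8 reads at $0.15/1K reads -> $0.0012 per query
assert abs(vector_cost_per_query(8, 0.15) - 0.0012) < 1e-9
```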
RAG systems often pay a recurring “silent” cost to keep the index fresh: re-embedding new/changed content. If you have E documents, each averaging T tokens to embed, and you refresh a fraction D of them per month, then:
Tokens re-embedded per month = E × T × D
Monthly embedding cost = (E × T × D × p) / 1000

To compare embedding upkeep to serving, the calculator amortizes the monthly embedding cost over monthly queries Q:
Embedding refresh cost per query:
C_embed = (E × T × D × p) / (1000 × Q)
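As a sketch, the amortization step in code (names illustrative):

```python
def embed_cost_per_query(E: int, T: int, D_pct: float, p: float, Q: int) -> float:
    """C_embed: monthly re-embedding cost (E × T × D tokens at p per 1K) amortized over Q queries."""
    D = D_pct / 100
    return (E * T * D * p) / (1000 * Q)

# Worked-example defaults: $3.75/month of refresh spread over 100,000 queries
assert abs(embed_cost_per_query(250_000, 750, 20, 0.0001, 100_000) - 0.0000375) < 1e-9
```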
This calculator treats a cache hit as avoiding both retrieval and LLM inference for the cached portion of queries (i.e., a hit returns a stored answer). Under that simplifying assumption:
C_cold = C_llm + C_vector + C_embed
C_warm = (1 − h) × (C_llm + C_vector) + C_embed

Note that embedding refresh is typically paid regardless of cache performance, so it is not reduced by h in this model.
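In code, the cold/warm split reduces to one function with the hit rate as a parameter (a sketch under the same simplifying assumption that a hit skips both retrieval and inference):

```python
def cost_per_query(c_llm: float, c_vector: float, c_embed: float, h: float = 0.0) -> float:
    """h = 0 gives C_cold; a nonzero hit rate gives C_warm.
    Embedding refresh (c_embed) is paid regardless of caching in this model."""
    return (1 - h) * (c_llm + c_vector) + c_embed

# Worked-example components: C_llm = $0.0042, C_vector = $0.0012, C_embed = $0.0000375
print(cost_per_query(0.0042, 0.0012, 0.0000375))          # cold: ~$0.0054375
print(cost_per_query(0.0042, 0.0012, 0.0000375, h=0.35))  # warm: ~$0.0035475
```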
Latency is modeled as retrieval time plus base model time. Retrieval is approximated as fan-out times per-retrieval latency (linear fan-out):
L_cold = L_llm + (F × L_r)
L_warm = (1 − h) × L_cold (cache hits are treated as ~0 ms of incremental compute in this simplified view)

In practice, cached responses still incur some overhead (routing, cache lookup, network). If you want that reflected, subtract less than the full h benefit by using a lower effective hit rate.
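The latency side looks the same in sketch form, with the caveat from the note above that real cache hits are not literally free:

```python
def latency_ms(L_llm: float, F: int, L_r: float, h: float = 0.0) -> float:
    """L_cold = L_llm + F × L_r; L_warm scales it by (1 − h), treating hits as ~0 ms."""
    cold = L_llm + F * L_r
    return (1 - h) * cold

print(latency_ms(700, 8, 45))          # cold: 1060.0 ms
print(latency_ms(700, 8, 45, h=0.35))  # warm: ~689 ms
```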
Defaults: Q=100,000, P=800, R=600, M=$0.003, F=8, h=0.35, L_r=45ms, L_llm=700ms, V=$0.15, p=$0.0001, E=250,000, T=750, D%=20 (so D=0.20).
- LLM cost per query: ((800 + 600) / 1000) × 0.003 = 1.4 × 0.003 = $0.0042
- Vector cost per query: (8 × 0.15) / 1000 = $0.0012
- Tokens re-embedded per month: 250,000 × 750 × 0.20 = 37,500,000; monthly embedding cost = 37,500,000 / 1000 × 0.0001 = $3.75
- Embedding cost per query: $3.75 / 100,000 = $0.0000375
- Cold cost per query ≈ 0.0042 + 0.0012 + 0.0000375 = $0.0054375 (~$0.0054), which implies monthly cold serving ≈ $543.75.
Warm cost/query (35% hits) ≈ (1−0.35)×(0.0042+0.0012) + 0.0000375 ≈ 0.65×0.0054 + 0.0000375 ≈ $0.0035475 (~$0.00355). Monthly warm serving ≈ $354.75.
Cold latency ≈ 700 + 8×45 = 1060ms. Warm latency under the simplified hit model ≈ 0.65×1060 = 689ms. Interpretation: with these parameters, latency is dominated by base LLM time, while cost is split between the model and retrieval reads; embeddings are small per query but can matter if query volume is low or refresh rates are high.
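Putting the pieces together, this short self-contained script (a sketch of the model, not the calculator's actual source) reproduces the numbers above:

```python
# Worked-example defaults
Q, P, R, M = 100_000, 800, 600, 0.003
F, h, L_r, L_llm = 8, 0.35, 45, 700
V, p, E, T, D = 0.15, 0.0001, 250_000, 750, 0.20

c_llm = ((P + R) / 1000) * M               # $0.0042
c_vec = (F * V) / 1000                     # $0.0012
c_emb = (E * T * D * p) / (1000 * Q)       # $0.0000375

cold = c_llm + c_vec + c_emb               # ~$0.0054375 per query
warm = (1 - h) * (c_llm + c_vec) + c_emb   # ~$0.0035475 per query

print(f"cold: ${cold:.7f}/query  ${cold * Q:,.2f}/month")   # ~$543.75/month
print(f"warm: ${warm:.7f}/query  ${warm * Q:,.2f}/month")   # ~$354.75/month
print(f"latency: cold {L_llm + F * L_r:.0f} ms, warm {(1 - h) * (L_llm + F * L_r):.0f} ms")
```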
- Token volume (P and R) is usually the biggest cost lever: large retrieved context or long completions usually dominate.
- To keep retrieval cost down, reduce F, improve filtering, or add a reranker to keep fan-out small while preserving quality.
- Embedding refresh matters most when you have (a) low Q (so fixed refresh is spread over few queries) or (b) a high refresh percentage and a large corpus.
- For latency, determine whether you are L_llm-dominated (model choice, output length, throughput limits) or retrieval-dominated (F and L_r).

| Lever | What you change | Primary impact | Secondary effects / notes |
|---|---|---|---|
| Increase cache hit rate (h) | Better cache keys, longer TTLs, semantic caching | Reduces serving cost and latency on cached queries | In this model, embeddings are not reduced by caching |
| Reduce prompt tokens (P) | Smaller context, better chunking, tighter instructions | Strong cost reduction; often latency reduction | May impact answer quality if context becomes insufficient |
| Reduce fan-out (F) | Tighter filtering, reranking, better embeddings | Lower vector cost and retrieval latency | Also can reduce prompt tokens if fewer passages are included |
| Lower LLM latency (L_llm) | Faster model, lower max tokens, higher throughput | Direct latency improvement | May change price per 1K tokens (M) and output quality |
| Lower refresh rate (D%) | Refresh only changed docs, incremental pipelines | Lowers embedding overhead | Risk of stale answers if freshness requirements are strict |
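To illustrate the first lever, this sketch sweeps the cache hit rate h with the worked-example defaults while holding everything else fixed (same assumptions as above: hits skip retrieval and inference; embedding refresh is unaffected):

```python
# Per-query components and cold latency from the worked example
c_llm, c_vec, c_emb = 0.0042, 0.0012, 0.0000375
cold_latency_ms = 700 + 8 * 45  # L_llm + F × L_r = 1060 ms

for h in (0.0, 0.2, 0.35, 0.5, 0.8):
    cost = (1 - h) * (c_llm + c_vec) + c_emb
    latency = (1 - h) * cold_latency_ms
    print(f"h={h:.2f}  cost/query=${cost:.6f}  latency~{latency:.0f} ms")
```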
- The calculator assumes a single blended token price (M) for both prompt and completion tokens.
- Retrieval latency is modeled linearly as F × L_r. Some systems parallelize retrievals, making latency closer to max(L_r) plus overhead.

Use your provider’s current published pricing for M and p, your vector database pricing for V, and measured production timings for L_r and L_llm. The most reliable way to calibrate is to sample real traces and plug in p50 and p95 as separate runs.
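One way to apply that advice is to run the latency formula once with p50 timings and once with p95 timings; the values below are placeholders to replace with your own trace measurements:

```python
F = 8  # fan-out
scenarios = {
    "p50": {"L_r": 45, "L_llm": 700},    # placeholder: median timings from your traces
    "p95": {"L_r": 120, "L_llm": 1800},  # placeholder: tail timings from your traces
}
for name, t in scenarios.items():
    print(f"{name}: {t['L_llm'] + F * t['L_r']} ms (no cache hit)")
```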