RAG Query Cost and Latency Calculator

JJ Ben-Joseph

What this calculator estimates

Retrieval-augmented generation (RAG) turns a single user request into a small pipeline: (1) retrieve relevant documents/chunks from a vector database, (2) build an LLM prompt that includes your system instructions, chat history, and retrieved context, and (3) generate the final answer with the model. This calculator estimates cost and latency for that pipeline using the inputs you provide, and compares outcomes across caching scenarios.

It is designed for planning and “what-if” analysis: budgeting monthly spend, estimating the ROI of improving cache hit rate, choosing between models or vector stores, and understanding which knob (tokens, fan-out, cache hit rate, retrieval latency) is driving your p50-style response time and your unit economics.

Inputs explained (units and how they’re used)

Q: monthly query volume, used to turn per-query costs into monthly totals and to amortize embedding refresh.
P: prompt tokens per query (system instructions, chat history, and retrieved context).
R: completion (response) tokens per query.
M: blended LLM price, in dollars per 1,000 tokens, applied to P + R.
F: retrieval fan-out, the number of vector reads per query.
h: cache hit rate, the fraction of queries (0 to 1) served from cache.
L_r: latency per retrieval, in milliseconds.
L_llm: base LLM generation latency, in milliseconds.
V: vector store price, in dollars per 1,000 retrievals.
p: embedding price, in dollars per 1,000 tokens.
E: number of documents (or chunks) kept in the index.
T: average tokens per document to embed.
D%: percentage of documents re-embedded per month, used as the fraction D = D% / 100.

Cost model

The total monthly cost is modeled as the sum of: (a) LLM inference (prompt + completion), (b) vector reads for retrieval, and (c) embedding refresh cost amortized across your monthly query volume. The calculator then reports per-query values by dividing by Q where appropriate.

1) LLM cost

Billable tokens per query are approximated as P + R. With a single blended price per 1K tokens (M):

LLM cost per query:

C_llm = ((P + R) × M) / 1000
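
A minimal sketch of this formula in Python (variable names mirror the symbols above and are illustrative, not taken from the calculator's implementation):

    def llm_cost_per_query(P, R, M):
        # P prompt tokens plus R completion tokens, billed at a blended M dollars per 1K tokens
        return (P + R) * M / 1000.0

    # e.g. P=800, R=600, M=0.003 -> $0.0042 per query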

2) Vector read cost

Each query triggers F retrievals. If your vector store charges V dollars per 1,000 retrievals:

Vector cost per query:

C_vector = (F × V) / 1000
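
The same idea as a sketch, assuming V is already expressed in dollars per 1,000 retrievals:

    def vector_cost_per_query(F, V):
        # F vector reads per query at V dollars per 1,000 retrievals
        return F * V / 1000.0

    # e.g. F=8, V=0.15 -> $0.0012 per query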

3) Embedding refresh amortization

RAG systems often pay a recurring “silent” cost to keep the index fresh: re-embedding new/changed content. If you have E documents, each averaging T tokens to embed, and you refresh a fraction D of them per month, then:

Monthly embedding cost = (E × T × D × p) / 1000, where p is the embedding price in dollars per 1,000 tokens.

To compare embedding upkeep to serving, the calculator amortizes the monthly embedding cost over monthly queries Q:

Embedding refresh cost per query:

C_embed = (E × T × D × p) / (1000 × Q)
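
A corresponding sketch, with the refresh rate D given as a fraction (0.20 for 20%):

    def embed_cost_per_query(E, T, D, p, Q):
        # re-embed E docs x T tokens x D refreshed fraction each month at p dollars
        # per 1K embedding tokens, then amortize over Q monthly queries
        return (E * T * D * p) / (1000.0 * Q)

    # e.g. E=250_000, T=750, D=0.20, p=0.0001, Q=100_000 -> $0.0000375 per query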

Cold-cache vs warm-cache cost

This calculator treats a cache hit as avoiding both retrieval and LLM inference for the cached portion of queries (i.e., a hit returns a stored answer). Under that simplifying assumption:

Cold cost per query: C_cold = C_llm + C_vector + C_embed

Warm cost per query: C_warm = (1 − h) × (C_llm + C_vector) + C_embed

Note that embedding refresh is typically paid regardless of cache performance, so it is not reduced by h in this model.
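
Combining the component sketches above, per-query cost under this assumption looks like the following (h = 0 reproduces the cold-cache case):

    def cost_per_query(P, R, M, F, V, E, T, D, p, Q, h=0.0):
        # only the (1 - h) cache-miss fraction pays for LLM inference and retrieval;
        # embedding refresh is amortized over all queries regardless of caching
        serving = llm_cost_per_query(P, R, M) + vector_cost_per_query(F, V)
        return (1.0 - h) * serving + embed_cost_per_query(E, T, D, p, Q)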

Latency model

Latency is modeled as retrieval time plus base model time. Retrieval is approximated as fan-out times per-retrieval latency (linear fan-out):

Cold latency per query: T_cold = L_llm + F × L_r

Under the simplified hit model, a cache hit returns effectively instantly, so the expected (warm) latency is T_warm = (1 − h) × T_cold.

In practice, cached responses still incur some overhead (routing, cache lookup, network). If you want that reflected, subtract less than the full h benefit by using a lower effective hit rate.
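
A sketch of the latency side, under the same simplified assumption that a hit is effectively instantaneous:

    def latency_ms(L_llm, F, L_r, h=0.0):
        cold = L_llm + F * L_r       # base model time plus linear retrieval fan-out
        return (1.0 - h) * cold      # expected latency once cache hits are factored in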

Worked example (using the default inputs)

Defaults: Q=100,000, P=800, R=600, M=$0.003, F=8, h=0.35, L_r=45ms, L_llm=700ms, V=$0.15, p=$0.0001, E=250,000, T=750, D%=20 (so D=0.20).

Cold cost/query ≈ 0.0042 + 0.0012 + 0.0000375 = $0.0054375 (~$0.0054). That implies a monthly cold serving cost of about $543.75.

Warm cost/query (35% hits) ≈ (1−0.35)×(0.0042+0.0012) + 0.0000375 = 0.65×0.0054 + 0.0000375 = $0.0035475 (~$0.00355). Monthly warm serving ≈ $354.75.

Cold latency ≈ 700 + 8×45 = 1060 ms. Warm latency under the simplified hit model ≈ 0.65×1060 = 689 ms. Interpretation: with these parameters, latency is dominated by base LLM time, while cost is split between the model and retrieval reads; embeddings are small per query but can matter if query volume is low or refresh rates are high.
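
The same numbers fall out of the sketch functions above:

    cfg = dict(P=800, R=600, M=0.003, F=8, V=0.15,
               E=250_000, T=750, D=0.20, p=0.0001, Q=100_000)

    cold_cost = cost_per_query(**cfg)          # ~$0.0054375/query -> ~$543.75/month
    warm_cost = cost_per_query(**cfg, h=0.35)  # ~$0.0035475/query -> ~$354.75/month
    cold_ms = latency_ms(700, 8, 45)           # 1060 ms
    warm_ms = latency_ms(700, 8, 45, h=0.35)   # ~689 ms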

How to interpret the results

The calculator reports per-query and monthly figures for both the cold-cache and warm-cache scenarios. Look at the cost breakdown to see whether LLM inference, vector reads, or embedding refresh dominates your spend, and compare cold versus warm outcomes to estimate what an improved cache hit rate is worth.

Scenario comparison (what changes when you improve caching/fan-out)

Lever | What you change | Primary impact | Secondary effects / notes
Increase cache hit rate (h) | Better cache keys, longer TTLs, semantic caching | Reduces serving cost and latency on cached queries | In this model, embeddings are not reduced by caching (see the sweep sketched below)
Reduce prompt tokens (P) | Smaller context, better chunking, tighter instructions | Strong cost reduction; often latency reduction | May impact answer quality if context becomes insufficient
Reduce fan-out (F) | Tighter filtering, reranking, better embeddings | Lower vector cost and retrieval latency | Can also reduce prompt tokens if fewer passages are included
Lower LLM latency (L_llm) | Faster model, lower max tokens, higher throughput | Direct latency improvement | May change price per 1K tokens (M) and output quality
Lower refresh rate (D%) | Refresh only changed docs, incremental pipelines | Lowers embedding overhead | Risk of stale answers if freshness requirements are strict
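
For example, to see how far the cache-hit-rate lever alone moves the default workload, you could sweep h using the sketch functions and the defaults dictionary from the worked example:

    for h in (0.0, 0.20, 0.35, 0.50, 0.70):
        c = cost_per_query(**cfg, h=h)
        t = latency_ms(700, 8, 45, h=h)
        print(f"h={h:.2f}  cost/query=${c:.5f}  monthly=${c * cfg['Q']:,.2f}  latency ~{t:.0f} ms")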

Assumptions and limitations

The model bills prompt and completion tokens at a single blended price per 1K tokens, treats a cache hit as skipping both retrieval and LLM inference entirely, assumes embedding refresh is unaffected by caching, and approximates retrieval latency as linear in fan-out (F × L_r) on top of a fixed base model time. Overheads such as routing, cache lookups, queueing, and network transfer are not modeled, and the outputs are planning-level averages based on the values you enter rather than measurements of tail behavior.

Sourcing note

Use your provider’s current published pricing for M and p, your vector database pricing for V, and measured production timings for L_r and L_llm. The most reliable way to calibrate is to sample real traces and plug in p50 and p95 as separate runs.

Enter workload characteristics to compute cost and latency outcomes for your retrieval-augmented generation pipeline.
