
RAG Cost Per Query: The Full Breakdown (2026)

By Ergini, Software & AI Developer in Pristina, Kosovo

TL;DR

Most teams under-budget RAG by 3x because they forget reranking, multi-hop, and observability. This is the full per-query cost stack from a 5-million-query-per-month deployment, with a calculator you can adapt.

The cost everyone forgets

The most common founder email I get about RAG goes something like this: 'our prototype cost $40 per month, then we launched and the bill is $4,200'. The pipeline did not change. The traffic did not even change that much. What changed is that prototype math ignored four of the six real line items, and those four were not free. Across the audits I ran in the last twelve months, teams under-budgeted their per-query RAG cost by a factor of three on average - sometimes by ten.

The reason is structural. RAG is sold as a pattern with two costs: embedding and generation. In production it has six, and the four that nobody talks about - reranker, multi-hop, observability, reindex amortization - together typically out-cost the embedding bill ten times over. This post walks through the full per-query stack at three corpus sizes, the hidden costs at each, and the patterns that actually move the bill.

The 6 line items in RAG cost

Every production RAG pipeline I have shipped lands on the same six-line cost model. Some line items disappear at small scale and some explode at large scale, but they are all there. If your cost spreadsheet has fewer than six rows, you are missing money.

| Line item | What it is | Typical share of bill |
|---|---|---|
| Embedding (query + index) | Vector creation for queries and ingested chunks | 5 – 15% |
| Vector DB | Storage, queries-per-second, network egress | 3 – 20% |
| Retrieval orchestration | Hybrid search, filters, query rewriting (LLM) | 2 – 10% |
| Reranker | Cross-encoder rerank of top-k candidates | 5 – 20% |
| Generation | LLM call with system + context + query | 50 – 80% |
| Observability + reindex amortization | Tracing, evals, periodic re-embedding | 3 – 15% |

Generation almost always wins the share-of-bill contest, but the story underneath the percentages matters. A pipeline whose generation is 80% of cost is healthier than one where vector DB is 40% - the second one is paying a fixed fee that does not scale with revenue. Read the percentages as a diagnostic, not a target.

Per-query math for a small RAG

Small RAG: 1,000 source documents, ~50,000 vectors after chunking, text-embedding-3-small for embeddings, pgvector on a $25/mo Supabase box, no reranker, GPT-5-mini for generation, top-5 retrieval, average 380-token chunks. This is the shape of every side-project chatbot and most early-stage internal tools.

| Step | Unit | Rate | Per-query cost |
|---|---|---|---|
| Embed query (~80 tokens) | 80 tokens | $0.02 / 1M | $0.0000016 |
| Vector DB query (pgvector) | 1 query | $25/mo amortized over 100K queries | $0.00025 |
| LLM input (800 sys + 1,900 ctx + 80 user) | 2,780 tokens | $0.25 / 1M | $0.000695 |
| LLM output (~320 tokens) | 320 tokens | $2.00 / 1M | $0.00064 |
| Observability (self-hosted Langfuse) | 1 trace | - | ~$0.00001 |
| Total per query | - | - | ~$0.0016 |

Roughly 625 queries per dollar. The vector DB amortization is the sleeper here - at low query volume the per-query share of the $25 monthly Supabase fee is far higher than the embedding cost. Push traffic to 1M queries per month and that line drops to $0.000025, generation rises above 95% of cost, and the bill scales linearly with traffic. The full pattern for this kind of pipeline lives in my RAG architecture tutorial.
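The small-RAG arithmetic above can be reproduced in a few lines. This is a minimal sketch that hard-codes the rates from the table; the function names are hypothetical and only there to make the amortization effect visible.

```python
# Sketch: per-query cost of the small RAG, using the rates quoted in the table.
TOKENS_PER_MILLION = 1_000_000


def small_rag_line_items(monthly_queries: int) -> dict[str, float]:
    """Per-query cost in dollars for each line item of the small pipeline."""
    return {
        "embed_query": 80 * 0.02 / TOKENS_PER_MILLION,
        # The $25 Supabase fee is fixed, so its per-query share shrinks with traffic.
        "vector_db": 25 / monthly_queries,
        "llm_input": (800 + 1_900 + 80) * 0.25 / TOKENS_PER_MILLION,
        "llm_output": 320 * 2.00 / TOKENS_PER_MILLION,
        "observability": 0.00001,
    }


def generation_share(items: dict[str, float]) -> float:
    """Fraction of the per-query cost spent on the LLM call."""
    return (items["llm_input"] + items["llm_output"]) / sum(items.values())


if __name__ == "__main__":
    for volume in (100_000, 1_000_000):
        items = small_rag_line_items(volume)
        print(f"{volume:>9,} queries/mo: ${sum(items.values()):.6f} per query, "
              f"generation share {generation_share(items):.0%}")
```

Swap in your own token counts and rates; the only line that depends on traffic is the vector DB one.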

Per-query math for a medium RAG

Medium RAG: 100,000 source documents, ~5M vectors, mixed text-embedding-3-small + Cohere Rerank 3.5, Qdrant Cloud at $280/mo, GPT-5-mini default with a 10% spillover to GPT-5 on hard questions, top-30 retrieval reranked down to top-6, 400-token chunks, 1,200-token system prompt with tool descriptions. This is the shape of a real B2B knowledge-base assistant once it has real documents.

| Step | Unit | Rate | Per-query cost |
|---|---|---|---|
| Query rewrite (GPT-5-nano) | 120 in + 30 out | $0.05 / $0.40 per 1M | $0.000018 |
| Embed query | 80 tokens | $0.02 / 1M | $0.0000016 |
| Vector DB query (Qdrant Cloud) | 1 query | $280/mo over 1M queries | $0.00028 |
| Rerank 30 candidates (Cohere 3.5) | 1 rerank call | $2.00 / 1K queries | $0.002 |
| LLM input (1,200 sys + 2,400 ctx + 80 user, 90% mini / 10% GPT-5) | 3,680 tokens blended | blended ~$0.35 / 1M | $0.00129 |
| LLM output (~420 tokens blended) | 420 tokens | blended ~$2.80 / 1M | $0.00118 |
| Observability (hosted Langfuse) | 1 trace | ~$0.0002 amortized | $0.0002 |
| Reindex amortization (quarterly partial) | 0.05% chunk churn/day | $0.02 / 1M tokens | $0.00006 |
| Total per query | - | - | ~$0.0050 |

Note what happened: the reranker jumped to 40% of per-query cost. That looks high, but it is buying meaningful recall improvement and - counterintuitively - saving generation tokens, because top-6 reranked chunks are tighter than top-30 raw vector hits. The medium-RAG bill is also where vector DB amortization starts to matter again: if traffic drops to a third, the Qdrant fixed fee per query triples. Pick a managed vector DB whose pricing curve matches your traffic shape - see the vector database comparison for the per-tier tradeoffs.
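The blended rates and the reranker share in the medium table can be checked with a short sketch. The GPT-5 rates used here ($1.25 input, $10.00 output per 1M tokens) are assumptions back-solved from the blended figures in the table, not quoted prices; confirm them against the vendor's pricing page.

```python
def blended_rate(mix: list[tuple[float, float]]) -> float:
    """Weighted average of (traffic_share, dollars_per_1M_tokens) pairs."""
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * rate for share, rate in mix)


# 90% GPT-5-mini, 10% GPT-5. The GPT-5 rates are assumed (back-solved), see above.
input_rate = blended_rate([(0.90, 0.25), (0.10, 1.25)])
output_rate = blended_rate([(0.90, 2.00), (0.10, 10.00)])

generation = (3_680 * input_rate + 420 * output_rate) / 1_000_000
rerank = 2.00 / 1_000
# Query rewrite, query embedding, vector DB, tracing, reindex (from the table).
other = 0.000018 + 0.0000016 + 0.00028 + 0.0002 + 0.00006

total = generation + rerank + other
reranker_share = rerank / total

if __name__ == "__main__":
    print(f"blended input ${input_rate:.2f}/1M, output ${output_rate:.2f}/1M")
    print(f"total ${total:.4f} per query, reranker share {reranker_share:.0%}")
```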

Per-query math for a large RAG

Large RAG: 10 million source documents, ~500M vectors, Voyage-3 embeddings (better recall, more expensive), Pinecone Standard at ~$3,200/mo for the index, Cohere Rerank 3.5, GPT-5 default with 20% multi-hop fallback to a 3-step agentic loop, top-100 retrieval reranked to top-12, 600-token chunks, 2,000-token system prompt with citations format. This is the shape of an enterprise search product or a heavy public-facing AI assistant.

| Step | Unit | Rate | Per-query cost |
|---|---|---|---|
| Query rewrite + classification | 200 in + 60 out (mini) | $0.25 / $2.00 per 1M | $0.00017 |
| Embed query (Voyage-3) | 80 tokens | $0.06 / 1M | $0.0000048 |
| Vector DB query (Pinecone Standard) | 1 query (avg over multi-hop) | $3,200/mo over 5M queries | $0.00064 |
| Rerank 100 candidates | 1 rerank call | $2.00 / 1K queries | $0.002 |
| LLM input (2,000 sys + 7,200 ctx + 80 user, blended GPT-5) | 9,280 tokens blended | blended ~$1.10 / 1M | $0.01021 |
| LLM output (~650 tokens blended) | 650 tokens | blended ~$9.00 / 1M | $0.00585 |
| Multi-hop premium (20% of queries x 2 extra rounds) | ~0.4 extra rounds avg | $0.016 per extra round | $0.0064 |
| Observability (Langfuse + eval sampling) | 1 trace + 1% evaluated | ~$0.0008 amortized | $0.0008 |
| Reindex amortization | 0.1% chunk churn/day, Voyage-3 | $0.06 / 1M tokens | $0.00072 |
| Total per query | - | - | ~$0.026 |

At 5M queries per month, this large-RAG bill is roughly $130K. The multi-hop premium alone is $32K of that - which sounds like a lot until you realize it is what is buying the answer quality on the 20% of queries that need real reasoning. The lever here is not whether multi-hop exists; it is whether the classifier deciding which queries trigger it is sharp. A bad classifier that fires multi-hop on 50% of queries raises the premium from $32K to $80K and pushes the bill to roughly $180K with no accuracy gain.
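To see how sensitive the large-RAG bill is to the classifier's trigger rate, here is a rough sketch built only from the table's figures ($0.026 per query at a 20% multi-hop share, 2 extra rounds, $0.016 per extra round). The function name and parameter names are hypothetical.

```python
def large_rag_monthly_bill(
    multi_hop_share: float,
    monthly_queries: int = 5_000_000,
    per_query_at_baseline: float = 0.026,   # table total, at a 20% multi-hop share
    baseline_share: float = 0.20,
    extra_rounds: float = 2,
    cost_per_extra_round: float = 0.016,
) -> float:
    """Monthly bill in dollars when the multi-hop trigger rate moves off baseline."""
    if not 0.0 <= multi_hop_share <= 1.0:
        raise ValueError("multi_hop_share must be between 0 and 1")
    # Only the multi-hop premium changes; every other line item is held fixed.
    premium_change = (multi_hop_share - baseline_share) * extra_rounds * cost_per_extra_round
    return (per_query_at_baseline + premium_change) * monthly_queries


if __name__ == "__main__":
    for share in (0.10, 0.20, 0.35, 0.50):
        print(f"multi-hop on {share:.0%} of queries: ${large_rag_monthly_bill(share):,.0f}/mo")
```

The sketch holds retrieval and rerank calls constant, so extra retrieval rounds that a real agentic loop triggers would push the numbers higher than it reports.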

Hidden costs nobody talks about

The per-query tables above already include the line items most teams forget. Here is what makes those line items secretly worse.

Reranker token expansion. A cross-encoder rerank flat fee looks clean on the invoice, but if you are using an LLM as a reranker (some teams do this with GPT-5-mini), you pay for every candidate chunk again. Reranking 30 candidates at 400 tokens each with GPT-5-mini adds 12,000 tokens per query - roughly $0.003 at $0.25 per 1M. That is about 50% more than the $0.002 a dedicated reranker charges, with worse latency and worse quality.

Multi-hop retrieval multipliers. A 3-hop agentic RAG query does three retrievals and three generations, not just three retrievals. Each generation step writes a planning trace that becomes input to the next step, so context grows. By round three a query that started at 3,000 input tokens can be at 9,000 - and the rate stays the same. Multi-hop is roughly 3x to 5x the single-hop cost, not 3x.

Observability at write-heavy scale. A traced RAG query writes a parent span, 3 to 8 child spans (embed, retrieve, rerank, generate), and 5 to 15 attributes per span. Self-hosted Langfuse handles this for the cost of database writes; hosted observability platforms meter events. At 10M queries per month with full tracing, the observability bill is real - typically $1.5K to $6K depending on vendor. See the LLM observability comparison for the per-tier pricing.

Reindex on model change. Embedding models are upgraded yearly. Each upgrade requires reembedding the full corpus. A 100M-chunk corpus at 400 tokens per chunk is 40B tokens - $800 with text-embedding-3-small, $5,200 with text-embedding-3-large, $2,400 with Voyage-3. Teams that chase MTEB scores reindex 3 to 5 times per year and pay this fee each time. Pick the embedding model once, evaluate carefully against your own data using the embedding models comparison, and resist swapping for single-digit MTEB gains.
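The reindex figures above are simple token arithmetic. A minimal sketch, where the text-embedding-3-large rate of $0.13 per 1M tokens is an assumption back-solved from the $5,200 figure (the other two rates are quoted earlier in this post):

```python
# Dollars per 1M tokens. The -large rate is assumed (back-solved from the text).
EMBED_RATES = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "voyage-3": 0.06,
}


def reembed_cost(chunks: int, tokens_per_chunk: int, rate_per_million: float) -> float:
    """One-pass cost in dollars to re-embed a full corpus."""
    total_tokens = chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * rate_per_million


if __name__ == "__main__":
    for model, rate in EMBED_RATES.items():
        one_pass = reembed_cost(100_000_000, 400, rate)
        # Reindexing 3 to 5 times a year multiplies the one-pass fee.
        print(f"{model}: ${one_pass:,.0f} per pass, "
              f"${3 * one_pass:,.0f} to ${5 * one_pass:,.0f} per year at 3-5 passes")
```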

Storage growth. Vector DBs charge per-million vectors stored per month. A corpus that doubles every quarter doubles its storage line every quarter regardless of query traffic. Set a retention policy on stale documents from day one.

The cheapest RAG patterns

Once you have measured your per-query cost, four patterns compound. Apply them in order - earlier patterns make later ones cheaper to add.

Cache embeddings by content hash. Hash each chunk before embedding; skip any chunk whose hash already exists. For a knowledge base that changes slowly (5 to 10% chunk churn per month is typical), this drops embedding spend 85 to 95% after the initial backfill. The same pattern protects you from reindex surprises when a source document is reuploaded unchanged.
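A minimal sketch of the content-hash pattern, assuming an in-memory dict as the cache and a caller-supplied `embed_batch` function (both hypothetical stand-ins for your real store and embedding client). The model name is folded into the key so a model swap does not return stale vectors.

```python
import hashlib
from typing import Callable, Sequence

Vector = list[float]


def content_hash(text: str, model: str) -> str:
    """Cache key: SHA-256 over the model name and the exact chunk text."""
    return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()


def embed_with_cache(
    chunks: Sequence[str],
    model: str,
    embed_batch: Callable[[list[str]], list[Vector]],
    cache: dict[str, Vector],
) -> list[Vector]:
    """Embed only chunks whose hash is not cached; return vectors in input order."""
    hashes = [content_hash(chunk, model) for chunk in chunks]

    # Collect unique misses, preserving first-seen order.
    missing: dict[str, str] = {}
    for key, chunk in zip(hashes, chunks):
        if key not in cache and key not in missing:
            missing[key] = chunk

    if missing:
        vectors = embed_batch(list(missing.values()))
        if len(vectors) != len(missing):
            raise ValueError("embed_batch returned the wrong number of vectors")
        for key, vector in zip(missing, vectors):
            cache[key] = vector

    return [cache[key] for key in hashes]
```

In production the dict would be a table or key-value store keyed by the same hash; the billing effect is the same, since only the misses reach the embedding API.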

Cache retrievals by query embedding. Many queries are semantically near-duplicates of recent ones. A short-TTL embedding-keyed cache (15 to 60 minutes) returns the same top-k chunk IDs without re-querying the vector DB or running the reranker. Typical hit rate on consumer products is 15 to 35%, on internal tools 25 to 50%. Each hit saves the rerank fee and the vector DB query - that is $0.0025 per hit at medium scale, real money at 1M+ queries per day.
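One way to sketch an embedding-keyed retrieval cache is a TTL-bounded list that is scanned for a cached query whose cosine similarity clears a threshold. Everything here is an assumption for illustration: the class name, the 0.97 threshold, and the linear scan, which is only reasonable for a small, short-lived cache.

```python
import math
import time
from typing import Callable, Optional, Sequence


def _normalize(vec: Sequence[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class RetrievalCache:
    """Maps recent query embeddings to the top-k chunk IDs they retrieved."""

    def __init__(
        self,
        ttl_seconds: float = 900.0,        # 15 minutes
        min_similarity: float = 0.97,      # assumed threshold, tune on real traffic
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        self.min_similarity = min_similarity
        self.clock = clock
        self._entries: list[tuple[float, list[float], list[str]]] = []

    def get(self, query_vec: Sequence[float]) -> Optional[list[str]]:
        now = self.clock()
        # Drop expired entries before searching.
        self._entries = [entry for entry in self._entries if entry[0] > now]
        query = _normalize(query_vec)
        best_ids, best_sim = None, self.min_similarity
        for _, cached_vec, chunk_ids in self._entries:
            similarity = sum(a * b for a, b in zip(query, cached_vec))
            if similarity >= best_sim:
                best_ids, best_sim = chunk_ids, similarity
        return best_ids

    def put(self, query_vec: Sequence[float], chunk_ids: Sequence[str]) -> None:
        expires_at = self.clock() + self.ttl_seconds
        self._entries.append((expires_at, _normalize(query_vec), list(chunk_ids)))
```

On a hit, skip both the vector DB query and the rerank call and go straight to generation with the cached chunk IDs. The query still has to be embedded to compute the key, so the embedding fee is not saved.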

Cache generations with prompt-aware keys. Hash the final prompt (system + retrieved chunks + user query) and return the prior response on exact match. Hit rates are low (2 to 8%) but the per-hit savings are the full LLM bill. Pair with OpenAI prompt caching, which catches partial prefix matches even when the user query differs.
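A prompt-aware key can be as simple as a hash over everything that determines the completion. This is a sketch under assumptions: the `generate` callable and the dict cache are hypothetical placeholders, and replaying a stored answer only makes sense where identical prompts should yield identical responses.

```python
import hashlib
import json
from typing import Callable, Sequence


def prompt_key(model: str, system_prompt: str, chunks: Sequence[str], user_query: str) -> str:
    """Stable hash of the final prompt; chunk order is part of the key."""
    payload = json.dumps(
        {"model": model, "system": system_prompt, "chunks": list(chunks), "query": user_query},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_generate(
    model: str,
    system_prompt: str,
    chunks: Sequence[str],
    user_query: str,
    generate: Callable[[str, str, Sequence[str], str], str],
    cache: dict[str, str],
) -> str:
    """Return the stored response on an exact prompt match, else call the LLM once."""
    key = prompt_key(model, system_prompt, chunks, user_query)
    if key not in cache:
        cache[key] = generate(model, system_prompt, chunks, user_query)
    return cache[key]
```

Because the retrieved chunks are part of the key, any reindex that changes the top-k set naturally invalidates the old entry.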

Hop-count gating. Route 80% of queries to single-hop and only escalate the 20% that need multi-hop. The classifier runs on GPT-5-nano at fractions of a cent per query. Done right, this halves total RAG cost without measurable accuracy loss. Without it, multi-hop costs creep into every request and the bill compounds.

Cost vs quality tradeoff

Every cost-cutting move has a quality cost. Here are the four places where it is worth spending more, in priority order.

Reranker. The single highest-ROI add. Cohere Rerank 3.5 or Voyage Rerank lift retrieval quality 15 to 35% over raw vector search and pay for themselves in saved generation tokens. Skip the reranker only on small RAG (under 10K vectors) where top-k is already precise.

Larger top-k before rerank. Retrieving 50 to 100 candidates and reranking down to 6 to 10 outperforms retrieving 20 and reranking. The vector DB cost barely changes; the reranker fee is flat. Most teams under-retrieve before rerank.

Better embeddings on hard corpora. Voyage-3 and text-embedding-3-large outperform text-embedding-3-small by 5 to 15 MTEB points on technical or multilingual corpora. The cost delta is 3x to 6x on embedding spend, but embedding is rarely more than 15% of the bill, so the impact on total cost is modest. Worth it on hard domains; overkill on simple chatbots.

Larger context for hard questions. Top-12 vs top-6 doubles context tokens and roughly doubles input cost on the generation step. Reserve this for queries the classifier flags as complex. Default top-k should stay tight.

Real monthly burn at 3 traffic tiers

Per-query cost makes the bill look small. Multiply by real traffic and the picture lands. Here are the same three pipelines from above at three traffic tiers.

| Pipeline | 10K queries / mo | 1M queries / mo | 10M queries / mo |
|---|---|---|---|
| Small RAG ($0.0016/query) | $16 + $25 DB = $41 | $1,600 + $25 = $1,625 | $16,000 + $25 = $16,025 |
| Medium RAG ($0.005/query) | $50 + $280 DB = $330 | $5,000 + $280 = $5,280 | $50,000 + $280 = $50,280 |
| Large RAG ($0.026/query) | $260 + $3,200 DB = $3,460 | $26,000 + $3,200 = $29,200 | $260,000 + $3,200 = $263,200 |

The fixed-fee vector DB is the part that hurts at low traffic and disappears at high traffic. A team that picks Pinecone Standard at 10K queries per month is paying $3,200 to serve $260 of LLM spend - a 12x overhead. Same team at 10M queries per month pays the same $3,200 to serve $260K of LLM spend - a 1.2% overhead. Match your vector DB pricing curve to your real traffic before signing the contract.
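The overhead ratios quoted above fall out of a two-line calculation. A small sketch, following the table's convention of adding the vector DB fee on top of per-query spend (the function name is hypothetical):

```python
def monthly_burn(per_query_cost: float, monthly_queries: int, vector_db_fee: float) -> tuple[float, float]:
    """Return (total monthly bill, fixed fee as a multiple of usage spend)."""
    usage_spend = per_query_cost * monthly_queries
    return usage_spend + vector_db_fee, vector_db_fee / usage_spend


if __name__ == "__main__":
    # Large RAG on Pinecone Standard at the three traffic tiers from the table.
    for queries in (10_000, 1_000_000, 10_000_000):
        total, overhead = monthly_burn(0.026, queries, 3_200)
        print(f"{queries:>10,} queries/mo: ${total:,.0f} total, fixed fee = {overhead:.3f}x usage spend")
```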

Calculator recipe

Build the spreadsheet before you ship. The columns I keep across every RAG estimate:

  • Queries per month
  • Average chunks retrieved (post-rerank)
  • Average chunk size (tokens)
  • System prompt size (tokens)
  • Average user query size (tokens)
  • Average answer size (tokens)
  • Embedding rate per 1M tokens (query side)
  • Reranker rate per 1K queries
  • LLM input rate per 1M (blended across models)
  • LLM output rate per 1M (blended)
  • Vector DB monthly fee
  • Multi-hop share (0 to 1)
  • Multi-hop extra rounds (avg)
  • Observability rate per traced query
  • Reindex amortization per query
  • Total per query = (query_tokens * embed_rate)/1M + rerank_rate/1K + ((sys + ctx + user) * input_rate + answer * output_rate)/1M * (1 + multi_hop_share * multi_hop_rounds) + observability_rate + reindex_amortization + (vector_db_monthly / monthly_queries), where ctx = chunks retrieved * chunk size and the multi-hop factor applies only to the generation term

Run that formula. Multiply by queries per month. Add 15% for retries, eval sampling, and the gap between your average and 95th percentile queries. That is the number to budget against, and it is usually 2 to 4x the back-of-envelope your first prototype produced.
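The spreadsheet formula translates directly into code. This is a sketch of the same calculation rather than a finished tool; the class and field names are hypothetical and mirror the columns listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagInputs:
    monthly_queries: int
    chunks_retrieved: int           # post-rerank
    chunk_tokens: int
    system_tokens: int
    query_tokens: int
    answer_tokens: int
    embed_rate_per_1m: float        # query side
    rerank_rate_per_1k: float
    input_rate_per_1m: float        # blended across models
    output_rate_per_1m: float       # blended
    vector_db_monthly: float
    multi_hop_share: float          # 0 to 1
    multi_hop_extra_rounds: float   # average extra rounds when multi-hop fires
    observability_per_query: float
    reindex_per_query: float


def cost_per_query(x: RagInputs) -> float:
    """Total per-query cost in dollars, following the formula in the list above."""
    context_tokens = x.chunks_retrieved * x.chunk_tokens
    embedding = x.query_tokens * x.embed_rate_per_1m / 1_000_000
    rerank = x.rerank_rate_per_1k / 1_000
    generation = (
        (x.system_tokens + context_tokens + x.query_tokens) * x.input_rate_per_1m
        + x.answer_tokens * x.output_rate_per_1m
    ) / 1_000_000
    # Each extra multi-hop round is priced as one more full generation.
    generation *= 1 + x.multi_hop_share * x.multi_hop_extra_rounds
    vector_db = x.vector_db_monthly / x.monthly_queries
    return (embedding + rerank + generation
            + x.observability_per_query + x.reindex_per_query + vector_db)


def monthly_budget(x: RagInputs, safety_margin: float = 0.15) -> float:
    """Monthly budget with the 15% margin for retries, evals, and p95 queries."""
    return cost_per_query(x) * x.monthly_queries * (1 + safety_margin)
```

Fed the large-RAG inputs from earlier (12 chunks of 600 tokens, 2,000-token system prompt, $1.10 / $9.00 blended rates, 20% multi-hop with 2 extra rounds), it lands within rounding of the table's per-query figure. The formula has no separate query-rewrite term, so add one if your pipeline rewrites queries with an LLM.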

When RAG is too expensive

RAG is the default answer to 'give the model knowledge it does not have', but it is not always the cheapest. Two alternatives win on cost in specific situations.

Long-context. If your full corpus fits in 200K tokens, just put it in the prompt every time. With OpenAI prompt caching on the static corpus, the input cost drops to roughly 10% of nominal on cached tokens. You skip the vector DB, the reranker, the reindex, and the observability complexity. The crossover where RAG becomes cheaper is roughly 1M tokens of corpus with steady traffic - below that, long-context plus caching often wins.
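To compare against your RAG per-query figure, the long-context cost can be sketched as below. It assumes every request hits the prompt cache on the corpus and that cached tokens bill at 10% of the nominal input rate, as described above; cold requests pay the full rate, so treat the result as a best case. The function and parameter names are hypothetical.

```python
def long_context_cost_per_query(
    corpus_tokens: int,
    query_tokens: int,
    answer_tokens: int,
    input_rate_per_1m: float,
    output_rate_per_1m: float,
    cached_rate_fraction: float = 0.10,   # assumed cached-token discount
) -> float:
    """Best-case per-query cost in dollars when the whole corpus sits in the prompt."""
    cached_input = corpus_tokens * input_rate_per_1m * cached_rate_fraction
    fresh_input = query_tokens * input_rate_per_1m
    output = answer_tokens * output_rate_per_1m
    return (cached_input + fresh_input + output) / 1_000_000


if __name__ == "__main__":
    # 200K-token corpus at the GPT-5-mini rates used earlier in this post.
    for corpus in (50_000, 200_000, 1_000_000):
        cost = long_context_cost_per_query(corpus, 80, 320, 0.25, 2.00)
        print(f"{corpus:>9,} corpus tokens: ${cost:.5f} per query")
```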

Fine-tuning. If the task is narrow and stylistic rather than knowledge-heavy - classify into one of 20 categories, draft a response in a brand voice, extract a fixed schema - fine-tuning a small model can be cheaper per query than RAG with a frontier model. The breakeven depends on volume; see the full decision tree in fine-tuning vs RAG. For most knowledge-base products above 1M tokens of corpus with steady traffic, RAG stays 5 to 50x cheaper than either alternative - but check the math before committing.

OmniAPI real numbers

Anonymized but real numbers from OmniAPI - a unified API layer I help operate. The knowledge-base RAG feature serves developer documentation queries across roughly 4.2 million chunks from 80 source repositories. Average chunk size 420 tokens. Embeddings on text-embedding-3-small after a carefully evaluated swap from -large. Qdrant Cloud for vectors. Cohere Rerank 3.5. GPT-5-mini default, GPT-5 on 8% of queries, 2-hop agentic fallback on 12%.

| Metric | Value |
|---|---|
| Queries per month | ~3.1M |
| Average per-query cost | $0.0061 |
| Monthly RAG bill | ~$18,900 |
| Of which: generation (LLM) | ~$13,400 (71%) |
| Of which: reranker | ~$2,600 (14%) |
| Of which: vector DB (Qdrant) | ~$1,100 (6%) |
| Of which: embeddings + reindex amortization | ~$900 (5%) |
| Of which: observability + misc | ~$900 (5%) |
| Cache hit rate (retrieval) | 28% |
| Cache hit rate (generation) | 5% |

Two things to note. First, the per-query cost lands only about 20% above the $0.0050 medium-RAG estimate from earlier in this post, even though this pipeline adds a 2-hop agentic fallback on 12% of queries - the caching layer and a sharp hop-count classifier are what keep that gap small. Second, the bill is dominated by generation despite all the optimizations; once you have squeezed the other line items, the LLM call is the only number that moves. The pricing pages for OpenAI and Cohere are the source of truth on per-1M rates; confirm before quoting.

If you are about to ship a RAG pipeline at any of these scales and want it built right from day one, the AI integration engagement includes cost guardrails in the default scope. The full hiring path lives at hire an AI developer in Kosovo. For the full RAG cost playbook in the context of the wider OpenAI bill, pair this post with my OpenAI API cost breakdown. The numbers above include shipped work for OmniAPI.

Frequently asked questions

What does a RAG query actually cost in 2026?

It depends on corpus size and pipeline complexity, but the rough anchors from the tables above are about $0.0016 per query for a small RAG (under 100K vectors, single retrieval, GPT-5-mini), about $0.005 for a medium RAG with reranking, and about $0.026 for a large multi-hop RAG with long context. Most teams under-budget by 3x because they forget reranker fees, observability, and reindex cost when models change.

Which line item dominates RAG cost at scale?

Generation, almost always. The embedding model is cheap (text-embedding-3-small is $0.02 per 1M tokens). The vector DB is a fixed monthly fee that amortizes down to fractions of a cent per query. The reranker adds 10 to 25%. But the LLM call - system prompt + retrieved chunks + answer - typically eats 70 to 90% of the per-query cost once you are past 50K vectors.

Is it cheaper to use a reranker or just retrieve more chunks?

Reranker, in almost every case. Retrieving 30 chunks and stuffing all of them into the generation prompt costs more in LLM input tokens than retrieving 50 chunks, reranking with Cohere Rerank 3.5 at $2 per 1K queries, and passing the top 6 to generation. The reranker is roughly $0.002 per query; the saved generation tokens are usually $0.005 to $0.010.

When does multi-hop retrieval blow up the bill?

When a query triggers two or more retrieval rounds with full generation between each round. A 3-hop agentic RAG query at medium scale runs roughly $0.015 to $0.025 - 3x to 5x the ~$0.005 single-hop query. Multi-hop is worth it for hard questions where single-hop accuracy is below 70%, but route to it selectively. A classifier on GPT-5-nano can decide hop count for under $0.0001 per query.

How much does observability really add to RAG cost?

Self-hosted Langfuse adds essentially zero per-query cost beyond a tiny database write. Hosted Langfuse, LangSmith, or Helicone add roughly $0.0001 to $0.0005 per traced query at production tier. At 10M queries per month that is $1K to $5K - small relative to the LLM bill but not free, and worth pricing in.

Should I cache retrievals or just the generation step?

Cache both, separately. Retrieval cache (query embedding to top-k chunk IDs) cuts vector DB load and reranker cost on repeat queries - typical hit rate 15 to 35% on consumer products. Generation cache (full prompt hash to response) saves the full LLM cost on exact hits - hit rate is lower (2 to 8%) but the savings per hit are large. OpenAI prompt caching layers on top for partial hits on the system prompt.

When does long-context or fine-tuning beat RAG on cost?

Long-context wins when your corpus is under ~200K tokens and queries are infrequent - you skip the vector DB entirely, pay once per query with prompt caching, and avoid reindex overhead. Fine-tuning wins when the task is narrow and stylistic rather than knowledge-heavy. For most production knowledge bases above 1M tokens with steady traffic, RAG remains 5 to 50x cheaper per query than stuffing the whole corpus into context.

What is the single biggest mistake teams make on RAG cost?

Reindexing the entire corpus when they switch embedding models. A 100M-chunk corpus at 400 tokens per chunk costs $5,200 to reembed with text-embedding-3-large in one pass. Teams do this 3 to 5 times in the first year as they chase MTEB scores. The fix is to evaluate embedding swaps on a fixed sample before committing, and to keep a hash-keyed embedding cache so unchanged chunks are not re-billed when source documents are updated.