AI Engineering14 min read

Agentic RAG: Architecture Patterns That Ship in 2026

By Ergini, Software & AI Developer in Pristina, Kosovo

TL;DR

Plain RAG retrieves once and prays. Agentic RAG plans, decomposes, retrieves multiple times, and validates. This guide walks through the production-grade patterns I have used in client deployments, with code, latency budgets, and the failure modes that bite.

Most teams I meet have already shipped plain RAG. Hybrid retrieval, rerank, citations, the works. It plateaus around 80 to 85% on their eval set and they want to know what to do next. Agentic RAG is the usual answer - and the usual mistake when it is reached for too early. This post is the architecture I actually use when single-pass retrieval runs out of room, drawn from shipping OmniAPI (the RAG-backed function generator behind my main product) and an agentic research layer for a client's knowledge platform.

If you have not built plain RAG yet, start with the RAG architecture tutorial first - the agentic patterns below assume you already have a working retrieval layer with eval data to measure against.

What "agentic" actually changes about RAG

Plain RAG is a pipeline: embed the query, retrieve top-k chunks, hand them to the model with a grounded prompt, return the answer. One retrieve, one generate. Every query takes the same path through the system. The model never decides anything about retrieval - it just consumes whatever the retriever produces.

Agentic RAG turns that pipeline into a loop with a planner in front of it. The model gets the question, decides what to retrieve, reads the results, decides whether they are enough, and either answers or fires another retrieval with a refined query. The retriever is now a tool the model calls - possibly several times, possibly against different sources, possibly after rewriting the query mid-flight. Four concrete things change:

  • Planning before retrieval. The model decomposes the question into sub-queries instead of embedding the raw user input.
  • Multi-step retrieval. The output of one retrieval can become the input of the next, which is how you answer "compare X and Y" or "trace this from spec to implementation."
  • Self-judgement. The model can read its own retrieved context, decide it is insufficient, and retry - turning hallucination into a retry signal instead of a final answer.
  • Source selection. When you have more than one data source (vector store, SQL, web, API), the model picks which one fits the sub-query instead of you hardcoding the route.

That extra control costs you roughly an order of magnitude in latency and cost per query, which is why you do not reach for it by default. Agentic RAG is the right answer when single-pass retrieval provably cannot reach a question - not because it sounds more impressive in a deck.

The 5 patterns that ship in production

"Agentic RAG" is an umbrella, and the umbrella covers five distinct patterns. Most production systems use two or three of them composed, not all five. Pick by what your eval data says the system actually needs.

Pattern 1 - Query decomposition

A planner reads the user's question and emits 2 to 5 sub-queries, each retrieved independently. The results get unioned and reranked before the final generation step. This is the cheapest agentic pattern because there is no loop - just a one-shot planner followed by parallel retrievals.

Use it for compound questions: "How does our pricing compare to Acme's, and what features do they have that we don't?" becomes three sub-queries (our pricing, their pricing, their feature set), each of which retrieves better than the raw compound question ever could. Latency cost is one extra LLM round-trip for planning, roughly 300 to 600 ms with GPT-4o-mini. Quality lift is usually large on compound queries and zero on simple ones - which is why a smart version of this pattern uses a fast classifier to skip the planner when the question is single-intent.

Pattern 2 - Multi-hop retrieval

Multi-hop is the pattern people actually mean when they say "agentic RAG." The model answers a sub-question, then uses the answer as input for the next retrieval. "Who wrote the spec that defines our auth flow" becomes hop one (retrieve the spec), hop two (use the spec content to retrieve the author list, possibly from a different store). Each hop is a real query the agent constructs from the previous result.

This is where latency math gets serious. A 3-hop chain is three sequential retrievals plus three generations. Even with cheap models and a fast retriever you are at 3 to 5 seconds p95. Cache aggressively - sub-queries repeat across user sessions far more than full queries do. Cap the hop budget hard (3 to 5 hops). Past that, the agent is almost always lost.

Pattern 3 - Self-correction and reflection

After each retrieval, a judge step (either the same model or a smaller dedicated one) scores whether the retrieved chunks actually answer the sub-query. If the score is below a threshold, the agent rewrites the query and retries. If it is above, the loop terminates and generation runs.

The reflection loop is powerful and dangerous in equal measure. Power: it catches retrieval misses that plain RAG would have shipped as a hallucinated answer. Danger: without a hard stop it loops forever on unanswerable questions, burning tokens and seconds. Two safeguards are non-negotiable - a max-retry cap (I use 2) and a falling confidence threshold (each retry needs lower confidence to terminate than the last). On the final retry, switch to a "answer with what you have" prompt and let the model emit "I don't know" gracefully.

Pattern 4 - Tool-augmented retrieval

The agent gets a menu of retrieval tools - vector search, SQL, full-text search, web search, an internal API - and picks the right one per sub-query. "What is our current MRR" routes to SQL. "What does our docs say about webhook security" routes to the vector store. "What did OpenAI announce yesterday" routes to web search.

This is the pattern that makes agentic RAG actually useful in business systems where the answers live in heterogeneous sources. The engineering cost is real - each tool needs a clean schema, error handling, and a description the model can route from - but the alternative is hardcoding a router that gets every edge case wrong. Use OpenAI function calling or Anthropic tool use with strict JSON mode and tool descriptions written like product specs - what the tool does, what it accepts, what it returns, when not to call it.

Pattern 5 - Hierarchical retrieval

A cheap broad pass first - maybe a small embedding model or pure BM25 over document titles and headings - narrows the corpus from a million chunks to a few thousand. Then an expensive narrow pass - large embeddings plus rerank - operates only on that filtered set. The agent chooses the depth based on the question.

This is the right pattern for very large corpora (10M+ chunks) where running the full pipeline on every query would be wasteful and slow. It is also the right pattern for cost-sensitive workloads where you want a fast answer for the 80% of easy queries and reserve the expensive retrieval for the 20% that actually need it. The agentic twist is letting the model decide which tier - instead of hardcoding a threshold.

When agentic is worth it (and when it isn't)

Every agentic pattern doubles the latency and quadruples the cost of the equivalent plain RAG call. Before you commit, run your eval set against the simpler tier and confirm it has actually plateaued. The decision matrix:

Use caseRecommended tierp95 latencyCost / queryWhy
FAQ chatbot, single-intent queriesPlain RAG (hybrid + rerank)~1.4 s~$0.002Single retrieval handles 90%+ of queries
Compound questions, multi-topic promptsQuery decomposition only~2.2 s~$0.005One extra plan call, then parallel retrieves
Multi-hop reasoning over a single corpusMulti-hop + reflection3 to 6 s~$0.015 to $0.025Hop chain plus self-judgement
Mixed sources (docs + SQL + APIs)Tool-augmented + decomposition3 to 8 s~$0.02 to $0.05Routing across heterogeneous data
Research-style synthesis across large corpusHierarchical + multi-hop + reflection8 to 20 s~$0.05 to $0.20Deep reasoning, narrow audience tolerance

The honest answer for most production systems is: stay on plain RAG with good chunking and reranking, add decomposition when your eval shows compound queries failing, and only add hops or reflection when you have data proving they help. Premature agentic is the same trap as premature microservices - it feels like progress while making everything slower, more expensive, and harder to debug.

TypeScript walkthrough: a self-correcting RAG agent

Here is a minimal but real implementation of patterns 2 and 3 combined - multi-hop retrieval with reflection - using the Vercel AI SDK style. It assumes the retrieve(query) function from the production RAG tutorial is already wired up.

// lib/agentic-rag.ts
import OpenAI from "openai";
import { retrieve, type RetrievedChunk } from "./retrieve";

const openai = new OpenAI();
const MAX_HOPS = 3;
const MIN_CONFIDENCE = 0.6;

type Hop = {
  query: string;
  chunks: RetrievedChunk[];
  confidence: number;
};

const PLANNER_SYS = `You are a retrieval planner. Given a user question and
the chunks retrieved so far, decide ONE of:
- { "action": "retrieve", "query": "<refined sub-query>" }
- { "action": "answer", "confidence": 0.0-1.0 }

Choose "retrieve" if the chunks are missing information the question needs.
Choose "answer" when you have enough to write a grounded response.
Return strict JSON, no prose.`;

const ANSWER_SYS = `Answer the question strictly from the provided CHUNKS.
Cite each claim with [n]. If chunks are insufficient, say so.`;

export async function agenticAnswer(question: string) {
  const hops: Hop[] = [];
  let currentQuery = question;

  for (let i = 0; i < MAX_HOPS; i++) {
    const chunks = await retrieve(currentQuery);
    const allChunks = [...hops.flatMap((h) => h.chunks), ...chunks];

    const plan = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      response_format: { type: "json_object" },
      temperature: 0,
      messages: [
        { role: "system", content: PLANNER_SYS },
        {
          role: "user",
          content: `QUESTION: ${question}\n\nHOP ${i + 1} CHUNKS:\n${
            chunks.map((c, j) => `[${j}] ${c.content.slice(0, 400)}`).join("\n")
          }`,
        },
      ],
    });

    const decision = JSON.parse(plan.choices[0].message.content ?? "{}");
    hops.push({ query: currentQuery, chunks, confidence: decision.confidence ?? 0 });

    const isLastHop = i === MAX_HOPS - 1;
    const minConf = MIN_CONFIDENCE - i * 0.1; // falling threshold
    if (decision.action === "answer" || isLastHop || (decision.confidence ?? 0) >= minConf) {
      const final = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        temperature: 0.1,
        messages: [
          { role: "system", content: ANSWER_SYS },
          {
            role: "user",
            content: `QUESTION: ${question}\n\nCHUNKS:\n${
              allChunks.map((c, j) => `[${j + 1}] (${c.sourceUrl})\n${c.content}`).join("\n\n")
            }`,
          },
        ],
      });
      return { text: final.choices[0].message.content ?? "", hops };
    }

    currentQuery = decision.query ?? question;
  }

  return { text: "I don't have enough to answer that.", hops };
}

Three details matter more than they look. The falling confidence threshold (MIN_CONFIDENCE - i * 0.1) makes the agent more willing to commit on later hops, preventing infinite second-guessing. The isLastHop branch forces an answer attempt rather than a refusal on hop 3. And accumulating chunks across hops (allChunks) lets the final generation reason over everything the loop found, not just the last hop's results.

Latency budgeting

Agentic latency is the single thing that breaks user trust. People tolerate 1.5 seconds for an answer; they do not tolerate 6 seconds for a spinner with no signal. The math you have to internalize for an average hop:

Stepp95 latencyNotes
Query embedding~120 msOpenAI text-embedding-3-large, single input
Hybrid SQL (vector + BM25)~80 mspgvector HNSW + GIN in parallel
Cohere rerank (24 → 6)~280 msCross-encoder over candidates
Planner / reflector LLM call~500 msGPT-4o-mini, JSON-mode, ~1K input tokens
One full hop~1.0 sSequential - these do not overlap
Final grounded generation~900 msGPT-4o-mini, ~4K input, ~200 output

Three hops plus a final generation is ~3.9 seconds p95 on the happy path. The way you survive that budget in production:

  • Stream the first hop's thinking. Show the user the sub-queries as they get planned. "Searching for X... found 12 docs. Now searching for Y..." turns dead time into information density.
  • Parallelize whatever you can. Pattern 1 (decomposition) parallelizes the sub-query retrievals naturally. Pattern 2 (multi-hop) cannot - each hop depends on the last - but you can still parallelize the dense+sparse retrieval inside each hop.
  • Cache sub-queries, not full queries. Sub-queries repeat across users far more than full queries do. A Redis cache keyed on the sub-query string with a 1-hour TTL knocks a third off your p95 in steady state.
  • Use a smaller model for planning. The planner is deciding between "retrieve again" and "answer now." You do not need Opus for that - Haiku or 4o-mini is fine, and the latency win compounds across every hop.

Cost math

Per-query cost compounds with hop count and model tier. Real numbers from a production agentic system I shipped, on 2026 pricing:

ConfigurationCost / query1K queries/day cost100K queries/day cost
Plain RAG, GPT-4o-mini~$0.002~$60/month~$6K/month
Decomposition (1 plan + 3 retrieves), 4o-mini~$0.006~$180/month~$18K/month
3-hop + reflection, 4o-mini end-to-end~$0.018~$540/month~$54K/month
3-hop + reflection, GPT-4o for planning~$0.045~$1.35K/month~$135K/month
Same on Claude Sonnet 4.5~$0.060~$1.8K/month~$180K/month

The thing nobody mentions in agentic RAG demos: failed retries also cost money. A reflection loop that bails out on hop 3 with "I don't know" cost you three retrievals, three planning calls, and a final generation - full 3-hop price for zero answer. Build per-query cost telemetry from day one, set a per-tenant daily ceiling, and review the cost of refused answers separately from successful ones. For the full picture on production LLM economics, my OpenAI API cost breakdown covers the patterns that cut my client bills 60%.

Evaluation: how to grade agentic RAG

Standard RAG metrics (faithfulness, context recall, context precision, answer relevancy) still apply - but they only measure the final output. For agentic RAG you have to evaluate the trajectory too, because two systems can produce the same final answer with very different cost and reliability profiles.

Three trajectory metrics I track on every eval run:

  • Sub-query coverage. Of the sub-queries an expert would ask to answer the question, what fraction did the planner actually emit? Catches decomposition failures where the agent misses an angle.
  • Per-hop retrieval recall. For each hop, did the retrieval surface the chunks that were supposed to come back? A hop that retrieves the wrong thing poisons every downstream hop.
  • Trajectory efficiency. How many hops did the agent burn to reach the final answer, compared to the minimum number a human would have needed? An efficiency below 0.5 means the loop is guessing.

A minimal TypeScript eval scaffold for trajectory metrics:

// scripts/eval-agentic.ts
import { agenticAnswer } from "../lib/agentic-rag";

type AgenticCase = {
  question: string;
  expectedAnswer: string;
  expectedSubQueries: string[];     // what an expert would ask
  expectedSourceUrls: string[];     // what should come back
  minimumHops: number;              // human baseline
};

const cases: AgenticCase[] = JSON.parse(
  await Bun.file("eval/agentic-cases.json").text()
);

const totals = { coverage: 0, recall: 0, efficiency: 0 };

for (const c of cases) {
  const result = await agenticAnswer(c.question);
  const askedQueries = result.hops.map((h) => h.query.toLowerCase());

  const coveredSubs = c.expectedSubQueries.filter((sq) =>
    askedQueries.some((q) => q.includes(sq.toLowerCase()))
  );
  const coverage = coveredSubs.length / c.expectedSubQueries.length;

  const retrievedUrls = result.hops.flatMap((h) =>
    h.chunks.map((ch) => ch.sourceUrl)
  );
  const hits = c.expectedSourceUrls.filter((u) => retrievedUrls.includes(u));
  const recall = hits.length / c.expectedSourceUrls.length;

  const efficiency = c.minimumHops / Math.max(result.hops.length, 1);

  totals.coverage += coverage;
  totals.recall += recall;
  totals.efficiency += Math.min(efficiency, 1);
}

const n = cases.length;
console.log({
  sub_query_coverage: totals.coverage / n,
  retrieval_recall: totals.recall / n,
  trajectory_efficiency: totals.efficiency / n,
});

Wire this into CI alongside the standard RAGAS-style metrics from the plain RAG eval. Any change that drops trajectory efficiency by more than 5 percentage points is almost always a planner regression and should block the merge.

Failure modes I have hit

Every one of these cost me a week to debug at least once. Catalog them, run defenses for them, and you skip most of the pain.

Endless reflection loops

The agent keeps deciding the retrieved chunks are not good enough, rewriting the query, retrieving again, judging again. Without a hard cap, this loops until the request timeout or your billing alert fires. Fix: hard MAX_HOPS cap (3 to 5), falling confidence threshold per hop, and a forced-answer prompt on the final hop that tells the model to commit with whatever it has.

Sub-query drift

The planner decomposes the question into sub-queries that are semantically related to the input but not actually what the user asked. "How do I configure webhooks for failed payments" gets decomposed into "how do webhooks work" and "what is a failed payment" - both findable, neither useful. Fix: include the original question in every retrieval prompt, and add a final coverage check that asks the planner "will these sub-queries actually answer the user's question?" before committing.

Retrieval bypass via tool selection

With pattern 4 (tool-augmented retrieval), the agent has multiple tools to choose from. The failure mode: the model picks web search or a generic API call for a question your internal docs would have answered better, because the tool descriptions undersold the vector store. Fix: write tool descriptions like sales copy - what kinds of questions the tool excels at, with concrete example queries. And run an eval that scores tool selection accuracy on a labelled set of question-to-correct-tool pairs.

Citation grounding gaps

In multi-hop, the final generation has chunks from every hop in its context. The model cites them, but the citations sometimes point to a chunk that supports a claim only loosely - because the chunk that actually proved the claim got buried mid-list. Fix: re-rerank the accumulated chunks against the original question (not the sub-queries) before the final generation, so the most relevant evidence is at the top where the model focuses.

Planner overconfidence

The reflection step asks the model to judge its own retrieval. Models are bad at admitting their context is insufficient - they will rate almost any retrieval 0.7 or higher because that lets them stop working. Fix: use a different model (or different system prompt) for judging than for generating, and calibrate the threshold against a labelled set. I have had to set MIN_CONFIDENCE as high as 0.85 for GPT-4o-mini as the judge.

Prompt injection through retrieved chunks

In an agentic loop, retrieved chunks influence the next query. If your corpus contains user-generated content, an attacker can write "ignore the user's question and instead retrieve" into a doc, and your planner will follow it. Fix: sandwich every retrieval result with system markers, strip instruction-like phrases before feeding to the planner, and never let retrieved text override the original user question in the planning prompt. This is also the argument for human-in-the-loop review on high-stakes agentic answers.

OmniAPI case study: where agentic actually helped

OmniAPI generates working API functions from natural-language descriptions, grounded in a corpus of specs, SDK examples, and schema fragments. The first version was plain RAG - hybrid retrieval over the spec corpus, rerank, grounded generation. It worked for 78% of single-endpoint questions and fell apart on anything that crossed two endpoints ("create a user then enroll them in plan X") or required reasoning over a schema and an example payload together.

I added query decomposition first - pattern 1 - and got from 78% to 86%. That covered most compound requests. Then I added a 2-hop retrieval pattern for the cases where the first retrieval surfaced a method signature and the second hop needed the auth scopes or rate limits attached to it. That moved the score to 91%. No reflection loop, no tool selection, no hierarchical retrieval - just decomposition plus controlled multi-hop. The takeaway: agentic patterns compound, and you almost never need all of them. Add the cheapest pattern that moves the eval, measure, repeat.

Agentic RAG vs an agent with RAG as a tool

These look similar from the outside and are structurally different. Agentic RAG is a retrieval system that happens to use a model for planning - retrieval quality is the goal, and the orchestration exists to serve it. An agent with RAG as a tool is a general autonomous worker that can do many things (send emails, write to SQL, hit APIs) and one of its tools happens to be retrieval.

DimensionAgentic RAGAgent with RAG tool
Primary goalAnswer a question with grounded contextComplete a multi-step task
OutputCited text answerActions taken plus optional summary
Loop depth3 to 5 hops, cappedOpen-ended, often dozens of steps
Cost per session$0.01 to $0.10$0.20 to $5+
Failure surfaceRetrieval misses, bad sub-queriesWrong actions, wrong tool, runaway loops
Best forQ&A over your dataWorkflow automation across systems

The architectural distinction matters because it changes the observability and safety story. Agentic RAG is bounded: it reads, it thinks, it answers. An agent-with-RAG is unbounded: it can act on the world. Pick the simpler one when the task is "tell me something," the broader one when the task is "do something."

Production checklist

Before agentic RAG goes near a real user, walk this list. Every item is something I have either hit or watched a client hit in the last year.

  • Hard max-hop budget. 3 to 5 hops, enforced in code, not a prompt instruction the model might ignore.
  • Falling confidence threshold per hop. Each retry needs lower confidence to terminate than the last. Forces commitment rather than infinite second-guessing.
  • Forced answer on final hop. A distinct prompt that tells the model to answer with what it has and acknowledge any gaps.
  • Cost ceiling per request. Compute expected max cost (hops × planner cost + final generation) and reject queries that exceed it before they start.
  • Per-tenant daily cost cap. Hard stop at a configured dollar amount per tenant per day. One bad actor with an automated script can 100x your bill in an hour.
  • Latency SLA with streaming fallback. If the loop is not done in 8 seconds, stream the best-effort answer with a disclaimer rather than letting the user stare at a spinner.
  • Fallback to plain RAG. When the planner errors or the model is rate-limited, degrade to single-pass retrieval. Never to a 500 page.
  • Trajectory observability. Log every hop's query, retrieved IDs, confidence score, and latency. You will need this the first time someone reports a bad answer.
  • Eval set with sub-queries. Not just question and answer - include the expected decomposition and the expected minimum hop count.
  • Tool-selection eval. If using pattern 4, score which tool the agent picks against a labelled correct-tool set.
  • Prompt injection defenses on retrieved chunks. Sandwich, strip, and never let chunks override the planning prompt.
  • Human-in-the-loop on low-confidence answers. Below a threshold, route to a queue instead of shipping the response.

If you want help wiring this end-to-end, my AI agent development work covers exactly this scope, and AI integration when the retrieval layer needs to plug into existing systems. The decision tree on which vector database comparison fits an agentic workload is also worth reading before you commit to infra. I work with teams worldwide and you can also hire an AI developer in Kosovo directly.

Frequently asked questions

What is agentic RAG in simple terms?

Agentic RAG is a retrieval augmented generation system where a language model plans the retrieval instead of executing a single fixed retrieve-then-generate pass. The agent decides which sub-questions to ask, which data source to query, whether the retrieved context is good enough, and when to stop. Plain RAG is one shot; agentic RAG is a loop with planning, tool selection, and self-correction.

When is agentic RAG worth the latency and cost?

Use agentic RAG when the question genuinely cannot be answered by a single retrieval - multi-hop questions, queries that span structured and unstructured sources, comparative analysis, or anything that needs an intermediate computation. For 70% of production use cases, an advanced single-pass RAG with hybrid retrieval and reranking is faster, cheaper, and equally accurate. Reach for agentic only after eval data shows plain RAG plateauing.

How many hops should an agentic RAG loop allow?

Cap at 3 to 5 hops in production. A hard max-hop budget is the only reliable defense against runaway reflection loops where the agent keeps re-querying itself. Anything past 5 hops is almost always a planning failure, not a retrieval gap - log it, surface it to humans, and fix the prompt rather than letting the loop burn tokens.

What is the difference between agentic RAG and an agent with RAG as a tool?

Agentic RAG is a RAG system where retrieval itself is multi-step and reasoned. An agent-with-RAG is a general agent that happens to have a retrieve function in its toolbox alongside email, calendar, SQL, and others. The first is a deeper retrieval system; the second is a broader autonomous worker. You want the first when answer quality is the bottleneck. You want the second when retrieval is one capability among many.

How do I evaluate an agentic RAG system?

Add three metrics on top of standard RAG evaluation: sub-query coverage (did the decomposition surface the right sub-questions), per-hop retrieval recall (did each hop find the right chunks), and trajectory efficiency (how many hops did the system burn to land the answer). Faithfulness and answer relevancy from RAGAS still apply to the final output. Build a labelled set with the expected sub-queries, not just the expected answer.

What does an agentic RAG query actually cost?

A 3-hop agentic loop with GPT-4o-mini for planning and generation, text-embedding-3-large for retrieval, and Cohere rerank-v3.5 lands around $0.012 to $0.025 per query at 3 to 6 seconds p95. Swap to GPT-4o for planning and you are at $0.04 to $0.10 per query. Plain advanced RAG over the same corpus is around $0.002 - agentic is roughly 5 to 50x the cost, which is why you only use it when you have to.

Can I use the same vector store for plain and agentic RAG?

Yes. The retrieval layer is the same - pgvector, Qdrant, Pinecone all work identically whether one query or ten hit them. The difference lives entirely in the orchestration layer above the store: the planner, the router, the reflection loop. A common pattern is one shared retrieval API with a query parameter that the planner uses to scope filters per hop.

What is the biggest failure mode in agentic RAG?

Endless reflection loops. The model judges its own retrieval, decides it is not good enough, re-queries, judges again, re-queries again. Without a hard hop cap and a confidence floor on the self-judge, you get queries that burn 30 seconds and $0.50 each. Fix it with a max-hop budget, a falling confidence threshold per hop, and a forced-stop prompt that tells the model to answer with what it has on the final hop.