
RAG Architecture Tutorial: Production System in 2026

By Ergini, Software & AI Developer in Pristina, Kosovo

TL;DR

RAG architecture has four components: ingestion and chunking, embedding and vector store, retrieval, and grounded generation. Naive RAG ships in a weekend; production RAG takes weeks because most failures hide in the retrieval step (73% of them in my own bug reports). This tutorial is the architecture I actually use in production, with code, evaluation, and the five failure modes that bite.

Most RAG tutorials build a toy that works on five PDFs and call it a day. This one is the architecture I actually run in production - the same shape that powers OmniAPI, my AI-generated functions service that answers function-shaping questions over a constantly growing corpus of specs, schemas, and example payloads. After shipping it I rewrote my mental model for what RAG actually needs. This post is that rewrite, with TypeScript code you could lift into a Next.js app today.

What RAG actually is in 2026

RAG (retrieval augmented generation) is a pattern: at query time, fetch relevant chunks from a data store you control and inject them into the prompt so the language model answers from your sources instead of its training data. That is the entire idea. Everything else - embeddings, vector databases, rerankers, hybrid search - is plumbing to make retrieval good enough that the generation step has something useful to ground on.

In 2026 the question is no longer "RAG vs fine-tune." The sensible decision tree is shorter than people pretend. Use long context if the entire corpus fits in the model's window on every request and you do not mind paying for those tokens. Use fine-tuning when you need the model to adopt a style, format, or capability that prompting cannot reliably elicit. Use RAG when the corpus is larger than context, changes more often than you can afford to retrain, or when you need citations and audit trails. Most production systems end up using two of the three.
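The decision tree above can be written down as a small function. This is a minimal sketch of one reading of it: the `CorpusProfile` fields and the `recommendApproaches` name are hypothetical, and the rule that long context is dropped once RAG is already required is an assumption rather than something the text prescribes.

```typescript
// Hypothetical profile of a workload; none of these names come from a library.
type CorpusProfile = {
  corpusTokens: number;        // total corpus size in tokens
  contextWindowTokens: number; // context window of the model you plan to use
  changesOften: boolean;       // updates arrive faster than you could retrain
  needsCitations: boolean;     // answers must point at a source span
  needsStyleOrFormat: boolean; // behaviour that prompting cannot reliably elicit
};

type Approach = "long-context" | "fine-tuning" | "rag";

function recommendApproaches(p: CorpusProfile): Approach[] {
  const picks: Approach[] = [];
  const fitsInContext = p.corpusTokens <= p.contextWindowTokens;

  if (p.needsStyleOrFormat) picks.push("fine-tuning");
  if (!fitsInContext || p.changesOften || p.needsCitations) picks.push("rag");
  // Assumption: long context is only worth paying for when RAG is not already needed.
  if (fitsInContext && picks.indexOf("rag") === -1) picks.push("long-context");

  return picks;
}

const handbook = recommendApproaches({
  corpusTokens: 20_000,
  contextWindowTokens: 128_000,
  changesOften: false,
  needsCitations: false,
  needsStyleOrFormat: false,
});
console.log(handbook); // [ 'long-context' ]
```

A large, fast-changing corpus that also needs a house style comes back as two approaches, which matches the observation that most production systems combine two of the three.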

RAG wins three things that long-context and fine-tuning lose: freshness (a new doc is searchable seconds after ingest), citation (you can show the user the exact source span), and unit economics (you pay for the chunks you actually need, not the whole corpus on every call).

The four components of RAG architecture

Every production RAG system, no matter how fancy, decomposes into four components. If you cannot point to where each lives in your codebase, you are going to debug retrieval failures by guessing.

Ingestion and chunking

Ingestion takes raw documents - markdown, PDF, HTML, transcripts, database rows - and turns them into normalized text. Chunking splits that text into retrieval units small enough to be precise but large enough to carry meaning. The default failure here is splitting on character count and shredding semantic structure. Headers get separated from the paragraphs underneath them, code blocks get cut in half, and tables get severed mid-row.

The cheap fix is structure-aware chunking. Split markdown on h2/h3 boundaries first, then sub-split sections that exceed your token budget. Use a tokenizer (tiktoken for OpenAI, the model's own tokenizer for others) - not character counts - so 500 tokens means 500 tokens. Add 80 to 120 tokens of overlap so concepts that straddle a boundary survive. For PDFs, use a layout-aware parser like Unstructured or LlamaParse rather than raw text extraction; the difference in retrieval quality on real documents is large.

Embedding and vector store

Each chunk becomes a vector via an embedding model. In 2026 the defaults I reach for are text-embedding-3-large (3072 dims, strong general performance, ~$0.13 per million tokens) and voyage-3 for technical or code-heavy corpora. The vector store holds those vectors, the chunk text, and any metadata you want to filter on (source URL, author, timestamp, tenant ID).

For 90% of teams, pgvector on Postgres is the right starting point. It lives alongside your application data, supports HNSW indexes for fast approximate search, and lets you do hybrid retrieval in a single query. Reach for Qdrant, Weaviate, or Pinecone when you need multi-tenant scale, billions of vectors, or specialized quantization. Do not start there - the operational overhead is not worth it before you have a working baseline.

Retrieval (dense, sparse, hybrid)

Dense retrieval (cosine similarity over embeddings) finds semantically similar chunks. Sparse retrieval (BM25, tsvector) finds chunks that share exact terms. They fail in different ways: dense misses rare keywords like product codes or model numbers, sparse misses paraphrases. Hybrid retrieval runs both and fuses the results, then a cross-encoder reranker scores the union to pick the final top-k.
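One common way to fuse two ranked lists before reranking is reciprocal rank fusion (RRF). The text above does not name a fusion method, so treat this as an illustrative option rather than the method used in this stack; the chunk IDs are made up and `k = 60` is a conventional default you would tune.

```typescript
// Reciprocal rank fusion: every list adds 1 / (k + rank) to an item's score,
// so items ranked well by both retrievers rise to the top.
function reciprocalRankFusion(
  rankedLists: number[][],
  k = 60
): { id: number; score: number }[] {
  const scores = new Map<number, number>();

  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }

  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical chunk IDs, best match first in each list.
const denseIds = [11, 42, 7];
const sparseIds = [42, 99, 11];

const fused = reciprocalRankFusion([denseIds, sparseIds]);
console.log(fused.map((f) => f.id)); // [ 42, 11, 99, 7 ]
```

RRF only looks at ranks, so it sidesteps the problem that cosine similarity and full-text scores live on different scales.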

The reranker is the single biggest quality lever after chunking. A bi-encoder embedding squashes both query and document into independent vectors; a cross-encoder reads them together and produces a relevance score that captures interaction the embedding could not. Cohere rerank-v3.5 and BGE-reranker-v2 are the obvious picks. The standard recipe: retrieve top 20 to 40 candidates from hybrid search, rerank, keep the top 4 to 8 for the prompt.

Grounded generation

The final step is calling the LLM with the retrieved chunks and a prompt that forces it to answer from those chunks with citations. Most of the prompt engineering work here is negative: tell the model what not to do (do not use prior knowledge, do not invent citations, say "I don't know" when the chunks do not contain the answer). For most use cases, GPT-4o-mini or Claude Haiku 4.5 are the right defaults - they handle grounded generation well at low cost. Reach for GPT-4o, Claude Sonnet 4.5, or Opus only when you need complex reasoning on top of retrieval.

Naive vs advanced vs agentic RAG

Not every system needs the full machinery. Pick the simplest tier that actually meets your quality bar - moving up the stack roughly doubles latency and quadruples complexity.

| Tier | When to use | Components | p95 latency | Cost / query | Complexity |
| --- | --- | --- | --- | --- | --- |
| Naive | Internal demo, prototype, small static corpus | Embed + top-k + prompt | ~600 ms | ~$0.001 | Low (1 weekend) |
| Advanced | Public-facing product, mixed query types, real users with jargon | Query rewrite + hybrid + rerank + grounded prompt with citations | ~1.4 s | ~$0.002 to $0.005 | Medium (2 to 4 weeks) |
| Agentic | Multi-hop questions, planning over multiple data sources, tool use | Router + multi-query + multi-source retrieval + reflection + tool calls | 3 to 8 s | ~$0.02 to $0.10 | High (6+ weeks) |

Start at advanced. Naive RAG fails the first time a user types something you did not anticipate. Agentic RAG is worth it only when single-hop retrieval provably cannot answer the question - and most of the time it can.

Step-by-step build (TypeScript)

The rest of this tutorial walks through building an advanced RAG system over markdown docs in TypeScript, using Next.js, pgvector, and OpenAI. Every snippet is real code, lightly trimmed for clarity. If you wire these pieces together you have a working advanced-tier RAG.

Setup and dependencies

You need Postgres with the pgvector extension installed, a Node 20+ runtime, and API keys for OpenAI and Cohere. Add these to your package.json:

{
  "dependencies": {
    "openai": "^4.77.0",
    "cohere-ai": "^7.15.0",
    "pg": "^8.13.1",
    "tiktoken": "^1.0.18",
    "unified": "^11.0.5",
    "remark-parse": "^11.0.0",
    "mdast-util-to-string": "^4.0.0"
  }
}

Initialize the pgvector extension and create your schema:

-- migration.sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id           BIGSERIAL PRIMARY KEY,
  source_url   TEXT NOT NULL UNIQUE,  -- UNIQUE is required by ON CONFLICT (source_url) at ingest
  source_hash  TEXT NOT NULL,
  updated_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE chunks (
  id            BIGSERIAL PRIMARY KEY,
  document_id   BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  ord           INT NOT NULL,
  heading_path  TEXT,
  content       TEXT NOT NULL,
  token_count   INT NOT NULL,
  embedding     vector(3072) NOT NULL,
  content_tsv   tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);

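-- pgvector's HNSW index supports at most 2,000 dimensions for the vector type,
-- so index a half-precision cast instead (pgvector 0.7+). Queries must order by
-- the same embedding::halfvec(3072) expression to use this index.
CREATE INDEX chunks_embedding_idx
  ON chunks USING hnsw ((embedding::halfvec(3072)) halfvec_cosine_ops)
  WITH (m = 16, ef_construction = 64);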

CREATE INDEX chunks_tsv_idx ON chunks USING gin (content_tsv);
CREATE INDEX chunks_document_id_idx ON chunks (document_id);

Chunking strategy

Here is a markdown-aware chunker that splits on h2/h3 boundaries, respects a token budget, and overlaps so concepts that span a boundary survive. The key idea: a chunk carries its heading path as a prefix so the embedding model knows the surrounding context even if the chunk text alone is ambiguous.

// lib/chunker.ts
import { unified } from "unified";
import remarkParse from "remark-parse";
import { toString } from "mdast-util-to-string";
import { encoding_for_model } from "tiktoken";

const enc = encoding_for_model("text-embedding-3-large");

export type Chunk = {
  ord: number;
  headingPath: string;
  content: string;
  tokenCount: number;
};

const MAX_TOKENS = 700;
const OVERLAP_TOKENS = 100;

export function chunkMarkdown(md: string): Chunk[] {
  const tree = unified().use(remarkParse).parse(md) as any;
  const sections: { path: string[]; text: string }[] = [];
  let path: string[] = [];

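  for (const node of tree.children) {
    if (node.type === "heading") {
      // A heading replaces everything at or below its own depth.
      path = path.slice(0, node.depth - 1);
      path[node.depth - 1] = toString(node);
      continue;
    }
    const text = toString(node).trim();
    if (!text) continue;

    // Drop the holes a skipped heading level (h1 -> h3) leaves behind.
    const current = path.filter(Boolean);
    const last = sections[sections.length - 1];
    // Merge consecutive blocks under the same heading so chunks follow
    // heading boundaries rather than one chunk per paragraph.
    if (last && JSON.stringify(last.path) === JSON.stringify(current)) {
      last.text += `\n\n${text}`;
    } else {
      sections.push({ path: current, text });
    }
  }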

  const chunks: Chunk[] = [];
  let ord = 0;

  for (const section of sections) {
    const heading = section.path.join(" > ");
    const tokens = enc.encode(section.text);

    if (tokens.length <= MAX_TOKENS) {
      chunks.push({
        ord: ord++,
        headingPath: heading,
        content: `[${heading}]\n${section.text}`,
        tokenCount: tokens.length,
      });
      continue;
    }

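    let cursor = 0;
    while (cursor < tokens.length) {
      const slice = tokens.slice(cursor, cursor + MAX_TOKENS);
      const text = new TextDecoder().decode(enc.decode(slice));
      chunks.push({
        ord: ord++,
        headingPath: heading,
        content: `[${heading}]\n${text}`,
        tokenCount: slice.length,
      });
      // Stop once this window reaches the end; otherwise the next window
      // would contain only tokens the overlap has already covered.
      if (cursor + MAX_TOKENS >= tokens.length) break;
      cursor += MAX_TOKENS - OVERLAP_TOKENS;
    }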
  }

  return chunks;
}

Notice the chunk content starts with the heading path in brackets. That single change typically lifts retrieval precision by 10 to 20% on documentation corpora because queries like "how do I configure webhooks" latch onto chunks whose heading is literally "Webhooks", even when the body text does not contain the word.

Embedding and storing in pgvector

Embedding is straightforward - batch your chunks (OpenAI accepts up to 2048 inputs per request) and insert with the vector column. Always store a content hash on the parent document so you can skip re-embedding when nothing changed.

// lib/ingest.ts
import OpenAI from "openai";
import { Pool } from "pg";
import { createHash } from "node:crypto";
import { chunkMarkdown } from "./chunker";

const openai = new OpenAI();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const EMBED_MODEL = "text-embedding-3-large";

export async function ingestDocument(sourceUrl: string, markdown: string) {
  const hash = createHash("sha256").update(markdown).digest("hex");
  const client = await pool.connect();

  try {
    await client.query("BEGIN");
    const existing = await client.query(
      "SELECT id, source_hash FROM documents WHERE source_url = $1",
      [sourceUrl]
    );
    if (existing.rows[0]?.source_hash === hash) {
      await client.query("COMMIT");
      return { skipped: true, chunks: 0 };
    }

    const doc = await client.query(
      `INSERT INTO documents (source_url, source_hash, updated_at)
       VALUES ($1, $2, now())
       ON CONFLICT (source_url) DO UPDATE
         SET source_hash = EXCLUDED.source_hash, updated_at = now()
       RETURNING id`,
      [sourceUrl, hash]
    );
    const documentId = doc.rows[0].id;
    await client.query("DELETE FROM chunks WHERE document_id = $1", [documentId]);

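    const chunks = chunkMarkdown(markdown);

    // Embed in batches. OpenAI caps a request at 2048 inputs; a smaller batch
    // also keeps each request comfortably within the API's size limits, and
    // the loop skips the API call entirely when there are no chunks.
    const EMBED_BATCH = 256;
    const vectors: number[][] = [];
    for (let i = 0; i < chunks.length; i += EMBED_BATCH) {
      const embed = await openai.embeddings.create({
        model: EMBED_MODEL,
        input: chunks.slice(i, i + EMBED_BATCH).map((c) => c.content),
      });
      vectors.push(...embed.data.map((d) => d.embedding));
    }

    for (let i = 0; i < chunks.length; i++) {
      const c = chunks[i];
      const vector = `[${vectors[i].join(",")}]`;
      await client.query(
        `INSERT INTO chunks
           (document_id, ord, heading_path, content, token_count, embedding)
         VALUES ($1, $2, $3, $4, $5, $6::vector)`,
        [documentId, c.ord, c.headingPath, c.content, c.tokenCount, vector]
      );
    }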

    await client.query("COMMIT");
    return { skipped: false, chunks: chunks.length };
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

Hybrid retrieval with reranking

Now the retrieval step. The function below embeds the query, runs vector and full-text searches in parallel (Postgres ts_rank_cd standing in for BM25), unions the candidates by chunk ID, then reranks with Cohere to produce the final top-k. This is the same shape I run in OmniAPI - the only thing missing is tenant scoping, which you would add as a metadata filter on the SQL.

// lib/retrieve.ts
import OpenAI from "openai";
import { CohereClient } from "cohere-ai";
import { Pool } from "pg";

const openai = new OpenAI();
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

const CANDIDATES = 24;
const FINAL_K = 6;

export type RetrievedChunk = {
  id: number;
  content: string;
  headingPath: string;
  sourceUrl: string;
  score: number;
};

export async function retrieve(query: string): Promise<RetrievedChunk[]> {
  const embedRes = await openai.embeddings.create({
    model: "text-embedding-3-large",
    input: query,
  });
  const qVector = `[${embedRes.data[0].embedding.join(",")}]`;

  const [dense, sparse] = await Promise.all([
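    // Cast to halfvec so pgvector can serve this from an HNSW expression index on
    // embedding::halfvec(3072); a plain vector(3072) column exceeds the 2,000-dim index limit.
    pool.query(
      `SELECT c.id, c.content, c.heading_path, d.source_url,
              1 - (c.embedding::halfvec(3072) <=> $1::halfvec(3072)) AS score
         FROM chunks c
         JOIN documents d ON d.id = c.document_id
         ORDER BY c.embedding::halfvec(3072) <=> $1::halfvec(3072)
         LIMIT $2`,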
      [qVector, CANDIDATES]
    ),
    pool.query(
      `SELECT c.id, c.content, c.heading_path, d.source_url,
              ts_rank_cd(c.content_tsv, plainto_tsquery('english', $1)) AS score
         FROM chunks c
         JOIN documents d ON d.id = c.document_id
         WHERE c.content_tsv @@ plainto_tsquery('english', $1)
         ORDER BY score DESC
         LIMIT $2`,
      [query, CANDIDATES]
    ),
  ]);

  const byId = new Map<number, RetrievedChunk>();
  for (const row of [...dense.rows, ...sparse.rows]) {
    if (!byId.has(row.id)) {
      byId.set(row.id, {
        id: row.id,
        content: row.content,
        headingPath: row.heading_path,
        sourceUrl: row.source_url,
        score: Number(row.score),
      });
    }
  }
  const candidates = [...byId.values()];
  if (candidates.length === 0) return [];

  const reranked = await cohere.rerank({
    model: "rerank-v3.5",
    query,
    documents: candidates.map((c) => c.content),
    topN: FINAL_K,
  });

  return reranked.results.map((r) => ({
    ...candidates[r.index],
    score: r.relevanceScore,
  }));
}
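Tenant scoping is mentioned above but not shown. Here is a minimal sketch of what it could look like for the full-text half of the search, assuming a hypothetical `tenant_id` column on `chunks` (it is not part of the schema in this tutorial). The builder name and return shape are illustrative; node-postgres accepts a `{ text, values }` object in `pool.query`.

```typescript
// Assumption: chunks has a tenant_id column. It is NOT in the migration above.
type ScopedQuery = { text: string; values: (string | number)[] };

function buildTenantSparseQuery(
  tenantId: string,
  query: string,
  limit: number
): ScopedQuery {
  return {
    // The tenant ID is always a bound parameter, never interpolated into the SQL.
    text: `SELECT c.id, c.content, c.heading_path, d.source_url,
              ts_rank_cd(c.content_tsv, plainto_tsquery('english', $1)) AS score
         FROM chunks c
         JOIN documents d ON d.id = c.document_id
        WHERE c.content_tsv @@ plainto_tsquery('english', $1)
          AND c.tenant_id = $3
        ORDER BY score DESC
        LIMIT $2`,
    values: [query, limit, tenantId],
  };
}

const scoped = buildTenantSparseQuery("acme", "rotate webhook secret", 24);
console.log(scoped.values); // [ 'rotate webhook secret', 24, 'acme' ]
```

The dense query would take the same extra `WHERE` clause. Building the query in one place makes it easier to test that no retrieval path can run without a tenant filter.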

Prompt with citations

Grounded generation lives or dies by the prompt. The contract is explicit: the model gets numbered sources, must cite them inline, and must refuse when the sources do not contain the answer. Refusal is a feature - hallucinated answers are worse than "I don't know."

// lib/answer.ts
import OpenAI from "openai";
import { retrieve, type RetrievedChunk } from "./retrieve";

const openai = new OpenAI();

const SYSTEM = `You answer questions strictly from the provided SOURCES.

Rules:
- Cite every factual claim with [n] matching the source number.
- If the sources do not contain the answer, reply exactly: "I don't have that in the sources."
- Do not use prior knowledge. Do not invent URLs or numbers.
- Prefer direct quotes for definitions and exact values.
- Keep answers under 200 words unless the user asks for more.`;

function buildContext(chunks: RetrievedChunk[]) {
  return chunks
    .map(
      (c, i) =>
        `[${i + 1}] (${c.sourceUrl})\n${c.content}`
    )
    .join("\n\n---\n\n");
}

export async function answer(query: string) {
  const chunks = await retrieve(query);
  if (chunks.length === 0) {
    return {
      text: "I don't have that in the sources.",
      sources: [],
    };
  }

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0.1,
    messages: [
      { role: "system", content: SYSTEM },
      {
        role: "user",
        content: `SOURCES:\n\n${buildContext(chunks)}\n\nQUESTION: ${query}`,
      },
    ],
  });

  return {
    text: completion.choices[0].message.content ?? "",
    sources: chunks.map((c, i) => ({
      n: i + 1,
      url: c.sourceUrl,
      heading: c.headingPath,
      score: c.score,
    })),
  };
}

That is the entire pipeline. Wire it into a Next.js route handler and you have a working RAG endpoint that takes a query and returns a cited answer.
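The prompt asks the model to cite with [n], but nothing in the pipeline checks that it did so correctly. A minimal sketch of a post-generation check, assuming the [n] format from the system prompt above; the function name is hypothetical.

```typescript
// Returns citation numbers that appear in the answer but do not match any source.
// sourceCount is the number of chunks that were placed in the prompt.
function findInvalidCitations(answerText: string, sourceCount: number): number[] {
  const cited = new Set<number>();
  const pattern = /\[(\d+)\]/g;
  let match: RegExpExecArray | null;

  while ((match = pattern.exec(answerText)) !== null) {
    cited.add(Number(match[1]));
  }

  return Array.from(cited)
    .filter((n) => n < 1 || n > sourceCount)
    .sort((a, b) => a - b);
}

const sample = "Webhooks retry three times [1][3]. Timeouts default to 30 seconds [7].";
console.log(findInvalidCitations(sample, 6)); // [ 7 ]
```

A non-empty result means the model cited a source it was never given, which is a reasonable trigger for falling back to the refusal message or logging the trace for review.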

Evaluation with RAGAS-style metrics

If you do not have an eval set, you do not have a RAG system - you have a demo that nobody has measured. The four metrics that matter, copied from the RAGAS framework but easy to implement yourself:

  • Faithfulness. Of the claims in the answer, what fraction are supported by the retrieved chunks? Catches hallucination.
  • Context recall. Of the ground-truth supporting spans, what fraction were present in the retrieved chunks? Catches retrieval misses.
  • Context precision. How concentrated are the relevant chunks at the top of the retrieved list? Catches noisy retrieval that dilutes the prompt.
  • Answer relevancy. Does the answer actually address the question, or did the model wander? Catches generation drift.
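The two retrieval metrics in the list above need no LLM judge. A minimal sketch, assuming relevance is labelled per source ID: recall is plain set overlap, and precision uses a rank-aware average (the mean of precision@k at each relevant position), which is one common formulation and an assumption here rather than a definition taken from this article.

```typescript
// Fraction of the labelled relevant items that retrieval actually returned.
function contextRecall(retrieved: string[], relevant: string[]): number {
  if (relevant.length === 0) return 1; // nothing to find, nothing missed
  const got = new Set(retrieved);
  return relevant.filter((id) => got.has(id)).length / relevant.length;
}

// Rank-aware precision: rewards relevant items that sit near the top of the list.
function contextPrecision(retrieved: string[], relevant: string[]): number {
  const rel = new Set(relevant);
  let hits = 0;
  let sum = 0;

  retrieved.forEach((id, index) => {
    if (rel.has(id)) {
      hits += 1;
      sum += hits / (index + 1); // precision@k at this relevant position
    }
  });

  return hits === 0 ? 0 : sum / hits;
}

// Hypothetical IDs: "a" and "b" are relevant and retrieved, "c" was missed.
const retrievedIds = ["a", "x", "b", "y"];
const relevantIds = ["a", "b", "c"];

console.log(contextRecall(retrievedIds, relevantIds).toFixed(3));    // 0.667
console.log(contextPrecision(retrievedIds, relevantIds).toFixed(3)); // 0.833
```

Moving "b" from position 3 to position 2 would raise precision to 1.0 without changing recall, which is the kind of change a reranker is meant to produce.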

Here is a minimal harness that scores a labelled set using an LLM judge. Build the labels once - 50 to 200 queries with the expected answer and the source URL it should come from - and run this on every meaningful change to your retrieval stack.

// scripts/eval.ts
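import OpenAI from "openai";
import { readFile } from "node:fs/promises";
import { answer } from "../lib/answer";

const openai = new OpenAI();

type EvalCase = {
  query: string;
  expectedAnswer: string;
  expectedSourceUrls: string[];
};

// Node's fs API rather than Bun.file, to match the Node 20+ runtime from the setup.
// Top-level await requires running this script as an ES module.
const cases: EvalCase[] = JSON.parse(
  await readFile("eval/cases.json", "utf8")
);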

const JUDGE_SYS = `You are a strict evaluator. Score the candidate answer 0-1
on three axes vs the expected answer, returning JSON:
{ "faithfulness": number, "answer_relevancy": number, "factual_match": number }`;

let totals = { faithfulness: 0, recall: 0, precision: 0, relevancy: 0 };

for (const c of cases) {
  const result = await answer(c.query);
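  const retrievedUrls = result.sources.map((s) => s.url);
  // Recall: share of expected sources that made it into the prompt.
  const found = c.expectedSourceUrls.filter((u) => retrievedUrls.includes(u));
  const recall = c.expectedSourceUrls.length ? found.length / c.expectedSourceUrls.length : 1;
  // Precision: share of retrieved chunks that come from an expected source.
  // The guard covers the refusal path, where nothing was retrieved.
  const relevant = retrievedUrls.filter((u) => c.expectedSourceUrls.includes(u));
  const precision = retrievedUrls.length ? relevant.length / retrievedUrls.length : 0;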

  const judge = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: JUDGE_SYS },
      {
        role: "user",
        content: `Q: ${c.query}\nEXPECTED: ${c.expectedAnswer}\nCANDIDATE: ${result.text}`,
      },
    ],
  });
  const scores = JSON.parse(judge.choices[0].message.content ?? "{}");

  totals.faithfulness += scores.faithfulness ?? 0;
  totals.relevancy += scores.answer_relevancy ?? 0;
  totals.recall += recall;
  totals.precision += precision;
}

const n = cases.length;
console.log({
  faithfulness: totals.faithfulness / n,
  context_recall: totals.recall / n,
  context_precision: totals.precision / n,
  answer_relevancy: totals.relevancy / n,
});

Commit the eval set to the repo. Run it in CI. Block merges that drop any of the four metrics by more than a threshold (I use 3 absolute percentage points). This is the only way to refactor a retrieval stack without lighting it on fire.
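The merge gate itself is a few lines. A minimal sketch, assuming the eval script writes its averages to a JSON object and a baseline from the main branch is available for comparison; the function name is hypothetical, and apart from the recall and precision figures quoted later in this post the numbers are made up.

```typescript
type Metrics = Record<string, number>;

// Returns the names of metrics that dropped by more than maxDrop (absolute),
// or that are missing from the current run altogether.
function findRegressions(baseline: Metrics, current: Metrics, maxDrop = 0.03): string[] {
  return Object.keys(baseline).filter((name) => {
    const before = baseline[name] ?? 0;
    const after = current[name];
    return after === undefined || before - after > maxDrop;
  });
}

const baselineRun: Metrics = {
  faithfulness: 0.9,
  context_recall: 0.84,
  context_precision: 0.71,
  answer_relevancy: 0.88,
};
const currentRun: Metrics = {
  faithfulness: 0.91,
  context_recall: 0.79, // dropped 5 points: blocks the merge
  context_precision: 0.7, // dropped 1 point: within tolerance
  answer_relevancy: 0.88,
};

console.log(findRegressions(baselineRun, currentRun)); // [ 'context_recall' ]
```

In CI you would exit with a non-zero status when the returned list is not empty, and refresh the baseline only on merges to the main branch.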

The 5 failure modes I hit in production (OmniAPI case study)

OmniAPI generates working functions from natural-language descriptions, grounded in a corpus of API specs, SDK examples, and schema fragments. It is RAG end-to-end, and shipping it taught me that the failure modes nobody warns you about are concentrated in a few specific places.

1. Bad chunking destroys retrieval

The first version of OmniAPI used naive 1000-character chunks with no overlap. Queries asking about authentication kept retrieving the body of the auth section but missing the section header - so chunks looked topically related to the model but lacked the keyword grounding the prompt expected. Fixing chunking to be markdown-aware (h2/h3 boundaries) and prefixing each chunk with its heading path lifted my retrieval recall from 0.61 to 0.84 on the eval set in one afternoon. It was the single highest-leverage change I made.

2. A single embedding model never knows your jargon

Embedding models are trained on the open web. They have never seen your internal product names, schema fields, or domain shorthand. In OmniAPI's corpus, abbreviations like "PKCE" or "mTLS" would get embedded near completely unrelated cryptography terms, and queries containing them would surface plausible-sounding but wrong chunks. The fix was hybrid retrieval - BM25 finds the exact term, dense retrieval finds the surrounding concept, the reranker decides who wins. Hybrid plus rerank moved context precision from 0.42 to 0.71.

3. The 73% rule - retrieval is most of your bug surface

I went back through three months of OmniAPI bug reports and labelled each "the model is wrong" complaint by root cause. 73% were retrieval failures: the right chunk was not in the prompt, so the model had to guess. 19% were prompt issues (citation format, refusal behaviour). Only 8% were genuine generation failures where the model had the right chunks and still produced a wrong answer. The lesson: when a RAG system answers badly, fix retrieval before you touch the prompt. When you think the model is hallucinating, it is almost certainly retrieving the wrong thing.

4. Context window pollution

I assumed more chunks would always help. They do not. Past about 8 chunks, generation quality on OmniAPI starts to degrade - the model gets distracted by tangentially related passages and weaves them into the answer. The sweet spot for grounded generation is 4 to 8 reranked chunks. If you cannot answer with 8 of your best chunks, the problem is upstream (chunking or retrieval), not the chunk budget. More context is the lazy fix that makes things worse.
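The chunk budget can be enforced as a small selection step between the reranker and the prompt. This is a sketch under assumptions: the cap of 8 comes from the text above, while the `minScore` floor of 0.3 is a hypothetical value you would calibrate against your own reranker's score distribution.

```typescript
// Keep at most maxChunks of the best-scoring chunks and drop anything below minScore.
function selectContext<T extends { score: number }>(
  reranked: T[],
  maxChunks = 8,
  minScore = 0.3 // hypothetical floor; tune per reranker
): T[] {
  return reranked
    .slice() // do not mutate the caller's array
    .sort((a, b) => b.score - a.score)
    .filter((chunk) => chunk.score >= minScore)
    .slice(0, maxChunks);
}

// Eleven hypothetical reranked chunks in arbitrary order.
const rerankedSample = [0.9, 0.2, 0.8, 0.7, 0.6, 0.5, 0.45, 0.4, 0.35, 0.32, 0.31].map(
  (score, id) => ({ id, score })
);

const picked = selectContext(rerankedSample);
console.log(picked.length); // 8
```

An empty result is a signal in its own right: it is the case where the refusal path should fire instead of sending weak context to the model.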

5. Stale index drift

Documentation changes. Schemas change. Six weeks into running OmniAPI I noticed answers referencing parameters that had been renamed in the underlying SDKs. The cause was a manual nightly reindex that had silently been failing for a week. The fix was a change-data-capture pattern: every source document write enqueues a re-embed job for just the affected chunks, plus a health endpoint that asserts the latest document timestamp is within the expected window. Treat the vector store like any other derived data - staleness is a bug, and you need monitoring that catches it.
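The freshness assertion behind such a health endpoint is simple arithmetic. A minimal sketch, assuming the latest timestamp comes from something like `SELECT max(updated_at) FROM documents` and that the allowed window is a value you pick; the function name and the 36-hour window are hypothetical.

```typescript
type Freshness = { fresh: boolean; ageMs: number };

// Compares the newest document timestamp against an allowed staleness window.
function checkIndexFreshness(latestUpdatedAt: Date, now: Date, maxAgeMs: number): Freshness {
  const ageMs = now.getTime() - latestUpdatedAt.getTime();
  return { fresh: ageMs <= maxAgeMs, ageMs };
}

const HOUR_MS = 60 * 60 * 1000;

// A reindex that has been failing for a week, checked against a 36-hour window.
const status = checkIndexFreshness(
  new Date("2026-01-01T00:00:00Z"),
  new Date("2026-01-08T00:00:00Z"),
  36 * HOUR_MS
);
console.log(status); // { fresh: false, ageMs: 604800000 }
```

A route handler would return a non-200 status when `fresh` is false so that ordinary uptime monitoring picks it up. A corpus that legitimately goes quiet for longer than the window needs a heartbeat write or a wider window, otherwise this check will raise false alarms.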

Cost and latency budgeting

Production RAG has a predictable cost shape per query. Here is the breakdown for OmniAPI's current stack - text-embedding-3-large for embeddings, hybrid retrieval over pgvector, Cohere rerank-v3.5, and GPT-4o-mini for grounded generation:

| Step | p95 latency | Cost / query | Notes |
| --- | --- | --- | --- |
| Query embedding | ~120 ms | ~$0.00002 | One embedding of the user query |
| Hybrid SQL (vector + BM25) | ~80 ms | ~$0 | HNSW + GIN, both in parallel |
| Cohere rerank (24 → 6) | ~280 ms | ~$0.001 | Cross-encoder on top candidates |
| GPT-4o-mini generation | ~900 ms | ~$0.001 | ~4K input tokens, ~200 output |
| Total | ~1.4 s | ~$0.002 | Streaming output to user starts at ~500 ms |

Swap GPT-4o-mini for GPT-4o and the cost moves to roughly $0.012 per query at ~2.5 s p95. Add multi-query rewriting and you double generation cost again. Worth understanding before you commit to a model tier - AI costs scale directly with usage, and what looks cheap at 100 queries per day costs a fortune at 100 queries per minute. If you are early and deciding whether the math even works, my AI MVP cost breakdown covers the full picture.
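To make the scaling concrete, here is the arithmetic as a short sketch. The per-step costs are the approximate figures from the breakdown above; the two traffic levels are the ones named in the paragraph, and the variable names are illustrative.

```typescript
// Approximate per-step costs in USD, taken from the breakdown above.
const stepCosts: { step: string; usd: number }[] = [
  { step: "query embedding", usd: 0.00002 },
  { step: "hybrid SQL", usd: 0 },
  { step: "rerank", usd: 0.001 },
  { step: "generation (GPT-4o-mini)", usd: 0.001 },
];

const costPerQuery = stepCosts.reduce((sum, s) => sum + s.usd, 0);

function dailyCost(usdPerQuery: number, queriesPerDay: number): number {
  return usdPerQuery * queriesPerDay;
}

const MINUTES_PER_DAY = 60 * 24;
const lowTraffic = dailyCost(costPerQuery, 100); // 100 queries per day
const highTraffic = dailyCost(costPerQuery, 100 * MINUTES_PER_DAY); // 100 queries per minute

console.log(`per query: $${costPerQuery.toFixed(5)}`);
console.log(`100/day:   $${lowTraffic.toFixed(2)} per day`);
console.log(`100/min:   $${highTraffic.toFixed(2)} per day`);
```

With these inputs the same stack goes from roughly $0.20 per day to roughly $291 per day, a factor of 1,440, before any move to a larger generation model.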

When NOT to use RAG

RAG is not always the answer. Skip it when:

  • The corpus fits in context every time. A 30-page handbook used by 20 users? Dump it into the system prompt and cache it. The plumbing is not worth it.
  • The task is style or format, not knowledge. Want the model to write in your brand voice? That is fine-tuning or few-shot prompting, not retrieval.
  • The answer comes from a live system, not a document. Order status, inventory, account balance - that is tool calling against an API, not RAG over a snapshot.
  • The problem is reasoning over a small structured dataset. Hand the model the rows directly or generate SQL. RAG over tabular data is almost always worse than just giving the LLM the table.

Most real systems are hybrids. A good AI chatbot does RAG for support content, tool calling for live data, and long-context for pricing pages. The architecture choice is per-task, not per-product.

The production checklist

Before a RAG system goes near a real user, walk this list. Every item is something that bit me or someone I shipped with.

  • Eval set committed to the repo. 50+ labelled queries with expected answers and source URLs. CI runs it on every PR that touches retrieval or prompts.
  • Inline citations. Every factual claim in the answer ties to a numbered source the user can click.
  • Refusal path. "I don't know" is a first-class response when retrieval comes back empty or low-score.
  • Observability per query. Log the query, retrieved chunk IDs, rerank scores, prompt, completion, latency, and cost. You will need this the first time someone reports a bad answer.
  • Change-data-capture reindex. Document writes enqueue re-embed jobs. No nightly full reindex unless the model changed.
  • Cost alerts. Per-tenant daily token budgets with hard caps. One bad actor or one runaway client can 100x your spend.
  • Prompt injection defenses. Retrieved chunks are untrusted input. Sandwich them with system instructions, strip instruction-like syntax, and never let retrieved text override the system prompt.
  • Tenant isolation in retrieval. Every SQL query filters by tenant ID. Tested. Not just assumed.
  • Streaming responses. Time-to-first-token under 500 ms is the difference between "feels fast" and "feels broken."
  • Human-in-the-loop review for high-stakes answers. Confidence threshold below X routes to a person, not the user.
  • Fallback to non-RAG path. When retrieval times out or the LLM is down, degrade to search results or a contact form - not a 500 page.
  • Feedback capture. Thumbs up/down on every answer, stored with the full query trace. Feeds the next eval set update.
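Of the items above, the hard cap on per-tenant spend is the one that most benefits from being explicit in code. A minimal in-memory sketch: the class name and the cap value are hypothetical, and a real deployment would keep the counters in a shared store so every app instance sees the same totals.

```typescript
// Tracks tokens used per tenant for the current day and refuses requests over the cap.
class TenantBudget {
  private readonly dailyTokenCap: number;
  private readonly used = new Map<string, number>();

  constructor(dailyTokenCap: number) {
    this.dailyTokenCap = dailyTokenCap;
  }

  // Returns true and records the usage if the tenant stays within budget.
  tryConsume(tenantId: string, tokens: number): boolean {
    const current = this.used.get(tenantId) ?? 0;
    if (current + tokens > this.dailyTokenCap) return false; // hard cap: reject, do not record
    this.used.set(tenantId, current + tokens);
    return true;
  }

  remaining(tenantId: string): number {
    return this.dailyTokenCap - (this.used.get(tenantId) ?? 0);
  }

  // Call once per day, for example from a scheduled job.
  resetAll(): void {
    this.used.clear();
  }
}

const budget = new TenantBudget(1_000); // hypothetical cap, tiny for illustration
console.log(budget.tryConsume("acme", 600)); // true
console.log(budget.tryConsume("acme", 500)); // false, would exceed the cap
console.log(budget.remaining("acme")); // 400
```

Checking the budget before retrieval means a capped tenant costs you nothing further, not even the query embedding.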

Most of these are obvious in hindsight and absent from most RAG tutorials. If you want help wiring a production system end-to-end, my AI integration work covers exactly this scope, and my AI agent development work covers the cases where the retrieval layer needs planning, tools, and reflection on top. I work with teams worldwide, and you can also hire an AI developer in Kosovo directly.

Frequently asked questions

What is RAG architecture in simple terms?

RAG (retrieval augmented generation) is a system that retrieves relevant chunks from your own data at query time and hands them to a language model as context, so the model answers from your sources instead of guessing from training data. In production it has four parts: ingestion and chunking, embedding and vector storage, retrieval (usually hybrid plus rerank), and grounded generation with citations.

Do I need RAG or can I just use a long context window?

Use long context when your corpus fits in the window every request, is fairly small, and you can pay the per-token cost. Use RAG when the corpus is larger than the window, changes often, needs citations, or when most queries only touch a tiny slice of the data. RAG is cheaper, faster, and more auditable once you have more than a few hundred pages.

What is the best chunk size for RAG?

There is no universal best chunk size, but a good default for prose and documentation is 500 to 800 tokens with 80 to 120 tokens of overlap. For code, chunk along function or class boundaries. For tables, keep the row plus header together and never split a table mid-row.

Do I need a reranker?

If your corpus is more than a few hundred documents or your users use jargon different from your source text, yes. Embedding-only retrieval will pick semantically similar but topically wrong chunks. A cross-encoder reranker like Cohere rerank-v3.5 or a BGE reranker lifts retrieval precision substantially for a few hundred extra milliseconds per query.

Postgres pgvector or a dedicated vector DB like Qdrant?

Start with pgvector if you already run Postgres. It scales to millions of vectors with HNSW indexes, supports hybrid search with tsvector in one query, and you avoid an extra moving part. Move to Qdrant, Weaviate, or Pinecone once you need filtering at high QPS, multi-tenant isolation at scale, or specialized features like quantization at billions of vectors.

How do I evaluate a RAG system?

Build a labelled set of 50 to 200 real queries with expected answers and supporting source spans, then score each query on four axes: faithfulness (did the answer stick to retrieved context), context recall (did retrieval surface the right chunks), context precision (how much noise came with the right chunks), and answer relevancy (did the answer actually address the question). The RAGAS library, a homegrown script in the same style, or a custom LLM judge all work.

How much does a production RAG query cost?

A typical query with text-embedding-3-large for the user query, 24 candidates retrieved and reranked to 6, and GPT-4o-mini for generation runs about $0.002 per query at p95 latency around 1.4 seconds. Swapping in GPT-4o for generation lands around $0.012 to $0.020 per query at 2 to 3 seconds p95.

How often should I reindex the vector store?

It depends on how fast your source changes. Static documentation can run a nightly batch. Anything user-generated or business-critical should use change-data-capture: on every document write, enqueue a re-embed job for just the affected chunks. Full reindex is rarely needed unless you change the embedding model.
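Re-embedding "just the affected chunks" comes down to a diff between what is stored and what the chunker now produces. A minimal sketch, assuming chunks are compared by their full content; in practice you would more likely compare a stored hash per chunk. The types and function name are hypothetical.

```typescript
type StoredChunk = { id: number; content: string };
type ReembedPlan = { toEmbed: string[]; toDelete: number[]; unchanged: number };

// Decides which new chunks need an embedding call and which stored rows are obsolete.
function planReembed(existing: StoredChunk[], nextContents: string[]): ReembedPlan {
  // Group stored chunk IDs by content so identical text can be reused.
  const idsByContent = new Map<string, number[]>();
  for (const chunk of existing) {
    const ids = idsByContent.get(chunk.content) ?? [];
    ids.push(chunk.id);
    idsByContent.set(chunk.content, ids);
  }

  const toEmbed: string[] = [];
  let unchanged = 0;
  for (const content of nextContents) {
    const ids = idsByContent.get(content);
    if (ids && ids.length > 0) {
      ids.shift(); // reuse this stored row and its embedding
      unchanged += 1;
    } else {
      toEmbed.push(content);
    }
  }

  // Whatever was not reused no longer exists in the source document.
  const toDelete: number[] = [];
  idsByContent.forEach((ids) => toDelete.push(...ids));
  toDelete.sort((a, b) => a - b);

  return { toEmbed, toDelete, unchanged };
}

const plan = planReembed(
  [
    { id: 1, content: "A" },
    { id: 2, content: "B" },
    { id: 3, content: "C" },
  ],
  ["A", "B (edited)", "C", "D"]
);
console.log(plan); // { toEmbed: [ 'B (edited)', 'D' ], toDelete: [ 2 ], unchanged: 2 }
```

Here one edited section and one new section cost two embedding calls instead of four. Reused rows may still need their `ord` updated if sections moved.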