AI SaaS Architecture: Patterns from 5 Shipped Products
By Ergini, Software & AI Developer in Pristina, Kosovo
TL;DR
Every AI SaaS converges on the same 8 architectural decisions. Here they are - multi-tenancy, rate limits, model routing, evals, BYO keys, cost attribution, observability, prompt versioning - with the patterns I use in production.
Five shipped AI SaaS products in, the pattern is hard to miss. The landing pages are different. The pricing is different. The market is different. The architecture converges. There are roughly 8 decisions every AI SaaS makes - usually badly the first time, then well after the second outage - and the products that survive are the ones that got the architecture right early enough that the feature work stayed cheap. This post is what those 8 decisions are, how I make them now, and the minimum viable stack I reach for on every new build.
If you have not shipped any SaaS yet, the SaaS MVP tech stack post covers the non-AI scaffolding - auth, billing, hosting - that sits underneath everything here. Read it first if you are starting from zero; the architecture below assumes those pieces are already wired.
The 8 decisions every AI SaaS converges on
Before the code, the TL;DR of what this post argues. Every AI SaaS I have shipped or audited makes the same 8 architectural calls. Get them right and your feature work stays cheap. Get them wrong and you end up rewriting your data layer at 50 tenants and your routing layer at 500. The decisions:
- Multi-tenancy model. Row-level by default; per-tenant prompt versioning and per-tenant vector namespaces from day one.
- Per-user rate limits and cost ceilings. Daily dollar caps before request caps. The bill is what kills you.
- Model routing. Cheap model for the 80% of easy traffic, expensive for the hard 20%. Router itself must be tiny.
- Eval pipeline in CI. Bad prompts get blocked at merge, not after a customer complains.
- BYO-key support. Optional, but you need to know when to offer it and how to store the key without becoming a liability.
- Cost attribution per user and tenant. Per-call telemetry that maps directly to who you bill.
- Prompt versioning and rollback. Prompts are code. Treat them like code - git, review, rollback.
- Failure isolation. One tenant's bad inputs must not break or even slow down anyone else.
The rest of this post walks each one with the pattern I actually use, the failure mode it prevents, and where I have seen it bite real clients.
Decision 1: Multi-tenancy
There are two real choices: schema-per-tenant or row-level isolation with a tenant_id column on every relevant table. For 95% of AI SaaS the answer is row-level. Schema-per-tenant only earns its operational weight when you have a regulatory requirement that forces it (single-tenant data residency) or you are selling five-figure-a-month enterprise where each customer has bespoke schemas. For everything else, row-level wins on simplicity, migration speed, and cross-tenant analytics.
The AI-specific multi-tenancy decisions are the ones that bite. Two of them:
- Per-tenant prompt versioning. Enterprise tenants ask for custom prompts within six months. If you have hardcoded a single global prompt, the migration to per-tenant prompts touches every call site. Build the prompt loader to take a tenant_id from the start and default to the global version when the tenant has not overridden anything.
- Per-tenant vector store namespace. Sharing embeddings across tenants is the most common AI SaaS data leak I see. A single mis-scoped query returns another tenant's chunks. Use a namespace per tenant (Pinecone), a collection per tenant (Qdrant), or a tenant_id column with a NOT NULL filter on every query (pgvector). Test the isolation with a failing assertion in CI that runs a query as tenant A and verifies no tenant B rows come back.
A workable schema for the AI run log that everything else hangs off:
-- core multi-tenant AI tables
create table tenants (
id uuid primary key,
name text not null,
plan text not null default 'free',
daily_cost_cap_usd numeric not null default 5.00,
created_at timestamptz default now()
);
create table prompts (
id uuid primary key,
tenant_id uuid references tenants(id), -- null = global default
name text not null,
version int not null,
body text not null,
created_at timestamptz default now(),
unique (tenant_id, name, version)
);
create table runs (
id uuid primary key,
tenant_id uuid not null references tenants(id),
user_id uuid not null,
feature text not null,
prompt_id uuid references prompts(id),
model text not null,
input_tokens int not null,
output_tokens int not null,
cost_usd numeric not null,
latency_ms int not null,
created_at timestamptz default now()
);
create index on runs (tenant_id, created_at desc);
create index on runs (tenant_id, user_id, created_at desc);Three tables and two indexes. Every AI SaaS I have shipped has some version of these - the prompts table is what makes versioning and rollback possible, and the runs table is what makes cost attribution and observability possible. Build them in the first migration. Going back later to backfill tenant_id on a production runs table is the kind of thing that ruins a quarter.
Decision 2: Per-user rate limits and cost ceilings
Request-rate limits are not enough. A single user looping a script against your endpoint can run up $400 in spend before your rate limiter trips - because each request costs $0.05 and the limiter only counts requests, not dollars. The pattern that actually keeps you solvent is layered:
- Request-rate limit per user. 60 to 600 per minute, depending on the feature. Stops scripted abuse.
- Daily cost ceiling per user. Soft cap (warn) at 80% of plan limit, hard cap (block) at 100%.
- Daily cost ceiling per tenant. Sum across all users in the tenant; same soft/hard split. One tenant's bad user must not drain the tenant's budget either.
- Per-feature ceiling. Expensive features (agentic flows, long-context generations) get their own narrower cap so one feature cannot starve the rest.
A minimal middleware that enforces all four against Postgres runs:
// lib/spend-guard.ts
import { sql } from "@/lib/db";
export async function assertWithinBudget(opts: {
tenantId: string;
userId: string;
feature: string;
estimatedCostUsd: number;
}) {
const { tenantId, userId, feature, estimatedCostUsd } = opts;
const [tenant] = await sql`
select daily_cost_cap_usd from tenants where id = ${tenantId}
`;
if (!tenant) throw new Error("tenant_not_found");
const [{ spent }] = await sql`
select coalesce(sum(cost_usd), 0) as spent
from runs
where tenant_id = ${tenantId}
and created_at > now() - interval '1 day'
`;
if (Number(spent) + estimatedCostUsd > Number(tenant.daily_cost_cap_usd)) {
throw new Error("tenant_daily_cap_exceeded");
}
const [{ user_spent }] = await sql`
select coalesce(sum(cost_usd), 0) as user_spent
from runs
where tenant_id = ${tenantId}
and user_id = ${userId}
and created_at > now() - interval '1 day'
`;
const userCap = Number(tenant.daily_cost_cap_usd) / 5; // policy: per-user is 20% of tenant
if (Number(user_spent) + estimatedCostUsd > userCap) {
throw new Error("user_daily_cap_exceeded");
}
// feature-specific cap (read from config)
// ...
}The estimate is the part people skip. Calling the model first and then noticing you blew the cap costs you the call. Compute the worst-case cost (max output tokens at the routed model's rate) before the call, fail fast if it overruns, and log the actual cost after. For the deeper math on per-call pricing, my OpenAI API cost breakdown has the rate tables and the patterns that cut bills 60%.
Decision 3: Model routing
The 80/20 rule is the law of AI SaaS economics. 80% of requests are easy and a small model handles them; 20% are hard and need a stronger model. Routing every request to the strong model is the most common reason an AI SaaS has a margin problem. Routing every request to the cheap model is the most common reason quality complaints stack up. The router itself has to be cheap to compute, or you eat the savings on routing overhead.
Three routing strategies, in order of complexity:
- Heuristic router. Token count, presence of code blocks, feature flag - a deterministic switch that picks the model. Zero LLM overhead. Works for the majority of cases.
- Classifier router. A tiny classifier (small model, fixed prompt, JSON output) labels the request as easy or hard. 100 to 300 ms overhead. Worth it when heuristics keep mis-routing.
- Self-judge fallback. Run the cheap model, have it score its own confidence, retry on the strong model if confidence is below threshold. Slowest but most accurate; reserve for high-stakes features.
// lib/route-model.ts
type Tier = "cheap" | "strong";
export function pickModel(input: {
feature: string;
inputTokens: number;
userPlan: "free" | "pro" | "enterprise";
forceTier?: Tier;
}): { model: string; tier: Tier } {
if (input.forceTier === "strong") return { model: "gpt-5", tier: "strong" };
// enterprise plan always gets the strong model on premium features
if (input.userPlan === "enterprise" && input.feature === "report") {
return { model: "gpt-5", tier: "strong" };
}
// long-context requests must go to the strong model
if (input.inputTokens > 20_000) {
return { model: "gpt-5", tier: "strong" };
}
// hard features that empirically need the strong model
const HARD_FEATURES = new Set(["sql_generation", "code_review", "legal_summary"]);
if (HARD_FEATURES.has(input.feature)) {
return { model: "gpt-5", tier: "strong" };
}
// default: cheap
return { model: "gpt-5-mini", tier: "cheap" };
}Note what the router does not do: it does not call any model. The decision is made in microseconds from request metadata. The classifier-router and self-judge patterns are upgrades you bolt on only when this heuristic version starts mis-routing badly enough that an eval drop is visible.
Decision 4: Eval pipeline in CI
Prompts are code. The reason most AI SaaS teams ship bad prompts is the same reason teams without CI ship bad code: nothing checked the change against a known-good behavior before merge. An eval pipeline in CI fixes this. The shape:
- A versioned
prompts/directory in the repo, one file per prompt, with the prompt body and the metadata. - A versioned
evals/directory with a fixed eval set per prompt - inputs, expected outputs, and the metrics that matter for this prompt (accuracy, faithfulness, refusal rate). - A GitHub Action that runs the eval set against any changed prompt on every PR, scores against thresholds, and posts the diff as a PR comment.
- A merge gate that blocks the PR if any metric regresses by more than a configured slack (typically 2 percentage points).
The framework choice matters less than the discipline. DeepEval, Braintrust, RAGAS, and Promptfoo all do the job; pick by which one plugs into your CI fastest. The breakdown of which fits which workload is in my LLM evaluation framework comparison. The decision to ship eval-in-CI at all is the one that separates products that keep getting better from products that plateau and ship slow regressions for years.
Pair this with LLM observability tools in production - Helicone or Langfuse - so the eval set itself can grow from real failures sampled out of production rather than synthetic cases written once and never updated.
Decision 5: BYO-key support
Bring-your-own-key lets a user plug in their own OpenAI or Anthropic key and have all their requests billed to that key instead of yours. It is a real feature for the right buyer and a trap for the wrong one. The decision matrix:
| Offer BYO-key when | Skip BYO-key when |
|---|---|
| Your users are technical (developers, ops, data teams) | Your users are non-technical or consumer |
| Your margins on managed keys are thin or negative | Your managed-key margins are healthy (50%+) |
| Enterprise buyers ask for it in security reviews | Onboarding friction would drop conversion materially |
| You sell into regulated industries (finance, healthcare) | Your product depends on prompt caching across tenants |
If you do offer it, the storage matters more than the feature itself. The minimum bar: encrypt keys at rest with a per-tenant envelope key, never log the key in plaintext anywhere, and rotate the encryption keys on a schedule. Use a hosted KMS - AWS KMS, Google Cloud KMS, or Vercel's integrations with managed secret stores - rather than rolling your own. A leaked customer API key is a much bigger incident than a leaked customer password, because the customer pays for it in dollars.
Decision 6: Cost attribution per user and tenant
Every model call has to be tagged on the way out and accounted on the way back. The metadata that has to land in the runs table:
- tenant_id - who pays.
- user_id - who triggered it.
- feature - which product feature called the model.
- prompt_id - which prompt version was used.
- model - which model handled the call.
- input_tokens, output_tokens - usage counters returned by the provider.
- cost_usd - computed from a model-price table, not hardcoded.
- latency_ms - end-to-end wall clock for the call.
That row is the atom of every dashboard you will ever want to build: per-tenant revenue/cost margin, per-feature cost contribution, per-model latency, per-user abuse detection. Skipping any one of these fields makes a whole class of question unanswerable until you backfill. Use Helicone or Langfuse if you do not want to build the rollup queries yourself; either way, the field schema is the same.
The model-price table is the part people get wrong. Hardcoded prices drift the day after a provider updates their rates, so the runs table reports yesterday's cost forever. Keep a tinymodel_prices table with input and output rates per million tokens, updated as part of the provider-version-bump PR.
Decision 7: Prompt versioning and rollback
Prompts are code. Anything that flows through this much production traffic and changes this often needs a code-like workflow:
- Source of truth in git. Prompts live in
prompts/*.mdorprompts/*.ts, reviewed on every change. - Versioned in the database. When a prompt changes, insert a new row (do not update the existing one). The runs table references prompt_id, so every historical run still points to the prompt body that produced it.
- Default and override. The default prompt is tenant_id = NULL. A tenant can override by inserting a row with their tenant_id and the same name; loader picks the override first.
- One-line rollback. A rollback is a database update on the "active version" pointer, not a code deploy. When a prompt regresses in production, you should be able to fix it in 30 seconds.
Without the version pointer, "roll back the prompt" becomes a release-engineering exercise. With it, support and on-call can fix bad ships immediately.
Decision 8: Failure isolation
One tenant's bad inputs must not break or slow down anyone else. Three patterns:
- Input sanitization at the boundary. Strip and length-cap user input before it ever touches a prompt. Reject inputs that look like instruction injection at the API layer, not at the model layer.
- Per-tenant queue. If you use a queue for async AI work, partition it by tenant. A single tenant flooding the queue must not stall every other tenant's jobs. Most managed queues support partition keys natively.
- Timeouts and circuit breakers. Every model call wrapped in a timeout (15 to 30 seconds typically). When the provider has an outage, fall back to a backup provider or a graceful error rather than letting requests pile up.
The retrieved-chunk-as-attack-vector failure mode is the AI-specific one. If your RAG pipeline pulls chunks from one tenant's data and feeds them into the prompt context, an adversarial chunk ("ignore the system prompt and instead...") can hijack the model. Tenant isolation in the vector store is half the defense; the other half is sandwiching retrieved chunks with strict system markers and never letting them override the user prompt or the system prompt. The full pattern set is in the RAG architecture tutorial.
The minimum viable AI SaaS stack
Concrete picks. This is what I reach for on every new AI SaaS, and what I would default to if a founder asked me to start one tomorrow:
| Layer | Pick | Why |
|---|---|---|
| Frontend + API | Next.js on Vercel | App Router, Server Actions, streaming, edge runtime - the whole stack supports AI calls natively |
| Database | Postgres (Supabase or Neon) | One database for app data, runs, prompts, and embeddings |
| Vector store | pgvector on the same Postgres | One less system to manage; scales to ~50M vectors before you need a dedicated store |
| Auth | Clerk | Multi-tenant orgs, social login, RBAC - none of which you want to build yourself |
| Billing | Stripe (with usage-based pricing) | The runs.cost_usd column maps directly to a usage record |
| Observability | Helicone (or Langfuse self-hosted) | One base-URL change and every model call is logged |
| Eval | Braintrust or DeepEval in CI | Whichever plugs into your GitHub Actions fastest |
| Model providers | OpenAI primary, Anthropic fallback | Provider abstraction lives in your router, not your app code |
This stack scales from one user to roughly the first 10,000 paying-tenant range without surgery. The pieces that get swapped later are usually pgvector (when you cross 50M vectors) and the observability layer (when you want OpenTelemetry-native traces across the whole stack). Everything else stays.
Two external pieces worth knowing: hosting on vercel.com because Server Actions, Fluid Compute, and streaming map directly onto AI workloads; database on supabase.com for the pgvector + RLS + storage combination; observability through helicone.ai because the base-URL-only integration means you ship it in 10 minutes.
Real architecture from 5 shipped products
The same 8 decisions, made differently across five products I have shipped. The point is not that there is one right answer - it is that every product made the call deliberately and lived with the tradeoff.
| Product | Tenancy | Routing | Vector store | BYO-key |
|---|---|---|---|---|
| Caldra (AI scheduling) | Row-level, per-user calendars | Heuristic - small model only | None (no RAG) | No |
| OmniAPI (function generator) | Row-level, per-tenant prompts | Heuristic + self-judge fallback | pgvector, namespaced per tenant | Yes (technical buyers) |
| Xandidate (AI screening) | Row-level, per-tenant rubrics | Strong model only (high stakes) | pgvector for job descriptions | No |
| DreamCurtains (design AI) | Row-level, consumer | Heuristic, cheap model only | None | No |
| Lindi (internal AI tooling) | Single-tenant initially | Classifier router across 3 models | Qdrant (heavy retrieval) | Internal keys only |
Notice the pattern: every product made the cheap call where the feature did not demand more. Caldra never needed a vector store. DreamCurtains never needed BYO-key. Xandidate never needed model routing because the cost of a wrong answer was high enough to justify the strong model on every request. The 8 decisions are always made; the answers should be appropriate to the product, not copy-pasted from a reference architecture.
Anti-patterns I keep seeing
Every one of these I have either hit myself or watched a client hit in the last year. They look harmless at small scale and turn into rewrites at the wrong time.
- Global hardcoded prompts. One prompt for all tenants, lives in a string constant. Migration cost when the first enterprise asks for a custom prompt: high. Fix: prompts table from commit one.
- No eval pipeline. Every prompt change ships blind. The first regression-driven outage is the moment the team realizes the gap; by then production has the bad prompt for a week. Fix: eval-in-CI before the first paid customer.
- Single-tenant scaling assumptions. Queries that do not filter by tenant_id work fine until tenant 50 and a customer sees another customer's data. Fix: row-level security policies in Postgres that enforce tenant_id at the database, not at the application.
- No cost attribution. The bill arrives, nobody can answer which feature or which tenant drove the spike. Fix: runs table with the eight fields from Decision 6, from commit one.
- Routing every request to the strong model.Margin problem. Quality is the same on 80% of traffic; you are paying 5x for the privilege of consistency. Fix: heuristic router from week one, refined as evals justify.
- Shared vector store across tenants. The data leak you do not see until it is in a support ticket. Fix: namespace per tenant, enforced with a CI assertion.
- No daily cost cap. One bug, one looped script, one bad actor - and the monthly bill is 100x baseline. Fix: per-tenant and per-user daily caps from launch.
- Prompts that cannot be rolled back. A bad ship becomes a 4-hour incident instead of a 30-second toggle. Fix: prompts versioned in the database with an active-version pointer.
For founders specifically, the deeper context on cost, architecture, and where to hire to get this built right lives in the AI integration and MVP development service pages. If you want a senior engineer to scope the build directly, you can also hire an AI developer in Kosovo or start at the homepage to see what I currently ship.
Frequently asked questions
What is AI SaaS architecture in plain terms?
AI SaaS architecture is the set of structural decisions that make a multi-tenant product safe to run when the core feature is an LLM call. It covers how tenants are isolated, how per-user spend is capped, how the right model is picked per request, how prompts ship like code, and how cost and quality are attributed back to each customer. The application code looks like any other SaaS; the architecture differs in the rate-limiting, routing, eval, and observability layers wrapped around the model call.
Do I need multi-tenancy on day one of an AI SaaS?
Yes, even if you only have three users. The two changes that matter on day one are a tenant_id on every row that touches AI artifacts (prompts, runs, embeddings) and a per-tenant namespace in the vector store. Retrofitting tenant isolation on a year-old AI SaaS with shared embeddings is the most expensive migration I get hired for. The auth layer can stay simple; the data layer has to be tenant-aware from commit one.
When should I offer BYO-key for my AI SaaS?
Offer BYO-key when your average user is technical enough to have an OpenAI key, your unit economics on managed keys are tight or negative, or you sell into regulated buyers who need their data on their own contract. Skip BYO-key when your users are consumer or non-technical, your margins on managed keys are healthy, or your product depends on prompt caching across tenants. BYO-key adds real complexity to billing, routing, and support - it is not free flexibility.
How do I attribute LLM cost back to individual users?
Tag every model call with tenant_id, user_id, feature, and model on the way out, log the prompt and completion token counts on the way back, and store them in a runs table you can roll up nightly. Helicone, Langfuse, and Braintrust all do this if you front your provider calls through them; rolling your own is also fine when you control the call site. The trap is computing cost from token counts using last quarter pricing - keep a model-price table you update when providers change rates.
How do model routing decisions work in production?
A router classifies the request - cheaply, either via a small model or a heuristic - then picks the cheapest model that can handle it. For 70 to 80% of traffic, a small fast model is enough. The remaining 20 to 30% routes to a stronger model when the classifier says the task is genuinely hard, when the cheap model self-judges its answer as low confidence, or when the user is on a paid tier that gets the premium model by default. The router itself has to be tiny - adding 800 ms to every call to save 200 ms downstream is a loss.
What does an eval pipeline look like in CI?
Each prompt or chain lives in a versioned file. A GitHub Action runs a fixed eval set against the updated prompt before merge, scores it against thresholds for accuracy, faithfulness, and refusal rate, and blocks the merge if any threshold regresses by more than a configured slack. The same harness runs nightly against a larger production-sampled set to catch slow drift. Without this, prompt changes ship blind and the only signal you get is the support ticket queue.
How do I prevent one bad tenant from breaking the system?
Per-tenant rate limits, per-tenant daily cost ceilings, and per-tenant queue isolation. The first stops abusive request volume, the second stops abusive spend, and the third stops one tenant filling the inference queue and starving everyone else. On top of those, sanitize inputs before they hit the model, and never let one tenant's retrieved chunks influence another tenant's prompt context. Failure isolation is mostly defensive code, not infrastructure.
What is the minimum viable AI SaaS stack in 2026?
Next.js on Vercel for the app, Postgres with pgvector for primary data and embeddings, Clerk for auth, Stripe for billing, and Helicone or Langfuse for observability. The model providers are whatever you route across - OpenAI, Anthropic, Google. Eight people out of ten can ship a working AI SaaS on that stack in under a month, and the parts you swap later are usually the vector store and the observability layer, not the core.