Founders · 10 min read

AI MVP Checklist: 30 Things to Decide Before You Code

By Ergini, Software & AI Developer in Pristina, Kosovo

TL;DR

Most AI MVPs fail at architectural decisions made in week 1. This is the 30-question doc I make every client fill out - and the patterns I see in the answers that predict whether the build ships.

Why the checklist matters

Most AI MVPs do not fail because the model was wrong. They fail because three or four architectural decisions made in week one - often without anyone noticing they were decisions - compound into a product that ships late, costs three times the projection, and quietly degrades the moment real users touch it.

I have come in to rescue enough of these to recognize the pattern. The founder hires a generalist who builds a single-provider chat wrapper. Three months later the provider has an outage, evals do not exist, the prompt has been edited 40 times without a regression test, costs are 2x projection because nothing is cached, and PII is being logged into a third-party observability tool that nobody read the DPA for. Each of those is a 30-second decision at week one and a multi-week rewrite at month four.

This is the 30-item checklist I run with every client before code starts. It is organized into seven parts. Each item should have a defensible one-paragraph answer in your spec doc. If you cannot answer one in three minutes, that is the item worth investigating first.

Part 1: Product (5 items)

1. What outcome does the AI produce?

Not the input - the output. Is it a draft, a decision, an action, a recommendation, a summary, a classification, a retrieval, or an extraction? Founders skip this because it feels obvious, but the answer determines the entire shape of the product. A draft has a review surface. A decision has an audit log. An action has a rollback. A classification has a labelled training set. Pick one primary outcome per surface; secondary outcomes go into a future-roadmap doc, not the MVP scope.

2. What is the failure mode tolerance?

AI systems are probabilistic. They will be wrong. What happens when they are? Three tolerance bands cover almost every product:

  • Low tolerance (medical, legal, financial, irreversible actions): every output requires human review. The AI is a drafting tool, not a decision tool.
  • Medium tolerance (CRM updates, email drafts, internal summaries): outputs can ship autonomously with logging and a one-click undo, but errors are visible and recoverable.
  • High tolerance (search ranking, suggestion ordering, copy variations): errors are invisible to users in any single instance and only matter in aggregate.

Most founders default to high-tolerance assumptions and ship into a low-tolerance domain. Pick the band honestly; it determines whether the remaining 28 items even apply.

3. Who is the user and what do they expect?

AI literacy varies wildly by audience. A developer using your API will tolerate a 4-second latency and an occasional refusal. A non-technical operations user will not. A consumer expects ChatGPT polish. A compliance officer expects an audit trail. Write the user persona in one sentence, then write the three things they will be most disappointed by if missing - those become non-negotiables.

4. What does "good" look like (3 examples)?

Write three input-output pairs that represent a perfect response. Not aspirational ones - realistic ones. These become the first three rows of your eval set, the reference points for prompt iteration, and the demo material for sales conversations. If the team cannot agree on three concrete examples of "good", the product is not yet specified enough to build.

5. What does the MVP NOT do?

Write the explicit non-goals. "The MVP does not support multi-language input." "The MVP does not retain conversation history past 24 hours." "The MVP does not send email autonomously." This list is the single best defence against scope creep mid-build. Anything not on the do-list and not on the do-not-list is the ambiguous middle that eats schedule.

Part 2: Model + provider (4 items)

6. Pick primary provider (OpenAI / Anthropic / Other)

Pick one as the default and write down why. The defensible reasons in 2026:

  • OpenAI: best general-purpose multimodal, strongest structured output reliability, broadest ecosystem.
  • Anthropic: best at long-context reasoning, tool use fidelity, and code-heavy workloads.
  • Google (Gemini): best at very long context with heavy document grounding and lowest cost per million tokens at high volume.
  • Open-weight (Llama, Qwen, DeepSeek via Together / Fireworks): best when latency, cost floor, or data residency rules out the frontier providers.

Do not pick three. Pick one and treat the second as a fallback, nothing more.

7. Pick fallback provider

Frontier providers all have multi-hour outages every quarter. Your product breaks during them unless you have an abstraction that can route to a second provider. The fallback does not need feature parity - it needs to be good enough to keep the product alive. The cleanest pattern is one TypeScript abstraction (or the Vercel AI SDK) that lets you swap providers with one environment variable. Decide the fallback now; the engineering cost is half a day at week one and two weeks at month six.
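One way to sketch the abstraction described above, assuming a hypothetical `ChatProvider` interface that each vendor SDK would be wrapped in. The names `ChatProvider`, `orderProviders`, and `completeWithFallback` are illustrative and do not come from the Vercel AI SDK or any vendor library; the primary ID would typically be read from an environment variable at startup.

```typescript
// Hypothetical wrapper interface: one implementation per vendor SDK.
interface ChatProvider {
  id: string;
  complete(prompt: string): Promise<string>;
}

interface Completion {
  text: string;
  servedBy: string; // which provider actually answered
}

// Pick primary and fallback from a config value (e.g. an env var read at startup).
function orderProviders(
  providers: ChatProvider[],
  primaryId: string,
): [ChatProvider, ChatProvider] {
  const primary = providers.find((p) => p.id === primaryId);
  const fallback = providers.find((p) => p.id !== primaryId);
  if (!primary || !fallback) {
    throw new Error(`need provider "${primaryId}" plus one fallback`);
  }
  return [primary, fallback];
}

async function completeWithFallback(
  primary: ChatProvider,
  fallback: ChatProvider,
  prompt: string,
): Promise<Completion> {
  try {
    return { text: await primary.complete(prompt), servedBy: primary.id };
  } catch {
    // Primary failed (outage, timeout, rate limit): keep the product alive.
    return { text: await fallback.complete(prompt), servedBy: fallback.id };
  }
}
```

Recording `servedBy` on every response is a small design choice that makes failovers visible in logs and cost reports instead of silent.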

8. Cost ceiling per request

Write a number. Not a vibe - an actual ceiling in dollars per request: $0.005, $0.05, $0.50. The number drives model choice, context budget, caching strategy, and whether you can offer a free tier at all. Without it you will discover the answer the month your first invoice arrives. The OpenAI API cost breakdown has the math for translating your ceiling into a model and context budget.
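The translation from ceiling to context budget is plain arithmetic. A minimal sketch, assuming per-million-token pricing; `TokenPricing`, `requestCostUsd`, and `maxInputTokens` are hypothetical names, and any prices you pass in should come from your provider's current price list rather than from this example.

```typescript
// Prices are expressed in USD per one million tokens.
interface TokenPricing {
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
}

// Cost of a single request in USD.
function requestCostUsd(
  inputTokens: number,
  outputTokens: number,
  pricing: TokenPricing,
): number {
  return (
    (inputTokens * pricing.inputUsdPerMillion +
      outputTokens * pricing.outputUsdPerMillion) /
    1_000_000
  );
}

// Largest input context that still fits under the ceiling for a fixed
// output budget. Returns 0 when the output alone exceeds the ceiling.
function maxInputTokens(
  ceilingUsd: number,
  outputTokens: number,
  pricing: TokenPricing,
): number {
  // USD per million tokens is numerically micro-USD per token, so working
  // in micro-USD avoids most floating-point drift.
  const ceilingMicroUsd = Math.round(ceilingUsd * 1_000_000);
  const remaining = ceilingMicroUsd - outputTokens * pricing.outputUsdPerMillion;
  if (remaining <= 0) return 0;
  return Math.floor(remaining / pricing.inputUsdPerMillion);
}
```

With placeholder prices of $2 input and $8 output per million tokens, a $0.05 ceiling and a 500-token output budget leave room for 23,000 input tokens.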

9. Latency budget (p95)

Write the worst acceptable latency for 95% of requests. A conversational UI tolerates 4 seconds. A search box tolerates 800ms. A background job tolerates 30 seconds. The p95 budget determines whether you can use a reasoning model, whether you need streaming, whether you can afford reranking, and whether long-context retrieval is even on the table. Write it down before architecture decisions get made on vibes.
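To check a budget against measured data, a nearest-rank percentile is enough. This is a sketch under the assumption that you already collect per-request latencies in milliseconds; `percentile` and `withinLatencyBudget` are hypothetical helper names.

```typescript
// Nearest-rank percentile over observed latencies (milliseconds).
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samplesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// True when 95% of requests finished within the budget.
function withinLatencyBudget(samplesMs: number[], budgetMs: number): boolean {
  return percentile(samplesMs, 95) <= budgetMs;
}
```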

Part 3: Data (4 items)

10. What data does the AI need to see?

Enumerate every input the model will touch: user message, prior conversation, system prompt, retrieved documents, tool results, user profile, account metadata. For each, write down whether it is always present, sometimes present, or only present in advanced flows. The enumeration usually reveals two or three inputs the team had not discussed yet - those are the ones that will surface as bugs in week six otherwise.

11. Where does it live (and can we touch it)?

For every input from item 10, write the source system, the API or connector that exposes it, the latency to fetch it, and whether access is already provisioned. The most common surprise: a data source the founder assumed was accessible turns out to require a three-month vendor security review. Better to discover that in week one than after the architecture is committed.

12. PII / compliance constraints

Three questions, each with a written answer:

  • Does the input contain PII, PHI, or financial data? If yes, your provider DPA and BAA requirements just narrowed your model list.
  • Are you covered by GDPR, HIPAA, SOC 2, or sector-specific regulation? Each one changes what you can log, where data can be processed, and how long you can retain it.
  • Do you need data residency in a specific region? EU customers increasingly require it; some US sectors do too.

Get this wrong and the product is illegal to sell to your target customer. Get this right and you have a moat against competitors who cut the corner.

13. Retention policy

Write down: how long do you store prompts, completions, embeddings, and any cached responses? Who can access them? Do they leak into your observability tool? Are they used by the provider for training (most providers default to no on API plans, but verify)? A defensible default is 30 days for raw logs, indefinite for aggregated metrics, and zero retention for any PII-laden field. The policy goes in the privacy notice, which goes live with the MVP.

Part 4: Evaluation (4 items)

14. 50-100 labelled examples committed to repo

The single highest-leverage practice in AI engineering. Fifty rows of input plus expected output (or a rubric) checked into the repo as a JSON or CSV file. The set grows over time, but the day-one minimum is fifty. Without it, every prompt change is a guess and every model swap is a leap of faith. The cost to build it is one focused afternoon. The cost of not having it is the rest of the project.

15. Pass/fail rubric for each example

Each example needs a deterministic or LLM-as-judge way to score it. Pure equality works for classification. Substring or regex works for extraction. LLM-as-judge with a written rubric works for free-form generation. The rubric is more important than the scorer - write it in English first, codify it second. The LLM evaluation framework post compares the tools that wrap this; the rubric work is the same regardless of tool.
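The three scorer types above can be sketched as a small tagged union. `EvalExample`, `Scorer`, `Judge`, and `scoreExample` are hypothetical names, not the API of any eval framework, and the judge is injected as a function so that the LLM call behind it stays an assumption of this sketch.

```typescript
// How a single example is scored.
type Scorer =
  | { kind: "exact"; expected: string } // classification
  | { kind: "regex"; pattern: string } // extraction
  | { kind: "judge"; rubric: string }; // free-form generation

// One row of the labelled set committed to the repo.
interface EvalExample {
  id: string;
  input: string;
  scorer: Scorer;
}

// A judge receives the written rubric and the model output and returns
// pass/fail. In practice it would wrap an LLM call.
type Judge = (rubric: string, output: string) => Promise<boolean>;

async function scoreExample(
  example: EvalExample,
  output: string,
  judge: Judge,
): Promise<boolean> {
  const s = example.scorer;
  if (s.kind === "exact") return output.trim() === s.expected;
  if (s.kind === "regex") return new RegExp(s.pattern).test(output);
  return judge(s.rubric, output); // remaining case: LLM-as-judge
}
```

Keeping the rubric as a plain string on the row means it is reviewed in the same pull request as the example it scores.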

16. Eval runs on every PR

The eval is wired into CI. Every pull request that touches a prompt, a model parameter, or any code in the AI path triggers the eval and reports the delta. This catches the silent regressions that otherwise show up as user complaints two weeks after a seemingly harmless prompt tweak. GitHub Actions plus your eval framework of choice gets you there in an afternoon.

17. Eval gate before deploy

The eval not only runs - it gates. A defined regression threshold (say, no more than 5% drop on any metric) blocks the deploy until an engineer acknowledges the regression. This is the difference between an eval that catches problems and an eval that prevents them.
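A gate of this kind reduces to comparing two score tables. The sketch below assumes each metric is a pass rate between 0 and 1 and reads "5% drop" as five percentage points; `MetricScores`, `GateResult`, and `evalGate` are hypothetical names.

```typescript
// Pass rate per metric, each in [0, 1].
interface MetricScores {
  [metric: string]: number;
}

interface GateResult {
  blocked: boolean;
  regressions: string[]; // metrics that dropped more than allowed
}

function evalGate(
  baseline: MetricScores,
  candidate: MetricScores,
  maxDrop = 0.05,
): GateResult {
  const regressions: string[] = [];
  for (const metric of Object.keys(baseline)) {
    const before = baseline[metric];
    // A metric missing from the candidate run counts as a full regression.
    const after = metric in candidate ? candidate[metric] : 0;
    // Small epsilon so a drop of exactly maxDrop is not flagged by rounding.
    if (before - after > maxDrop + 1e-9) regressions.push(metric);
  }
  return { blocked: regressions.length > 0, regressions };
}
```

In CI, a blocked result would fail the job until an engineer acknowledges the listed regressions.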

Part 5: Architecture (4 items)

18. Workflow vs agent

If you can draw your process as a flowchart, build a workflow. If the path branches based on judgment, build an agent. Workflows are cheaper, faster, more debuggable, and ship in days. Agents are more flexible, more expensive, and ship in weeks. The default for most MVPs should be workflow, occasionally with one agent-shaped node embedded inside it. There is a longer treatment in the AI workflow vs agent post - if you have not committed to the answer yet, read that before this item.

19. RAG / fine-tune / long-context / tool / hybrid

How does the model get the knowledge it needs?

  • RAG: default for any knowledge base over 50 pages or any case where freshness matters. The RAG architecture tutorial walks the full pipeline.
  • Long context: fits when the knowledge is under 200K tokens, slowly changing, and re-sent on every request. Simpler than RAG but pricier at scale.
  • Tools: when the knowledge lives in live APIs - calendar availability, current pricing, account state.
  • Fine-tune: rarely needed at MVP stage. Use only for tone or domain-language adaptation, never to add knowledge.
  • Hybrid: the production answer for most real products. RAG for the knowledge base, tools for the live data, maybe a small fine-tune for tone.

20. Streaming or batch

Streaming is for any user-facing interaction that takes longer than one second; batch is for everything else. The decision affects API choice, UI design, error handling, and how cancellations work. A surprising amount of AI infra (rate limiters, caches, queues) behaves differently in each mode. Pick now and design once.

21. Sync or async

Sync means the user waits. Async means the user is notified later. The right answer depends on the latency budget from item 9 and the complexity of the work. Any flow that includes more than two tool-call rounds, document retrieval against more than 10K documents, or any model in a reasoning mode should default to async with a notification. Sync is a UX choice that becomes an infra constraint; pick it deliberately.
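The default rule above is simple enough to write down as code, which keeps it from being re-argued per feature. A sketch with hypothetical names (`FlowProfile`, `defaultExecutionMode`); the thresholds are the ones stated in this item.

```typescript
interface FlowProfile {
  toolCallRounds: number;
  documentsSearched: number;
  usesReasoningMode: boolean;
}

// Returns the default mode; a team can still override it deliberately.
function defaultExecutionMode(flow: FlowProfile): "sync" | "async" {
  const heavy =
    flow.toolCallRounds > 2 ||
    flow.documentsSearched > 10_000 ||
    flow.usesReasoningMode;
  return heavy ? "async" : "sync";
}
```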

Part 6: Safety + HITL (4 items)

22. Where can the AI act autonomously vs needs approval

Draw a table. Columns: action types the AI can take. Rows: autonomous, requires approval, never permitted. Sending an internal Slack message might be autonomous; sending a customer-facing email requires approval; transferring money is never permitted. Every action the AI takes should land somewhere on the table, with the decision written down. This becomes the contract for the human-in-the-loop UI covered in the human-in-the-loop AI post.
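The table can live in code as well as in the spec. A minimal sketch using the three example actions from this item; the action identifiers, `Permission`, and `permissionFor` are hypothetical, and the deny-by-default rule for unlisted actions is an assumption of the sketch.

```typescript
type Permission = "autonomous" | "requires_approval" | "never";

// One entry per action type the AI can take.
const ACTION_POLICY = new Map<string, Permission>([
  ["send_internal_slack_message", "autonomous"],
  ["send_customer_email", "requires_approval"],
  ["transfer_money", "never"],
]);

// Actions missing from the table are denied, so a newly added tool
// cannot act until someone has written its decision down.
function permissionFor(action: string): Permission {
  return ACTION_POLICY.get(action) ?? "never";
}
```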

23. Prompt injection threat model

Three questions:

  • Can a user input text that becomes part of the prompt?
  • Can the model read untrusted external content (web pages, emails, PDFs)?
  • Does the model have tools that can take consequential action?

If any two answers are yes, you have real prompt injection exposure and need a threat model: separation between trusted instructions, untrusted content, and tool authorization. Skip this and the first adversarial user is a security incident.
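The two-of-three rule and the separation it calls for can be sketched as follows. This is one possible policy, not a complete defence: it assumes every prompt segment is tagged with its trust level and that consequential tool calls need human approval whenever untrusted text is in context. All names are hypothetical.

```typescript
interface ThreatAnswers {
  userTextReachesPrompt: boolean;
  readsUntrustedContent: boolean;
  hasConsequentialTools: boolean;
}

// The rule from the three questions: two or more "yes" answers.
function needsInjectionThreatModel(a: ThreatAnswers): boolean {
  const yesCount = [
    a.userTextReachesPrompt,
    a.readsUntrustedContent,
    a.hasConsequentialTools,
  ].filter((answer) => answer).length;
  return yesCount >= 2;
}

type Trust = "trusted" | "untrusted";

// Every piece of text sent to the model carries its origin and trust level.
interface PromptSegment {
  trust: Trust;
  source: string; // e.g. "system_prompt", "user_message", "web_page"
  text: string;
}

// Consequential tools need human approval while untrusted text is in context.
function toolCallAllowed(
  context: PromptSegment[],
  toolIsConsequential: boolean,
  humanApproved: boolean,
): boolean {
  const tainted = context.some((segment) => segment.trust === "untrusted");
  if (!toolIsConsequential || !tainted) return true;
  return humanApproved;
}
```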

24. Refusal policy

When should the model refuse? Write the list: off-topic, abusive, out-of-scope, illegal, privacy-violating. For each, write the refusal phrasing and the user-visible follow-up. A clear refusal policy in the system prompt prevents both over-refusal (which users hate) and under-refusal (which lawyers hate).

25. Audit log

Every model call gets logged with: timestamp, user ID, prompt, completion, model, latency, cost, and outcome (succeeded, refused, failed). Store the log for the duration set in the retention policy from item 13. The log is non-negotiable for any product with compliance requirements and a lifesaver for the first user-reported issue regardless of compliance.
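The fields listed above map directly onto a record type. A sketch with hypothetical names (`AuditLogEntry`, `isExpired`); the retention check assumes a single retention window in days, which is a simplification of a real policy.

```typescript
type CallOutcome = "succeeded" | "refused" | "failed";

interface AuditLogEntry {
  timestamp: string; // ISO 8601, UTC
  userId: string;
  prompt: string;
  completion: string;
  model: string;
  latencyMs: number;
  costUsd: number;
  outcome: CallOutcome;
}

// True when the entry is older than the retention window and should be purged.
function isExpired(entry: AuditLogEntry, retentionDays: number, now: Date): boolean {
  const ageMs = now.getTime() - new Date(entry.timestamp).getTime();
  return ageMs > retentionDays * 24 * 60 * 60 * 1000;
}
```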

Part 7: Ops (5 items)

26. Per-request observability

Every request is traceable end-to-end: which model, which prompt version, which retrieved documents, which tool calls, which final output, how long each step took, how much it cost. Langfuse, LangSmith, Helicone, and Braintrust all do this; pick one and instrument on day one. Trying to debug an AI system without traces is like debugging a database without query logs.

27. Cost monitoring + alerts

Daily and monthly cost dashboards by environment and by user cohort. Alert on a 50% deviation from baseline. The alert catches runaway loops, prompt injection used to drain your budget, and accidental model upgrades. The first time a user finds a way to make your agent call itself recursively, the alert is what tells you before the invoice does.
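The 50% rule is a one-line comparison once a baseline exists. The sketch below assumes the baseline is the mean of recent daily spend; `trailingMean`, `costDeviation`, and `shouldAlert` are hypothetical names.

```typescript
// Baseline: mean spend over the previous days.
function trailingMean(dailyUsd: number[]): number {
  if (dailyUsd.length === 0) throw new Error("no history");
  return dailyUsd.reduce((sum, day) => sum + day, 0) / dailyUsd.length;
}

// Relative deviation from baseline: 0.6 means 60% above, -0.6 means 60% below.
function costDeviation(todayUsd: number, baselineUsd: number): number {
  if (baselineUsd <= 0) throw new Error("baseline must be positive");
  return (todayUsd - baselineUsd) / baselineUsd;
}

// Alert in both directions: a sharp drop can mean the AI path is broken.
function shouldAlert(todayUsd: number, baselineUsd: number, threshold = 0.5): boolean {
  return Math.abs(costDeviation(todayUsd, baselineUsd)) >= threshold;
}
```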

28. Rate limits per user/tenant

Per-user and per-tenant ceilings on requests per minute, requests per day, and dollar spend per day. Without these, one bad actor or one buggy customer integration burns your model budget in an hour. Upstash, Vercel KV, or a simple Postgres counter all work; the pattern is more important than the tool.
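The pattern itself fits in a few lines. This sketch uses a fixed-window counter held in an in-memory `Map`, which is an assumption for illustration only; a multi-instance deployment would keep the counters in a shared store such as the ones named above. `FixedWindowLimiter` and `tryConsume` are hypothetical names.

```typescript
interface UsageWindow {
  startedAtMs: number;
  used: number;
}

class FixedWindowLimiter {
  private windows = new Map<string, UsageWindow>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Records `amount` against `key` and returns true if it fits in the
  // current window; returns false without recording when it does not.
  tryConsume(key: string, amount: number, nowMs: number): boolean {
    const current = this.windows.get(key);
    const active =
      current && nowMs - current.startedAtMs < this.windowMs
        ? current
        : { startedAtMs: nowMs, used: 0 }; // start a fresh window
    this.windows.set(key, active);
    if (active.used + amount > this.limit) return false;
    active.used += amount;
    return true;
  }
}
```

The same class covers both ceilings: `new FixedWindowLimiter(60, 60_000)` with an amount of 1 per call limits requests per minute, and `new FixedWindowLimiter(5, 86_400_000)` with the request cost as the amount limits dollar spend per day.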

29. Model versioning + rollback

Prompts are versioned, model parameters are versioned, retrieval configs are versioned, and you can roll back to any prior version in under five minutes. Git tags work; prompt management tools work better. The first time a prompt change breaks production at 11pm, the rollback path is what determines whether it is a 10-minute incident or a four-hour outage.
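What a rollback path needs at minimum is an append-only history plus a pointer to the active version. A sketch with hypothetical names (`PromptVersion`, `PromptRegistry`); it keeps state in memory, whereas a real setup would persist it in git, a database, or a prompt management tool.

```typescript
interface PromptVersion {
  version: number;
  template: string;
  model: string;
  temperature: number;
}

class PromptRegistry {
  private published: PromptVersion[] = [];
  private activeVersion = 0; // 0 means nothing published yet

  // Appends a new version and makes it active.
  publish(template: string, model: string, temperature: number): PromptVersion {
    const entry: PromptVersion = {
      version: this.published.length + 1,
      template,
      model,
      temperature,
    };
    this.published.push(entry);
    this.activeVersion = entry.version;
    return entry;
  }

  // Moves the active pointer back; history is never rewritten.
  rollbackTo(version: number): PromptVersion {
    const target = this.published.find((v) => v.version === version);
    if (!target) throw new Error(`unknown prompt version ${version}`);
    this.activeVersion = version;
    return target;
  }

  active(): PromptVersion {
    const current = this.published.find((v) => v.version === this.activeVersion);
    if (!current) throw new Error("no prompt published yet");
    return current;
  }
}
```

Because a rollback only moves a pointer, it takes seconds, and the broken version stays in history for the post-incident review.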

30. On-call for AI-specific failures

Who responds when the model provider has an outage at 3am? When eval scores drop overnight? When cost spikes 5x? AI failures look nothing like normal infrastructure failures and require someone who understands the stack. Put the rotation in writing, even if the rotation is one person - that person should know they are responsible.

Skip if you must - the 8 items you can defer past v1

Most of the 30 items are not truly deferrable: you will pay the cost either now or later - usually with interest. The following 8, however, can defensibly wait until you have product signal and paying users, provided you write down that you have deferred them:

  • Item 7 (fallback provider): defer only if your product can tolerate a four-hour outage. Internal tools usually can; customer-facing chat usually cannot.
  • Item 12 (compliance): defer only if you are shipping to friendly beta users who have signed an NDA. The day you charge a regulated buyer, this stops being deferrable.
  • Item 17 (eval gate): defer the gating, not the eval itself. Run evals manually on PRs until you have shipped three prompt regressions, then wire up the gate.
  • Item 21 (sync vs async): default sync at MVP, move long-running work to async when latency complaints start.
  • Item 25 (audit log): log to a simple Postgres table at MVP; upgrade to immutable storage when compliance shows up.
  • Item 27 (cost alerts): a daily Slack message of yesterday's spend is fine at MVP; real alerting can wait two months.
  • Item 28 (rate limits): defer per-tenant limits if you have fewer than 10 users; never defer per-user limits.
  • Item 30 (on-call): the founder is the on-call at MVP. Formalize the rotation when the team grows past three.

Everything else on the checklist will cost more to add later than to do now. The whole point of the checklist is to make the decision visible - defer with intent, not by accident.

Download / template

Here is the full 30-item checklist as plain markdown. Copy it into your spec doc, answer each item in a paragraph, and commit it alongside your code. If a future engineer has to make a decision you made implicitly, the answers are right there.

  • Product: outcome / failure tolerance / user persona / 3 good examples / explicit non-goals
  • Model: primary provider / fallback provider / cost ceiling per request / p95 latency budget
  • Data: inputs enumerated / source systems and access / PII and compliance / retention policy
  • Evaluation: 50+ labelled examples / pass-fail rubric / runs on every PR / gates deploys
  • Architecture: workflow vs agent / RAG vs fine-tune vs long-context vs tools / streaming vs batch / sync vs async
  • Safety: autonomous vs approved actions / injection threat model / refusal policy / audit log
  • Ops: per-request observability / cost monitoring and alerts / per-user rate limits / model versioning and rollback / on-call coverage

The MVP this checklist supports lives inside a normal product scope. The cost ranges for the surrounding build are in the build MVP for startup cost post, and the boring-but-load-bearing stack choices around it are in the SaaS MVP tech stack post. If you want me to run the checklist with you and build the thing afterwards, the full menu is on the MVP development services page and the AI-specific work is on the AI integration services page. You can hire me directly via the hire an AI developer in Kosovo page - same fixed-scope shape as the case studies on the home page.

Frequently asked questions

What is an AI MVP checklist and why do I need one?

An AI MVP checklist is the set of architectural, product, and operational decisions you should lock in before writing code. It matters more than for a normal MVP because AI systems are probabilistic - bad week-one decisions about evals, fallback models, or data retention compound into rebuilds at month three. The 30 items above cover the failure modes I see most often in client rescues.

How long should it take to fill out the checklist?

Two focused sessions of 90 minutes each, ideally with the founder and the lead engineer in the same room. The first session covers product, model, and data. The second covers eval, architecture, safety, and ops. Any item you cannot answer in three minutes is a flag worth investigating before code starts.

Can I skip the eval section if my MVP is small?

No. The eval section is the single highest-leverage part of the checklist. A 50-example labelled set that runs on every PR catches more regressions than any other practice in AI engineering. Skipping it is the most reliable way to ship an AI MVP that silently degrades after the third prompt change.

Do I really need a fallback model on day one?

Yes if your product breaks when the primary provider has an outage. OpenAI, Anthropic, and Google all have multi-hour incidents at least quarterly. A fallback is a 20-line abstraction at day one and a 2-week refactor at month six. Cost is near zero until you actually fail over.

How do I pick between RAG, fine-tuning, long context, and tools?

Default to RAG plus tools for almost everything. Long context is a fallback when your knowledge base is small enough to fit in 200K tokens and changes slowly; when freshness matters, use RAG. Fine-tuning is for behavior shaping or domain language, not for adding knowledge. Most production systems end up hybrid - RAG for retrieval, tools for actions, a light fine-tune for tone if needed.

What is the most expensive item to retrofit later?

The human-in-the-loop approval surface. If your AI takes autonomous action against external systems - sending email, writing to a CRM, moving money, posting publicly - the review queue you wish you had built in week one becomes a six-week project in month four. Build the surface even if you do not enforce it yet.

How many of the 30 items can I genuinely defer?

Eight of them, listed in the skip-if-you-must section above. The other 22 are not truly deferrable: you pay for the decision either now or later. Deferring them does not save time; it moves cost from build to rescue.

Do I need this checklist if I am buying instead of building?

Yes, slightly trimmed. The product, data, safety, and ops sections still apply when you wire up a SaaS AI tool. The model, architecture, and eval sections compress because the vendor owns those - but you still need to know their answers before you commit to a multi-year contract.