AI Agent Design Patterns: Reflection, Planning, Tool Use
By Ergini, Software & AI Developer in Pristina, Kosovo
TL;DR
Most things called agents should be workflows. The six patterns that justify true agentic behaviour are reflection, planning, ReAct, multi-agent debate, hierarchical agents, and tool use. Here is where each one earns its complexity, with shipped examples.
Most things called agents should be workflows. That is the single most useful sentence in Anthropic's Building Effective Agents post, and it is the first thing I tell every client who walks in wanting to ship an agentic system. A workflow is a deterministic pipeline where you, the engineer, wrote the steps. An agent is a loop where the model picks the next step. Workflows win on cost, latency, and reliability whenever the steps are knowable in advance. Agents win only when the steps are not knowable - and most production tasks make the steps knowable if you look hard enough.
This post is the set of agent design patterns I actually use after shipping three production systems that contain real agentic loops: OmniAPI's function-generation layer, Caldra's outreach planner, and Xandidate's candidate research agent. If you are still deciding whether to go agentic at all, start with the comparison in AI workflow vs agent first, then come back here for the patterns.
When agentic complexity is justified
Four conditions have to hold before agentic behaviour beats a deterministic pipeline. Miss any one and the workflow wins:
- The input space is too large to enumerate. If you can list the request types on a whiteboard, route them with a classifier and ship a workflow per type. Agents earn their cost when the request shape is genuinely open-ended.
- The next step depends on the previous output. If step three is "always call the same API with the result of step two," that is a workflow with two stages. If step three could be any of eight tools depending on what step two returned, that is an agent.
- The tool surface is broad enough that routing dominates orchestration. With three tools you can hardcode a router. With twenty, the model picking is cheaper than your if-else tree and handles edge cases better.
- You accept higher cost and variance. Agents are 5 to 50x the cost of an equivalent workflow and their p95 latency is measured in tens of seconds. If your product has tight latency budgets or thin margins, the workflow is the answer even if it covers 80% instead of 90% of cases.
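The four conditions above can be written down as an explicit checklist, which makes the "miss any one and the workflow wins" rule hard to argue around in a design review. This is a minimal sketch: the type and field names are hypothetical, made up for illustration, and not taken from any library.

```typescript
// Hypothetical checklist: one boolean per condition from the list above.
interface AgenticChecklist {
  inputSpaceTooLargeToEnumerate: boolean;
  nextStepDependsOnPreviousOutput: boolean;
  toolSurfaceTooBroadToHardcode: boolean;
  higherCostAndVarianceAccepted: boolean;
}

type Verdict =
  | { mode: "agent" }
  | { mode: "workflow"; failed: (keyof AgenticChecklist)[] };

// All four conditions must hold; a single miss sends you back to a workflow.
function chooseArchitecture(checklist: AgenticChecklist): Verdict {
  const keys = Object.keys(checklist) as (keyof AgenticChecklist)[];
  const failed = keys.filter((key) => !checklist[key]);
  return failed.length === 0 ? { mode: "agent" } : { mode: "workflow", failed };
}

// Example: only three tools, so a hardcoded router is enough.
const verdict = chooseArchitecture({
  inputSpaceTooLargeToEnumerate: true,
  nextStepDependsOnPreviousOutput: true,
  toolSurfaceTooBroadToHardcode: false,
  higherCostAndVarianceAccepted: true,
});
console.log(verdict);
```

Returning the failed conditions, not just a yes or no, tells you which part of the task to restructure if you still want to stay on a workflow.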
If all four hold, agentic is genuinely the right tool. The remaining question is which pattern. Here are the six that ship.
Pattern 1 - Reflection
Reflection is the simplest agentic pattern: generate a draft, critique it against an explicit rubric, then refine. The critique step is the load-bearing piece - without it, you are running three identical samples and praying. With a sharp rubric, reflection routinely lifts output quality 10 to 25 percentage points on tasks where correctness is verifiable but not trivially so: code review, summary fidelity, plan soundness, draft email tone.
A minimal reflection loop using the Vercel AI SDK:
// lib/reflection.ts
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Hard cap on critique-and-refine rounds after the first draft.
const MAX_REFINES = 2;

export async function reflectAndRefine(task: string) {
  const draft = await generateText({
    model: openai(`gpt-4o-mini`),
    prompt: `Task: ${task}\n\nWrite a first draft.`,
  });
  let current = draft.text;
  for (let i = 0; i < MAX_REFINES; i++) {
    // The critique uses its own prompt and rubric, not the generator's.
    const critique = await generateText({
      model: openai(`gpt-4o-mini`),
      prompt: `Task: ${task}\n\nDraft:\n${current}\n\nCritique against this rubric:
1. Does it answer the task literally?
2. Are claims grounded?
3. Is it the right length and tone?
Return JSON: { "issues": string[], "good_enough": boolean }`,
    });
    let parsed: { issues: string[]; good_enough: boolean };
    try {
      parsed = JSON.parse(critique.text);
    } catch {
      // The critique was not valid JSON: keep the current draft rather than crash.
      return current;
    }
    if (parsed.good_enough || parsed.issues.length === 0) return current;
    const refined = await generateText({
      model: openai(`gpt-4o-mini`),
      prompt: `Task: ${task}\n\nDraft:\n${current}\n\nIssues to fix:\n${parsed.issues.join("\n")}\n\nRewrite the draft fixing each issue.`,
    });
    current = refined.text;
  }
  return current;
}

Two details matter. The critique has to use a different prompt than the generator, or the model just rubber-stamps its own work. And the max-refine cap (MAX_REFINES = 2) is non-negotiable - one draft plus two refines gives three passes, which is about where reflection hits diminishing returns, and unbounded reflection is one of the most expensive ways to burn tokens.
Pattern 2 - Planning (planner-executor)
The planner-executor pattern splits the agent into two roles: a planner that produces a high-level plan from the task, and an executor that runs each step. The planner is called once at the start (and occasionally re-called on failure). The executor handles the per-step tool calls and state.
This pattern wins when the task is decomposable into 3 to 10 steps that the model can reason about up front. Research workflows, multi-stage data pulls, and report generation are the usual fit. The cost-control win is real: a single planning call early lets you reject the plan before paying for execution.
// lib/planner-executor.ts
import { generateObject, generateText, stepCountIs } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
// Assumption: stepTools holds the search, fetch, and summarise tool() definitions.
// The module path is a placeholder for wherever you define them.
import { stepTools } from "./step-tools";

const PlanSchema = z.object({
  steps: z.array(
    z.object({
      id: z.string(),
      goal: z.string(),
      tool: z.enum([`search`, `fetch`, `summarise`]),
    })
  ),
});

export async function planAndExecute(task: string) {
  // One call to the stronger model produces the whole plan up front.
  const { object: plan } = await generateObject({
    model: openai(`gpt-4o`),
    schema: PlanSchema,
    prompt: `Task: ${task}\n\nProduce a step plan (3 to 8 steps).`,
  });
  const results: Record<string, string> = {};
  for (const step of plan.steps) {
    const { text } = await generateText({
      model: openai(`gpt-4o-mini`),
      tools: stepTools,
      // Allow the tool call plus a follow-up turn so `text` holds the model's write-up;
      // by default generateText stops after a single step.
      stopWhen: stepCountIs(3),
      prompt: `Step goal: ${step.goal}\n\nPrevious results:\n${JSON.stringify(results)}\n\nUse the ${step.tool} tool to complete this step.`,
    });
    results[step.id] = text;
  }
  return results;
}

The planning call uses a stronger model (gpt-4o) and the per-step calls use a cheaper one (gpt-4o-mini). That asymmetry is intentional - the plan is high-leverage and benefits from the smarter model, while the executor is doing mechanical work and does not.
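The cost-control win depends on actually checking the plan before the executor starts. Here is a minimal sketch of such a gate, assuming a plan shape that mirrors PlanSchema (id, goal, tool). The PlanLimits type and the validatePlan function are hypothetical names for illustration, not part of the AI SDK.

```typescript
// Hypothetical plan shape mirroring PlanSchema: id, goal, tool.
type ToolName = "search" | "fetch" | "summarise";

interface PlanStep {
  id: string;
  goal: string;
  tool: ToolName;
}

interface Plan {
  steps: PlanStep[];
}

// Per-task limits; allowedTools can be a subset of the full tool list.
interface PlanLimits {
  minSteps: number;
  maxSteps: number;
  allowedTools: ToolName[];
}

// Returns human-readable problems; an empty list means the plan may run.
function validatePlan(plan: Plan, limits: PlanLimits): string[] {
  const problems: string[] = [];
  const count = plan.steps.length;
  if (count < limits.minSteps || count > limits.maxSteps) {
    problems.push(`plan has ${count} steps, expected ${limits.minSteps} to ${limits.maxSteps}`);
  }
  const seen = new Set<string>();
  for (const step of plan.steps) {
    if (seen.has(step.id)) problems.push(`duplicate step id: ${step.id}`);
    seen.add(step.id);
    if (limits.allowedTools.indexOf(step.tool) === -1) {
      problems.push(`step ${step.id} uses disallowed tool: ${step.tool}`);
    }
    if (step.goal.trim() === "") problems.push(`step ${step.id} has an empty goal`);
  }
  return problems;
}
```

If the list comes back non-empty, you can re-call the planner with the problems appended to the prompt, or reject the request outright. Either way you have paid for one planning call and nothing else.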
Pattern 3 - ReAct
ReAct (Reason + Act) interleaves thought, action, and observation in a single loop. The model writes a reasoning trace, calls a tool, reads the result, reasons again, and either calls another tool or answers. It is the default pattern for modern tool-using agents because it maps directly onto the function-calling APIs from OpenAI and Anthropic, and the multi-step tool loop in the Vercel AI SDK.
// lib/react-agent.ts
import { generateText, stepCountIs, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// searchDocs, runSql, and fetchUrl are app-specific helpers assumed to be defined elsewhere.
const tools = {
  search: tool({
    description: `Search the knowledge base. Use for factual questions about our docs.`,
    inputSchema: z.object({ query: z.string() }),
    execute: async ({ query }) => searchDocs(query),
  }),
  sql: tool({
    description: `Run a read-only SQL query. Use for current numbers from the DB.`,
    inputSchema: z.object({ sql: z.string() }),
    execute: async ({ sql }) => runSql(sql),
  }),
  fetch: tool({
    description: `Fetch a URL. Use only for known-safe internal URLs.`,
    inputSchema: z.object({ url: z.string().url() }),
    execute: async ({ url }) => fetchUrl(url),
  }),
};

export async function reactAgent(question: string) {
  const result = await generateText({
    model: openai(`gpt-4o-mini`),
    tools,
    // Hard step cap: the loop ends after 8 steps no matter what the model wants.
    stopWhen: stepCountIs(8),
    system: `You are an analyst. Think step by step. Call tools when you need data. Stop and answer when you have enough.`,
    prompt: question,
  });
  return { text: result.text, steps: result.steps };
}

The stopWhen: stepCountIs(8) guard is the most important line in that file. ReAct without a step cap is the textbook way to ship an outage. For tool-calling depth, also see tool calling best practices - the failure modes of ReAct are mostly failure modes of badly described tools.
Pattern 4 - Multi-agent debate / debate-and-vote
Multi-agent debate runs N agents in parallel on the same task, then either votes on the answer (majority wins) or runs a debate round where each agent sees the others' answers and revises. This pattern wins on judgment-heavy tasks where independent reasoning genuinely beats single-thread reasoning: code review, hiring decisions, policy questions, anything where reasonable experts could disagree.
The math: three parallel agents on gpt-4o-mini cost no more than one gpt-4o call, with often higher quality on subjective tasks because you get three independent samples instead of one biased one. The risk: debate can converge on a confidently wrong consensus if the prompt biases all agents the same way. Use diverse prompts (different personas, different rubrics) or even different models for the agents to keep the samples independent.
// lib/debate.ts
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const personas = [
  `You are a strict reviewer who weights correctness above all else.`,
  `You are a pragmatic reviewer who weights shippability and clarity.`,
  `You are a security-minded reviewer who looks for footguns.`,
];

export async function debateAndVote(task: string) {
  const drafts = await Promise.all(
    personas.map((sys) =>
      generateText({
        model: openai(`gpt-4o-mini`),
        system: sys,
        prompt: task,
      }).then((r) => r.text)
    )
  );
  const judge = await generateText({
    model: openai(`gpt-4o`),
    prompt: `Three reviewers answered the task below. Synthesise the best answer, citing where each reviewer was right.\n\nTask: ${task}\n\nReviewer 1:\n${drafts[0]}\n\nReviewer 2:\n${drafts[1]}\n\nReviewer 3:\n${drafts[2]}`,
  });
  return { final: judge.text, drafts };
}

Skip debate for retrieval-style tasks. There is no judgment to debate - the answer is in the corpus or it is not. Save debate for the cases where the model has to weigh trade-offs.
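The example above synthesises with a judge. The plain vote variant needs no extra model call when the agents return short, comparable answers such as "approve" or "reject". A minimal sketch, assuming a strict majority rule and a hypothetical majorityVote helper:

```typescript
// Normalise answers so "Approve" and " approve " count as the same vote.
function normalise(answer: string): string {
  return answer.trim().toLowerCase();
}

interface VoteResult {
  winner: string | null; // null when no answer has a strict majority
  counts: Map<string, number>;
}

// Strict majority: the winner needs more than half of all votes.
function majorityVote(answers: string[]): VoteResult {
  const counts = new Map<string, number>();
  for (const answer of answers) {
    const key = normalise(answer);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let winner: string | null = null;
  for (const [key, count] of counts) {
    if (count * 2 > answers.length) winner = key;
  }
  return { winner, counts };
}
```

A null winner is a useful signal in itself: it is the point at which to fall back to a debate round or a judge model rather than pick arbitrarily.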
Pattern 5 - Hierarchical agents (manager + workers)
Hierarchical agents put a manager agent on top that decomposes the task into subtasks and dispatches each to a specialised worker agent. The manager keeps the top-level plan and aggregates results. Each worker is a focused ReAct agent with its own narrower tool surface.
This pattern earns its complexity when the task is too broad to fit one agent's context window or tool roster. Research projects spanning multiple domains, multi-team workflows, and long-horizon investigations are the canonical fit. The cost is steep - you are essentially running N agents and an orchestrator - and the debuggability is worse than any other pattern. Reach for it only when single-agent ReAct visibly runs out of room. The closest thing to a robust framework for this is LangGraph from LangChain, which models the manager-worker topology as a stateful graph.
Pattern 6 - Tool use as the primary loop
The thinnest form of agent: barely any planning, no reflection, no debate, just a model with a clean tool surface running a tight loop until it can answer. This is the right pattern for the largest single class of production agents - anything that is fundamentally "fetch some data, transform it, return it" with a model choosing which fetch and transform calls to make.
The whole engineering effort lives in the tool definitions, not the agent loop. Three tools written like product specs (clear name, clear description, strict schema, predictable errors) beat ten tools written like an afterthought. If you are building tools, see building an MCP server in TypeScript for the cleanest way to expose them - MCP gives you a transport-neutral tool boundary that any agent can consume.
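To make "written like product specs" concrete, here is a minimal sketch of one tool with a clear name, a description that says when to use it and when not to, strict input parsing, and predictable errors. ToolSpec, ToolResult, callTool, and the in-memory ORDERS data are all hypothetical, invented for illustration. In a real system the schema would more likely be a zod object and the lookup a real backend call.

```typescript
// Predictable result envelope: callers branch on `ok` and never need try/catch.
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: { code: "INVALID_INPUT" | "NOT_FOUND"; message: string } };

interface ToolSpec<I, O> {
  name: string;
  description: string; // what it does, when to use it, when not to
  parse: (raw: unknown) => I | null; // strict input validation
  run: (input: I) => ToolResult<O>;
}

// Hypothetical in-memory data standing in for a real backend.
const ORDERS: Record<string, { status: string } | undefined> = {
  ord_1: { status: "shipped" },
};

const getOrderStatus: ToolSpec<{ orderId: string }, { status: string }> = {
  name: "get_order_status",
  description:
    "Look up the shipping status of one order by ID. Use when the user names a specific order. Do not use for listing or searching orders.",
  parse: (raw) => {
    if (typeof raw !== "object" || raw === null) return null;
    const id = (raw as Record<string, unknown>).orderId;
    return typeof id === "string" && /^ord_\w+$/.test(id) ? { orderId: id } : null;
  },
  run: ({ orderId }) => {
    const order = ORDERS[orderId];
    return order
      ? { ok: true, data: { status: order.status } }
      : { ok: false, error: { code: "NOT_FOUND", message: `No order with id ${orderId}` } };
  },
};

// Validate first, then run: bad arguments never reach the tool body.
function callTool<I, O>(spec: ToolSpec<I, O>, raw: unknown): ToolResult<O> {
  const input = spec.parse(raw);
  if (input === null) {
    return { ok: false, error: { code: "INVALID_INPUT", message: `Bad arguments for ${spec.name}` } };
  }
  return spec.run(input);
}
```

The error codes matter as much as the happy path. A model that gets back NOT_FOUND with a readable message can correct itself on the next step, while a thrown stack trace usually sends it into a retry loop.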
Comparison: when each pattern earns its keep
| Pattern | Complexity | Cost / task | p95 latency | When to use |
|---|---|---|---|---|
| Reflection | Low | ~$0.005 to $0.02 | 3 to 6 s | Quality matters, rubric is writable |
| Planning (planner-executor) | Medium | ~$0.02 to $0.08 | 5 to 15 s | 3 to 10 step tasks, plan is reviewable |
| ReAct | Low to medium | ~$0.01 to $0.10 | 2 to 20 s | Default for tool-using agents |
| Multi-agent debate | Medium | ~$0.03 to $0.15 | 5 to 12 s | Judgment-heavy, subjective tasks |
| Hierarchical (manager + workers) | High | ~$0.20 to $5+ | 30 s to minutes | Broad scope, single agent runs out of room |
| Tool-use loop | Low | ~$0.005 to $0.05 | 2 to 10 s | Fetch-and-transform with clean tools |
Reflection plus tool use is the workhorse combination in production. Most of the agents I ship are a ReAct loop with a one-shot reflection pass at the end on the final answer, and that covers 80% of real use cases at low cost and bounded latency.
Combining patterns
Real production agents almost always compose two patterns. The combinations I reach for, in order of frequency:
- ReAct + reflection on final answer. Tool-using agent runs the loop, then a single critique-and-refine pass cleans up the output. Adds one extra call, lifts quality meaningfully. This is the default for customer-facing agents.
- Planning + ReAct executor. A planner emits the step list, each step is executed by a focused ReAct sub-loop with its own tools. Better than monolithic ReAct when the task is long.
- Planning + debate on the plan. Generate three plans in parallel, vote on the best one, then execute. Worth it when the plan is high-stakes - the cost is paid once at the top and saves you from executing a bad plan to completion.
- ReAct + human-in-the-loop gates. Agent runs autonomously up to a checkpoint, then pauses for human approval before continuing with side-effectful tools. See human-in-the-loop AI for the patterns that ship.
Anti-patterns that burn time and money
Every agentic system I have shipped has had at least one of these bugs at some point. Knowing the shape ahead of time saves you the outage.
Endless loops
The agent calls a tool, reads the result, decides it needs another call, fires it, reads, decides again, forever. Without a hard step cap enforced in code (not a prompt suggestion), this is how every agent eventually fails. stepCountIs(8) in the Vercel AI SDK, explicit counter in your own loop - pick one, enforce it, alert when it triggers.
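For a hand-rolled loop, the explicit counter can look like the sketch below. It assumes the model call and the tool runner are injected as functions. Decision, LoopDeps, and runCappedLoop are hypothetical names, and decide stands in for whatever model call you use.

```typescript
// One turn of the loop: either a tool call to make or a final answer.
type Decision =
  | { kind: "tool"; name: string; args: unknown }
  | { kind: "answer"; text: string };

interface LoopDeps {
  decide: (history: string[]) => Promise<Decision>; // stand-in for the model call
  runTool: (name: string, args: unknown) => Promise<string>;
  onCapHit: (steps: number) => void; // hook for alerting
}

interface LoopResult {
  text: string | null; // null when the cap ended the loop
  steps: number;
  capped: boolean;
}

async function runCappedLoop(
  question: string,
  maxSteps: number,
  deps: LoopDeps
): Promise<LoopResult> {
  const history: string[] = [question];
  for (let step = 1; step <= maxSteps; step++) {
    const decision = await deps.decide(history);
    if (decision.kind === "answer") {
      return { text: decision.text, steps: step, capped: false };
    }
    const observation = await deps.runTool(decision.name, decision.args);
    history.push(`${decision.name} -> ${observation}`);
  }
  // The cap is enforced here in code, whatever the prompt says.
  deps.onCapHit(maxSteps);
  return { text: null, steps: maxSteps, capped: true };
}
```

The for loop bound is the cap. The model has no way to extend it, and onCapHit gives you the alert when it triggers.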
Agents for everything
The most expensive anti-pattern in the field right now. A team has one agentic use case that works, so they retrofit every flow into an agentic loop. The login flow becomes an agent. The settings page becomes an agent. Every one of those is a workflow with the cost and unreliability of an agent. Audit your agentic surface every quarter and push everything that does not need the flexibility back to deterministic code.
No evaluation gate
Shipping a prompt change to an agent without running it against a labelled eval set is shipping blind. Agents are sensitive to small prompt changes in ways that are hard to predict, and a regression that drops trajectory efficiency by 20% can be invisible in spot-checks. CI eval is not optional for agentic systems.
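A CI gate for this can be small. The sketch below compares a candidate run against a stored baseline and fails on either a success-rate drop or a rise in average steps per task. Using steps per task as the proxy for trajectory efficiency is my assumption here, and the type names and thresholds are hypothetical.

```typescript
// Aggregate metrics from one run of the labelled eval set.
interface EvalSummary {
  successRate: number; // 0 to 1
  avgStepsPerTask: number; // assumed > 0
}

interface GateThresholds {
  maxSuccessDrop: number; // absolute, e.g. 0.02 = two percentage points
  maxStepIncrease: number; // relative, e.g. 0.1 = 10% more steps
}

interface GateResult {
  pass: boolean;
  reasons: string[];
}

function evalGate(
  baseline: EvalSummary,
  candidate: EvalSummary,
  thresholds: GateThresholds
): GateResult {
  const reasons: string[] = [];
  const successDrop = baseline.successRate - candidate.successRate;
  if (successDrop > thresholds.maxSuccessDrop) {
    reasons.push(`success rate dropped by ${(successDrop * 100).toFixed(1)} points`);
  }
  const stepIncrease =
    (candidate.avgStepsPerTask - baseline.avgStepsPerTask) / baseline.avgStepsPerTask;
  if (stepIncrease > thresholds.maxStepIncrease) {
    reasons.push(`average steps per task rose by ${(stepIncrease * 100).toFixed(1)}%`);
  }
  return { pass: reasons.length === 0, reasons };
}
```

Wire the result into CI by failing the build when pass is false. The reasons array is what goes into the pull request comment.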
No max-step or cost ceiling
I have watched teams ship agents with no per-request cost ceiling and then get a $40,000 bill from a single user running a bad query in a loop. Always compute the expected max cost (max steps times per-step cost) and either reject the request or alert when it crosses a threshold. Per-tenant daily caps are equally non-negotiable.
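The worst-case arithmetic is simple enough to run at admission time, before the first model call. A minimal sketch, assuming you track per-tenant spend somewhere and pass it in. CostPolicy and admitRequest are hypothetical names, and the dollar figures in the usage example are placeholders.

```typescript
interface CostPolicy {
  maxSteps: number;
  maxCostPerStepUsd: number;
  requestCeilingUsd: number;
  tenantDailyCapUsd: number;
}

type Admission =
  | { allowed: true; worstCaseUsd: number }
  | { allowed: false; reason: "request_ceiling" | "tenant_daily_cap"; worstCaseUsd: number };

// Worst case = max steps times per-step cost; checked before any tokens are spent.
function admitRequest(policy: CostPolicy, tenantSpentTodayUsd: number): Admission {
  const worstCaseUsd = policy.maxSteps * policy.maxCostPerStepUsd;
  if (worstCaseUsd > policy.requestCeilingUsd) {
    return { allowed: false, reason: "request_ceiling", worstCaseUsd };
  }
  if (tenantSpentTodayUsd + worstCaseUsd > policy.tenantDailyCapUsd) {
    return { allowed: false, reason: "tenant_daily_cap", worstCaseUsd };
  }
  return { allowed: true, worstCaseUsd };
}

// Placeholder numbers: 8 steps at up to $0.01 each, $0.10 per request, $5 per tenant per day.
const policy: CostPolicy = {
  maxSteps: 8,
  maxCostPerStepUsd: 0.01,
  requestCeilingUsd: 0.1,
  tenantDailyCapUsd: 5,
};
console.log(admitRequest(policy, 1.0));
```

Checking the tenant cap against spend plus worst case, not spend alone, means a tenant can never end the day over the cap because of one request that was already in flight.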
Tool descriptions written by engineers, not product people
Tool selection accuracy lives or dies on the description string. A tool described as "Run SQL" gets called for everything. A tool described as "Run a read-only SQL query against the analytics warehouse. Use when the question asks for counts, sums, or trends over time. Do not use for current user state - that lives in the API." gets called when it should be called. Write descriptions like sales copy: what the tool does, when to use it, when not to.
Building blocks I always include
Independent of which pattern you pick, every agentic system in production needs the same five scaffolding pieces. Skip any of them and you will install them later under pressure during an incident.
- Hard step cap. Enforced in code. 5 to 10 for most ReAct loops, 3 for reflection, 3 to 5 for planner-executor sub-steps.
- Cost ceiling per request. Compute the worst case at request time and reject the request if it exceeds the budget. Cheaper than letting it run.
- Per-tenant daily cost cap. Hard dollar limit per tenant per day. Single biggest defence against billing surprises.
- Trajectory observability. Log every step's model input, tool name, tool arguments, tool result, and latency. You will need every byte the first time a user reports a bad answer.
- Human-in-the-loop gates on side effects. Read-only tools run autonomously. Write tools (send email, write to DB, charge card) pause for human approval. The cost of getting this wrong is much higher than the friction of asking.
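The last two items fit naturally into one wrapper around tool execution: every call is logged, and write tools do not run until a human has approved them. A minimal sketch with hypothetical types (GatedTool, TrajectoryEntry, StepRequest) and synchronous tools to keep it short; real tools would be async and the log would go to your tracing backend.

```typescript
type SideEffect = "read" | "write";

interface GatedTool {
  name: string;
  sideEffect: SideEffect;
  run: (args: unknown) => string;
}

interface TrajectoryEntry {
  step: number;
  modelInput: string;
  tool: string;
  args: unknown;
  result: string | null; // null while waiting for approval
  latencyMs: number;
  status: "executed" | "awaiting_approval";
}

interface StepRequest {
  step: number;
  modelInput: string;
  tool: GatedTool;
  args: unknown;
  approved: boolean; // set by a human reviewer for write tools
}

function executeStep(req: StepRequest, log: TrajectoryEntry[]): TrajectoryEntry {
  const base = {
    step: req.step,
    modelInput: req.modelInput,
    tool: req.tool.name,
    args: req.args,
  };
  let entry: TrajectoryEntry;
  if (req.tool.sideEffect === "write" && !req.approved) {
    // Write tools pause here; nothing runs until a human approves.
    entry = { ...base, result: null, latencyMs: 0, status: "awaiting_approval" };
  } else {
    const started = Date.now();
    const result = req.tool.run(req.args);
    entry = { ...base, result, latencyMs: Date.now() - started, status: "executed" };
  }
  log.push(entry); // every step is logged, including the paused ones
  return entry;
}
```

Logging the paused steps too is deliberate: when a user reports a bad outcome, the trace shows what the agent wanted to do and who approved it.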
Real cases: which pattern each product uses
Three production examples, each using a different pattern, to ground the trade-offs.
OmniAPI uses a tool-use loop with reflection on the final output. The agent generates an API function from a natural language description by calling a retrieval tool against a corpus of OpenAPI specs and SDK examples, drafting the function, then running a critique pass that checks the function against the retrieved spec before returning. ReAct loop capped at 6 steps, one reflection pass, about $0.03 per request. No planner, no debate - the task is bounded enough that a thin agent wins.
Caldra uses planner-executor for outreach orchestration. The planner reads a campaign brief and produces a per-prospect plan (which channels to use, what messages to send in what order, what to do if a reply comes in). The executor runs the plan, calling the channel-specific tools per step. Plan is reviewable and editable by the operator before execution starts - that human-in-the-loop gate is what makes the planner-executor pattern defensible for a side-effectful task.
Xandidate uses multi-agent debate on judgment-heavy candidate evaluation. Three agents with different personas (technical rigor, culture fit, growth potential) score the same candidate independently, then a synthesiser produces the final recommendation with the dissents called out. Debate wins here because the task is genuinely subjective and the dissents are themselves valuable signal for the hiring team.
None of these systems use hierarchical agents. The scope of each task fits inside one agent's context window and tool surface, and the added complexity has never justified itself in my work. Hierarchical is the pattern most likely to be overkill - try every other combination first.
Where to go from here
If you are building retrieval-heavy agents, the architecture work lives in agentic RAG architecture - those patterns and these patterns compose. If you are still comparing agentic to a workflow tier, the deeper trade-off analysis is in AI workflow vs agent. And if you have settled on an agentic system but the tool layer is still rough, the patterns that actually keep tool selection accurate are in tool calling best practices.
If you want help wiring an agent end-to-end, my AI agent development work covers the pattern selection, guardrails, and observability, and AI integration when the agent has to plug into existing systems. I work with teams worldwide and you can also hire an AI developer in Kosovo directly.
Frequently asked questions
What are the core AI agent design patterns?
The patterns that actually earn their complexity in production are reflection (generate, critique, refine), planning (a planner decomposes the task, an executor runs the steps), ReAct (interleaved reasoning and tool calls in one loop), multi-agent debate (parallel agents argue or vote on an answer), hierarchical agents (a manager dispatches to specialised workers), and tool-use loops (a thin agent that mostly just calls tools). Most production systems compose two of these, not all six, and the majority of so-called agents should actually be deterministic workflows.
When should I build an agent instead of a workflow?
Anthropic's Building Effective Agents post is right: workflows beat agents on cost, latency, and reliability whenever the steps are knowable in advance. Reach for true agentic behaviour only when the input space is too large to enumerate, the next step genuinely depends on the previous output, the tool surface is broad enough that hardcoding routes is worse than letting the model pick, and you accept higher cost and variance in exchange for flexibility. If even one of those four conditions fails, ship the workflow.
What is the ReAct pattern in plain terms?
ReAct interleaves thought, action, and observation in a single loop. The model writes a short reasoning trace, calls a tool, reads the result, reasons again, calls another tool, and so on until it answers. It is the default pattern for modern tool-using agents because it maps directly onto the function-calling APIs from OpenAI, Anthropic, and the Vercel AI SDK. The danger is unbounded loops, so always cap the step count and instrument the trace.
How is reflection different from just retrying the prompt?
A retry sends the same input through the same prompt and hopes the sample lands better. Reflection generates a draft, runs a critique step that names specific defects against an explicit rubric, and then runs a refine step that takes the draft plus the critique as input. The critique is the load-bearing piece - without it you are paying for three calls to get the same answer. Reflection is worth the cost when output quality matters more than latency and you can write a sharp rubric.
Do multi-agent systems actually outperform single agents?
Sometimes, and only for tasks where independent reasoning beats single-thread reasoning. Debate-and-vote helps on judgment-heavy work like code review and policy decisions. Hierarchical agents help when scope is broad enough that one agent runs out of context. For most retrieval and tool-use tasks, a single well-prompted ReAct agent beats a multi-agent system on cost, latency, and debuggability. The literature is biased toward novelty, so do not assume more agents means better answers - measure it.
How do I keep an agent from looping forever?
Hard step cap enforced in code, not in the prompt. A cost ceiling per request that aborts the loop before it blows the budget. A wall-clock timeout for user-facing flows. Falling tool-call thresholds so the agent has to commit on later steps. A forced-answer prompt on the final step that tells the model to respond with what it has. Without those five guards, every agentic system eventually has a runaway-loop incident.
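Four of those guards (step cap, cost ceiling, wall-clock timeout, forced answer on the final step) reduce to one pure function that the loop consults before every model call. A minimal sketch with hypothetical names; the falling tool-call thresholds are left out because they depend on how your loop scores tool calls.

```typescript
interface LoopLimits {
  maxSteps: number;
  maxCostUsd: number;
  deadlineMs: number; // wall-clock budget for the whole request
}

interface LoopState {
  stepsTaken: number;
  costSoFarUsd: number;
  elapsedMs: number;
}

type NextAction = "continue" | "force_answer" | "abort";

// Consulted before every model call; the loop obeys the result unconditionally.
function nextAction(state: LoopState, limits: LoopLimits): NextAction {
  if (state.costSoFarUsd >= limits.maxCostUsd) return "abort";
  if (state.elapsedMs >= limits.deadlineMs) return "abort";
  if (state.stepsTaken >= limits.maxSteps) return "abort";
  // Last allowed step: tell the model to answer with what it has, no more tools.
  if (state.stepsTaken === limits.maxSteps - 1) return "force_answer";
  return "continue";
}
```

On "force_answer" the loop swaps in the forced-answer prompt and disables tools for that call. On "abort" it returns a fallback response and raises an alert.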
Which agent framework should I use?
For TypeScript, the Vercel AI SDK gives you tool calling, multi-step loops, and streaming with a clean primitive surface and no orchestration lock-in. For Python with complex graph orchestration, LangGraph from LangChain is the most mature stateful-agent framework. Skip the heavier all-in-one platforms until you have actually felt the pain - most agent systems I ship are 300 to 800 lines of TypeScript on top of the AI SDK, no framework required.
How do I evaluate an AI agent?
Evaluate trajectories, not just final answers. Score whether the agent picked the right tools, whether the steps were necessary, whether the loop terminated cleanly, and whether the cost per task stayed within budget. Add task-level success metrics on a labelled set, and run them in CI on every prompt or model change. Without trajectory evals you cannot tell the difference between a regression and a flaky sample, and you cannot ship agentic systems with confidence.
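A per-trajectory scorer for those four checks can be this small. It is a sketch under assumptions: the Trajectory and Expectation shapes are hypothetical, and "steps were necessary" is approximated by a step budget per labelled task, which is cruder than judging each step individually.

```typescript
interface Trajectory {
  steps: { tool: string }[];
  terminated: "answered" | "step_cap" | "error";
  costUsd: number;
}

// Labelled expectations for one task in the eval set.
interface Expectation {
  requiredTools: string[];
  maxSteps: number;
  budgetUsd: number;
}

interface TrajectoryScore {
  usedRequiredTools: boolean;
  withinStepBudget: boolean;
  terminatedCleanly: boolean;
  withinCostBudget: boolean;
  pass: boolean;
}

function scoreTrajectory(trajectory: Trajectory, expected: Expectation): TrajectoryScore {
  const used = new Set(trajectory.steps.map((step) => step.tool));
  const usedRequiredTools = expected.requiredTools.every((tool) => used.has(tool));
  const withinStepBudget = trajectory.steps.length <= expected.maxSteps;
  const terminatedCleanly = trajectory.terminated === "answered";
  const withinCostBudget = trajectory.costUsd <= expected.budgetUsd;
  return {
    usedRequiredTools,
    withinStepBudget,
    terminatedCleanly,
    withinCostBudget,
    pass: usedRequiredTools && withinStepBudget && terminatedCleanly && withinCostBudget,
  };
}
```

Keeping the four booleans separate, instead of collapsing them into one score, is what lets you tell a tool-selection regression from a cost regression when a CI run goes red.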