April 5, 2026 · AI Engineering · 11 min read

OpenAI Structured Outputs: Strict JSON Schema in 2026

By Ergini, Software & AI Developer

TL;DR

Structured outputs took JSON reliability from 40% to 100%. But schema design still bites you. This is everything I learned shipping structured-output extraction pipelines across 5 client projects.

Before structured outputs, every production JSON pipeline I shipped had a retry loop, a schema validator, and a graveyard of edge cases where the model would emit a stray {, a trailing comma, or a helpful preamble like "Sure! Here is your JSON:". JSON mode, released in 2023, fixed the syntax but not the shape - about 80% of responses matched my schema, and the other 20% paid the retry tax.

Structured outputs killed that loop. The reliability went from "mostly works" to "100% adherence" the first day I switched. Two years in, I have shipped them across five client projects - document extraction, ticket classification, agent routing, form generation, LLM-as-judge evals - and the failure surface is completely different now. This post is the production guide: schema design that actually works, refusals, streaming, migration, edge cases, and the comparison table I wish existed when I started.

What structured outputs actually do

Structured outputs constrain the model's sampling step so the emitted tokens are guaranteed to form valid JSON that matches the JSON Schema you supply. It is enforced at decode time, not validated after the fact - the model literally cannot pick a next token that would break the schema. The grammar compiles once per schema, gets cached on OpenAI's side, and from then on the structured response is free of the "will it parse" class of bug.

Read the official platform.openai.com structured outputs docs if you want the spec. The practical headline is the table below - the difference between "our parser handles three formats now" and "we deleted the parser."

Mode	Adherence	What it guarantees	When to use
Free text	~40%	Nothing. You parse a string.	Chat answers, prose, anything human-read
JSON mode	~80%	Valid JSON syntax, no schema	Legacy. Migrate.
Function calling	~99%	Input matches tool schema	Tool use, agent actions
Structured outputs (strict)	100%	Response matches your JSON Schema exactly	Extraction, classification, routing, evals

The dividing line is simple. If you want the model to do something (call an API, query a database, send an email), use function calling. If you want the model to return structured data, use structured outputs. If you want both - an agent that takes actions and produces a final structured report - use both, in the same call. Under the hood the two are closely related: function calling with strict: true on the tool definition is enforced by the same constrained decoding.

Setting up with Zod (TypeScript)

The TypeScript story is clean. Define your schema once with Zod, wrap it with the OpenAI SDK's helper, and you get a fully typed response. No JSON parsing, no manual validation - the SDK does both for you and throws if the model refused.

// lib/extract-invoice.ts
import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

const openai = new OpenAI();

const LineItem = z.object({
  description: z.string(),
  quantity: z.number().int().positive(),
  unitPriceCents: z.number().int().nonnegative(),
});

const Invoice = z.object({
  invoiceNumber: z.string(),
  issueDate: z.string().describe("ISO 8601 date, e.g. 2026-05-27"),
  vendor: z.object({
    name: z.string(),
    taxId: z.string().nullable(),
  }),
  lineItems: z.array(LineItem).max(50),
  totalCents: z.number().int().nonnegative(),
  currency: z.enum(["USD", "EUR", "GBP", "CHF"]),
});

export async function extractInvoice(rawText: string) {
  const completion = await openai.chat.completions.parse({
    model: "gpt-4o-2024-08-06",
    messages: [
      { role: "system", content: "Extract structured invoice data." },
      { role: "user", content: rawText },
    ],
    response_format: zodResponseFormat(Invoice, "invoice"),
  });

  const message = completion.choices[0].message;
  if (message.refusal) {
    throw new Error(`Model refused: ${message.refusal}`);
  }
  return message.parsed;
}

The return type of extractInvoice is inferred straight from the Zod schema, so downstream code is fully typed. No casts, no any, no runtime surprises. The .parse() method (note: not .create()) is what unlocks the structured response - it accepts a zodResponseFormat and returns a typed parsed field on the message.

A few details worth calling out. The model name must be a snapshot that supports structured outputs - gpt-4o-2024-08-06 or later, or any newer Omni and GPT-5 family snapshot. Older gpt-4o and gpt-4-turbo aliases silently fall back to a non-strict path and you lose the guarantee. Pin the snapshot in production and bump it on a schedule. Also worth knowing: the second argument to zodResponseFormat (the "invoice" name above) shows up in OpenAI's telemetry and request logs, so pick something descriptive - future-you will thank present-you when grepping logs.

Setting up with Pydantic (Python)

The Python story mirrors the TypeScript one, using Pydantic instead of Zod. The SDK accepts a Pydantic model directly via the response_format parameter on the parse helper, and returns a fully typed instance.

# lib/extract_invoice.py
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Literal

client = OpenAI()

class LineItem(BaseModel):
    description: str
    quantity: int = Field(gt=0)
    unit_price_cents: int = Field(ge=0)

class Vendor(BaseModel):
    name: str
    tax_id: str | None

class Invoice(BaseModel):
    invoice_number: str
    issue_date: str = Field(description="ISO 8601 date, e.g. 2026-05-27")
    vendor: Vendor
    line_items: list[LineItem] = Field(max_length=50)
    total_cents: int = Field(ge=0)
    currency: Literal["USD", "EUR", "GBP", "CHF"]

def extract_invoice(raw_text: str) -> Invoice:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Extract structured invoice data."},
            {"role": "user", "content": raw_text},
        ],
        response_format=Invoice,
    )

    message = completion.choices[0].message
    if message.refusal:
        raise RuntimeError(f"Model refused: {message.refusal}")
    return message.parsed

Same shape, same guarantees. Pick the SDK that matches your runtime, not the schema library - the wire format is identical, and so is the adherence guarantee. One Python-specific quirk: the parse helper lives under client.beta.chat.completions.parse rather than the top-level chat.completions namespace. The equivalent helper has been out of beta in the JS SDK for over a year, but it still sits under beta in Python at the time of writing. Watch the SDK changelog if you care about that import path settling.

Schema design that actually works

The hardest part of structured outputs is not the API - it is the schema. Strict mode rewards designs the model can satisfy and punishes designs that require it to invent values. After enough extraction pipelines I have a short list of patterns that hold up in production and a longer list of anti-patterns that bite.

Enums beat free text

Any field with a closed set of valid values should be an enum, not a string. Free-text fields are where you get model creativity you do not want - "Net 30", "net30", "Net 30 days", "NET-30" all show up. An enum collapses that variance to one canonical value and lets your downstream code do exact-match dispatch.

// Bad
priority: z.string(),

// Good
priority: z.enum(["low", "medium", "high", "urgent"]),

// Better - with description to disambiguate
priority: z
  .enum(["low", "medium", "high", "urgent"])
  .describe("urgent means immediate human attention required"),

Optional vs nullable

Strict mode requires every property in your schema to be present in the response - there is no optional in JSON Schema strict mode the way Zod thinks about it. The correct pattern is .nullable(): the field is always present, but its value can be null when the model has no value to put there. This is a frequent source of confusion when migrating.

// Bad - strict mode rejects this
taxId: z.string().optional(),

// Good - always present, may be null
taxId: z.string().nullable(),
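To make the difference concrete, here is a rough sketch of how the two variants look at the JSON Schema level, written as plain Python dicts. The dict names are illustrative, and the exact schema the Zod helper emits may differ in detail (for example, it may express the null case with anyOf).

```python
import json

# Nullable, strict-friendly: the key is always required and "null" is
# added to the allowed types.
vendor_nullable = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "taxId": {"type": ["string", "null"]},
    },
    "required": ["name", "taxId"],
    "additionalProperties": False,
}

# Optional in the usual JSON Schema sense: taxId is simply left out of
# "required". This is the shape strict mode does not accept.
vendor_optional = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "taxId": {"type": "string"},
    },
    "required": ["name"],
    "additionalProperties": False,
}

# A response under the nullable schema always carries the key,
# even when there is no value to report.
response = {"name": "Acme", "taxId": None}
print(json.dumps(response))  # {"name": "Acme", "taxId": null}
```

Downstream code can then rely on the key existing and only needs to handle the null case.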

Discriminated unions for polymorphism

If your response can take one of several shapes, use a discriminated union with a literal tag field. The model picks the tag, and the schema forces the rest of the object to match. This pattern is what powers agent routing in production - the model classifies the request and produces a payload typed for that specific branch.

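// Strict mode requires additionalProperties: false on every object, so an
// open-ended map (z.record) is not allowed. Model tool args as key/value pairs.
const ToolArg = z.object({ key: z.string(), value: z.string() });

const Decision = z.discriminatedUnion("kind", [
  z.object({
    kind: z.literal("answer"),
    text: z.string(),
    confidence: z.number().min(0).max(1),
  }),
  z.object({
    kind: z.literal("escalate"),
    reason: z.enum(["low_confidence", "policy", "missing_data"]),
    notes: z.string(),
  }),
  z.object({
    kind: z.literal("tool_call"),
    tool: z.enum(["lookup_order", "refund", "create_ticket"]),
    args: z.array(ToolArg),
  }),
]);

// The root schema must be an object, not a union, so wrap the union in one.
const RoutingDecision = z.object({ decision: Decision });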

Bound arrays and strings

Always cap arrays with .max(N) and strings with .max(N) where N is a sensible upper bound. Without caps the model occasionally over-extracts - turning a five-line invoice into eighty fabricated line items because the prompt wording let it. Caps are cheap, they cost zero tokens, and they fail loudly when your prompt is ambiguous.

Descriptions are instructions

The .describe() on a Zod field (or Field(description="...") in Pydantic) is fed to the model as part of the schema. Treat each one as a tiny prompt. Use them for format hints ("ISO 8601 date"), for disambiguation ("in cents, not dollars"), and for closed-domain instructions that would otherwise belong in the system prompt. Avoid narrative prose - short and imperative reads better.

Refusals - what they are and how to handle them

A refusal is the model declining to produce a structured response because the request appears to violate safety policy. Instead of the parsed object, you get a populated refusal field on the message with a short explanation. Your code must check for it on every call - pretending refusals do not happen produces silent NULL bugs in downstream tables.

const completion = await openai.chat.completions.parse({
  model: "gpt-4o-2024-08-06",
  messages,
  response_format: zodResponseFormat(Schema, "result"),
});

const message = completion.choices[0].message;

if (message.refusal) {
  // Log it. Refusals are signal, not noise.
  await logRefusal({
    userId,
    refusal: message.refusal,
    inputHash: hashInput(messages),
  });

  // Fall back to a degraded but safe path.
  return {
    status: "refused",
    reason: message.refusal,
    fallback: await safeFallbackPath(messages),
  };
}

return { status: "ok", data: message.parsed };

In two years of production usage, refusals are rare on normal business workloads (well under 0.1% of calls in extraction pipelines) but common in adjacent territories - anything touching personal data, legal text, or medical content. Log them all, review them weekly, and either reshape the prompt to remove the trigger or accept refusal as a valid outcome. Burying refusals in a try/catch is how you ship the bug where 4% of your invoices silently fail to extract.
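The same check carries over to Python. The sketch below assumes a message object with refusal and parsed attributes, as in the Pydantic example earlier; ExtractionResult and result_from_message are hypothetical names, and the "empty" status is an extra guard for the case where neither field is set.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class ExtractionResult:
    status: str                    # "ok", "refused", or "empty"
    data: Optional[Any] = None
    reason: Optional[str] = None


def result_from_message(message: Any) -> ExtractionResult:
    """Turn a parsed chat message into an explicit result, never a silent None."""
    refusal = getattr(message, "refusal", None)
    if refusal:
        # Refusals are a first-class outcome: surface the reason to the caller.
        return ExtractionResult(status="refused", reason=refusal)
    parsed = getattr(message, "parsed", None)
    if parsed is None:
        return ExtractionResult(status="empty", reason="no parsed payload")
    return ExtractionResult(status="ok", data=parsed)


# Stand-ins for SDK message objects, for illustration only.
ok = result_from_message(SimpleNamespace(refusal=None, parsed={"total": 100}))
refused = result_from_message(SimpleNamespace(refusal="I can't help with that.", parsed=None))
print(ok.status, refused.status)  # ok refused
```

Returning an explicit status rather than None is what keeps refusals out of downstream tables as silent nulls.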

Streaming structured outputs

Streaming structured outputs returns a sequence of partial JSON snapshots that progressively conform to the schema. Each delta either extends an existing field or starts a new one - your client gets to render incrementally instead of waiting for the full object. The Vercel AI SDK wraps this in a clean React hook.

// app/api/extract/route.ts
import { streamObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { Invoice } from "@/lib/schemas";

export async function POST(req: Request) {
  const { rawText } = await req.json();

  const result = streamObject({
    model: openai("gpt-4o-2024-08-06"),
    schema: Invoice,
    prompt: `Extract invoice data from:\n\n${rawText}`,
  });

  return result.toTextStreamResponse();
}

// app/extract/page.tsx - client
"use client";
import { experimental_useObject as useObject } from "@ai-sdk/react";
import { Invoice } from "@/lib/schemas";

export default function ExtractPage() {
  const { object, submit, isLoading } = useObject({
    api: "/api/extract",
    schema: Invoice,
  });

  return (
    <div>
      <button onClick={() => submit({ rawText: "..." })}>Extract</button>
      {object?.invoiceNumber && <p>Invoice: {object.invoiceNumber}</p>}
      {object?.lineItems?.map((item, i) => (
        <div key={i}>{item?.description} - {item?.quantity}</div>
      ))}
    </div>
  );
}

Note the optional chaining on every field inside the render - during streaming, any field can be undefined for a few hundred milliseconds before it lands. The schema is the same as for the non-streaming path; only the consumer changes. For deeper coverage of the streaming primitives, see my Next.js OpenAI streaming tutorial.
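The same defensive pattern applies to any consumer of partial snapshots, not only React. A minimal sketch in Python, assuming each snapshot arrives as a plain dict using the Invoice field names; render_partial is a hypothetical helper, not part of any SDK.

```python
def render_partial(snapshot: dict) -> list[str]:
    """Render whatever has arrived so far, tolerating missing fields."""
    lines = []
    number = snapshot.get("invoiceNumber")
    if number:
        lines.append(f"Invoice: {number}")
    # lineItems may be absent, and individual items may be half-filled.
    for item in snapshot.get("lineItems") or []:
        item = item or {}
        description = item.get("description", "...")
        quantity = item.get("quantity", "?")
        lines.append(f"{description} - {quantity}")
    return lines


print(render_partial({}))  # []
print(render_partial({"invoiceNumber": "INV-1", "lineItems": [{"description": "Widget"}]}))
# ['Invoice: INV-1', 'Widget - ?']
```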

Common mistakes

Patterns that look fine in a notebook and detonate in production:

Over-nesting. Schemas deeper than 3-4 levels are fragile. The model misplaces values across levels, and deep schemas run into OpenAI's nesting cap (5 levels). Flatten when you can.
oneOf with many branches. Three discriminated union branches is fine, ten is not. The model gets confused about which branch fits and produces low-confidence picks. If you have ten, you actually have a classification task - do it in two steps.
Recursive schemas. Self-referential schemas ("a comment has replies which have replies") technically work via $ref but are fragile in practice, especially with deep recursion. Flatten to a list with parent IDs.
Descriptions as system prompts. Field descriptions are seen but not strongly obeyed. A paragraph of instructions in a description does not replace clear messaging in the system prompt.
Forgetting strict on nested objects. Strict mode must apply to every nested object, otherwise the strict guarantee quietly breaks. The Zod and Pydantic helpers handle this for you; hand-rolled JSON Schema does not.
Confusing optional with nullable. Covered above - every field must be present, use .nullable() when the value is genuinely absent.
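For hand-rolled JSON Schema, two of the mistakes above (strict rules missing on nested objects, and properties left out of required) can be caught before a request is ever sent. The sketch below is a hypothetical lint helper, not OpenAI's validator: it checks only those two rules and does not follow $ref.

```python
def find_strict_violations(schema: dict, path: str = "$") -> list[str]:
    """Walk a JSON Schema dict and report objects that break the two rules."""
    problems: list[str] = []
    schema_type = schema.get("type")
    is_object = (
        schema_type == "object"
        or (isinstance(schema_type, list) and "object" in schema_type)
        or "properties" in schema
    )
    if is_object:
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {', '.join(missing)}")
        for name, sub in props.items():
            problems.extend(find_strict_violations(sub, f"{path}.{name}"))
    items = schema.get("items")
    if isinstance(items, dict):
        problems.extend(find_strict_violations(items, f"{path}[]"))
    for index, branch in enumerate(schema.get("anyOf", [])):
        problems.extend(find_strict_violations(branch, f"{path}|{index}"))
    return problems


invoice_schema = {
    "type": "object",
    "properties": {
        "invoiceNumber": {"type": "string"},
        "vendor": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "taxId": {"type": ["string", "null"]},
            },
            "required": ["name"],  # taxId is missing, and so is additionalProperties
        },
    },
    "required": ["invoiceNumber", "vendor"],
    "additionalProperties": False,
}

for problem in find_strict_violations(invoice_schema):
    print(problem)
# $.vendor: additionalProperties must be false
# $.vendor: not listed in required: taxId
```

Running a check like this in CI is cheap insurance when the schema is maintained by hand rather than generated from Zod or Pydantic.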

Migrating from JSON mode to structured outputs

The migration is small in code and large in reliability. The before-and-after for a typical classification call:

// Before - JSON mode
const completion = await openai.chat.completions.create({
  model: "gpt-4o",
  response_format: { type: "json_object" },
  messages: [
    {
      role: "system",
      content: `Classify the ticket. Reply ONLY with JSON like:
{"category": "billing|technical|account", "urgency": "low|medium|high"}`,
    },
    { role: "user", content: ticketText },
  ],
});
const raw = completion.choices[0].message.content!;
const parsed = JSON.parse(raw); // throws ~5% of the time
const validated = schema.parse(parsed); // throws another ~15%

// After - structured outputs
const Ticket = z.object({
  category: z.enum(["billing", "technical", "account"]),
  urgency: z.enum(["low", "medium", "high"]),
});

const completion = await openai.chat.completions.parse({
  model: "gpt-4o-2024-08-06",
  response_format: zodResponseFormat(Ticket, "ticket"),
  messages: [
    { role: "system", content: "Classify the support ticket." },
    { role: "user", content: ticketText },
  ],
});
const parsed = completion.choices[0].message.parsed; // typed; null if the model refused

Three things drop out of the system prompt: the JSON shape instructions, the "reply only with JSON" nag, and the example payload. The schema carries all of that intrinsically. My rule when migrating: cut the system prompt down to the actual task instruction in plain English, and let the schema express the shape.
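If you are not using the Zod or Pydantic helpers, the same migration can be done by writing the response_format by hand as a json_schema payload. Below is a sketch of roughly what the Ticket schema looks like in that form, shown as a Python dict; the JSON the helpers generate may differ in detail.

```python
import json

# Hand-written equivalent of the Ticket schema, passed as response_format.
ticket_response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["billing", "technical", "account"],
                },
                "urgency": {
                    "type": "string",
                    "enum": ["low", "medium", "high"],
                },
            },
            # Strict mode: every property is required, no extras allowed.
            "required": ["category", "urgency"],
            "additionalProperties": False,
        },
    },
}

print(json.dumps(ticket_response_format, indent=2))
```

Note that the enum values that used to live in the system prompt ("billing|technical|account") now live in the schema itself.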

5 real production use cases

1. Document extraction (PDFs to JSON)

The textbook use case. Layout-aware OCR extracts text; structured outputs map that text into a typed schema. Combined with a real document extraction pipeline you get end-to-end PDF-to-database in two model calls. The schema is the contract between OCR and the rest of your system.

2. Ticket classification

Enum-heavy schemas with confidence scores route inbound support tickets to queues. The classification is fast, the schema is small, and the discriminated-union pattern lets you add per-category fields (e.g. billing tickets carry an invoice number, technical tickets carry an environment).

3. Agent dispatch / routing

Discriminated unions excel at routing inside an agent. The model decides on the next action - answer, escalate, call a tool, ask for clarification - and produces a payload typed for that branch. See tool calling best practices and agentic RAG architecture for the broader pattern.

4. Form generation

Give the model a free-text user description and a form schema. It produces a filled-in form payload that you render directly in your UI - pre-filled invoices, contracts, applications. The schema doubles as documentation for the form builders downstream.

5. LLM-as-judge for evals

Judge schemas have a score per metric, a verdict, and a structured rationale field. Strict outputs make the judge a dependable part of your test suite: the scores themselves can still vary between runs, but every CI run produces parseable scores you can threshold on, with no flaky JSON parsing in the middle. The trick is keeping the schema small: one score per axis, one short rationale string, no nested objects. The narrower the judge schema, the more consistent the scores across runs. I keep judge schemas to under ten properties total and tune the rubric in the system prompt, not in layers of nested structure.
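As an illustration of thresholding on judge output, here is a minimal sketch. The JudgeResult fields, the 1-5 scale, and the threshold values are all assumptions made for the example; the point is only that a flat, parsed result makes the CI check a few lines long.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    accuracy: int        # assumed 1-5 scale
    completeness: int    # assumed 1-5 scale
    verdict: str         # "pass" or "fail"
    rationale: str


def failing_axes(result: JudgeResult, thresholds: dict[str, int]) -> list[str]:
    """Return the names of the axes whose score is below its minimum."""
    return [
        axis
        for axis, minimum in thresholds.items()
        if getattr(result, axis) < minimum
    ]


thresholds = {"accuracy": 4, "completeness": 3}
result = JudgeResult(accuracy=3, completeness=4, verdict="fail", rationale="Wrong total.")
print(failing_axes(result, thresholds))  # ['accuracy']
```

In a test suite, an empty list means the run passes; anything else names the axis to investigate.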

Performance and cost

Two things to know. First, first-call latency on a brand-new schema is higher - OpenAI compiles your schema into a constrained grammar, which takes 200-400ms one time. Subsequent calls with the same schema hit the cache and pay no penalty. Keep schemas stable in production so the cache stays warm; rotating schemas every deploy means every deploy eats the warm-up cost.

Second, token cost is identical to non-structured calls. Structured outputs do not add a markup. What can creep in is schema verbosity sneaking into the system prompt token count - every property, description, and enum value is fed to the model. Trim descriptions to what is necessary, and watch your input token totals before and after migration. For a deeper cost picture, my OpenAI API cost breakdown covers the per-feature math.
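On keeping schemas stable: one way to notice accidental schema churn between deploys is to fingerprint the schema locally and log or compare the value. This is a sketch under assumptions; schema_fingerprint is a hypothetical helper, and it says nothing about how OpenAI keys its own cache.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Hash a canonical JSON encoding so key order does not affect the result."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


a = {"type": "object", "properties": {"urgency": {"type": "string"}}}
b = {"properties": {"urgency": {"type": "string"}}, "type": "object"}  # same, reordered
c = {"type": "object", "properties": {"urgency": {"type": "string", "description": "How urgent"}}}

print(schema_fingerprint(a) == schema_fingerprint(b))  # True
print(schema_fingerprint(a) == schema_fingerprint(c))  # False
```

The second comparison is a reminder that editing a description changes the schema too, which also changes the input tokens you pay for.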

Error rates collapse on the migration. The retry loop that surrounded every JSON-mode call goes away. In one client pipeline that processed 50K documents per day, the post-migration parse-failure rate went from 4.2% to 0.0% - and the refusal rate (the new tail) stayed under 0.05%.
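In absolute terms, those percentages work out as follows (simple arithmetic on the figures above, treating 0.05% as an upper bound):

```python
docs_per_day = 50_000
parse_failure_rate_before = 0.042   # 4.2% before migration
refusal_rate_after_max = 0.0005     # under 0.05% after migration

failures_before = round(docs_per_day * parse_failure_rate_before)
refusals_after_max = round(docs_per_day * refusal_rate_after_max)

print(failures_before)     # 2100 failed parses per day before
print(refusals_after_max)  # at most 25 refusals per day after
```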

Edge cases: large schemas, recursion, polymorphism

Three places where structured outputs stops being trivial:

Very large schemas. OpenAI caps total properties at 100 across the whole schema and nesting depth at 5. If you bump into those limits, split the extraction into two calls: a router call that picks what to extract, then a follow-up extraction with the narrower schema. Splitting also tends to lift quality because the model is not distracted by fields irrelevant to the input.
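To see how close a schema is to those caps, you can count properties and nesting locally. The sketch below is a hypothetical helper with its own counting rule (each object adds one level, arrays do not, $ref is not followed), which is an assumption and may not match how OpenAI counts.

```python
def schema_stats(schema: dict, depth: int = 0) -> tuple[int, int]:
    """Return (total object properties, deepest object nesting level)."""
    total, deepest = 0, depth
    props = schema.get("properties")
    if isinstance(props, dict):
        level = depth + 1           # this object adds one nesting level
        total += len(props)
        deepest = level
        for sub in props.values():
            sub_total, sub_deepest = schema_stats(sub, level)
            total += sub_total
            deepest = max(deepest, sub_deepest)
    items = schema.get("items")
    if isinstance(items, dict):
        # Assumption: the array itself does not count as a level.
        sub_total, sub_deepest = schema_stats(items, depth)
        total += sub_total
        deepest = max(deepest, sub_deepest)
    return total, deepest


invoice_schema = {
    "type": "object",
    "properties": {
        "invoiceNumber": {"type": "string"},
        "vendor": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "taxId": {"type": ["string", "null"]}},
        },
        "lineItems": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"description": {"type": "string"}, "quantity": {"type": "integer"}},
            },
        },
    },
}

print(schema_stats(invoice_schema))  # (7, 2)
```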

Recursive types. A recursive schema (tree of comments, nested folders, file system) is technically possible with Zod's z.lazy plus a $ref but practically fragile. The cleaner pattern is to flatten: return a list of nodes, each with an id and a nullable parentId. Rebuild the tree client-side. Smaller schema, better quality, simpler validation.
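Rebuilding the tree from the flat list is a few lines of code. A minimal sketch, assuming each node is a dict with id, parentId, and text keys (the field names are illustrative) and that the parent links contain no cycles:

```python
from collections import defaultdict


def build_tree(nodes: list[dict]) -> list[dict]:
    """Turn a flat list of nodes with parentId links into nested replies."""
    children = defaultdict(list)
    for node in nodes:
        children[node["parentId"]].append(node)

    def attach(node: dict) -> dict:
        return {
            "id": node["id"],
            "text": node["text"],
            "replies": [attach(child) for child in children.get(node["id"], [])],
        }

    # Root nodes are the ones whose parentId is null.
    return [attach(root) for root in children.get(None, [])]


flat = [
    {"id": "c1", "parentId": None, "text": "First"},
    {"id": "c2", "parentId": "c1", "text": "Reply"},
    {"id": "c3", "parentId": "c2", "text": "Nested reply"},
    {"id": "c4", "parentId": None, "text": "Second"},
]

tree = build_tree(flat)
print(len(tree))                                  # 2
print(tree[0]["replies"][0]["replies"][0]["id"])  # c3
```

The model only ever sees the flat schema; the nesting is purely a client-side concern.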

Polymorphism with many branches. Up to about five discriminated-union branches the model handles well. Past that, split into a two-step classification: first call picks the branch, second call extracts the branch-specific payload. The token cost is similar, the quality is markedly better, and each step is independently evaluable.

Comparison: Anthropic tool use vs OpenAI structured outputs

Both major labs ship a way to get structured data out of their models, but the shape is different. OpenAI's structured outputs is a property of the response itself; Anthropic's approach is to route structured data through their tool use feature - you define a tool whose input is your schema and force the model to call it.

Feature	OpenAI structured outputs	Anthropic tool use
Strict schema adherence	Yes (constrained sampling)	Strong but not guaranteed
Response shape	Native parsed field on message	Tool call input on a forced tool
SDK ergonomics	Zod / Pydantic helpers built in	Manual JSON Schema, parse yourself
Streaming	Partial JSON snapshots	Delta-based tool input streaming
Refusal handling	Separate refusal field	Refusal as text content

For a project that needs both providers - common when you want failover or to A/B test models - wrap both behind a single typed interface that returns your Zod or Pydantic class regardless of the underlying call shape. My Claude vs ChatGPT comparison covers the broader provider tradeoff.
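A sketch of what that single typed interface could look like. Everything here is hypothetical: the class names are invented, and the two provider classes return canned data in place of real API calls, so that only the shape of the abstraction is shown.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Ticket:
    category: str
    urgency: str


class TicketClassifier(Protocol):
    def classify(self, text: str) -> Ticket: ...


class StubStructuredOutputClassifier:
    """Stand-in for a provider that returns a parsed object on the message."""

    def classify(self, text: str) -> Ticket:
        parsed = {"category": "billing", "urgency": "high"}  # canned response
        return Ticket(**parsed)


class StubToolUseClassifier:
    """Stand-in for a provider that returns the payload as forced tool input."""

    def classify(self, text: str) -> Ticket:
        tool_input = {"category": "billing", "urgency": "high"}  # canned response
        return Ticket(category=tool_input["category"], urgency=tool_input["urgency"])


class AlwaysFails:
    """Simulates an outage on the primary provider."""

    def classify(self, text: str) -> Ticket:
        raise RuntimeError("provider unavailable")


def classify_with_failover(
    text: str, primary: TicketClassifier, fallback: TicketClassifier
) -> Ticket:
    """Callers get a Ticket either way and never see the provider's call shape."""
    try:
        return primary.classify(text)
    except Exception:
        return fallback.classify(text)


print(classify_with_failover("I was charged twice", AlwaysFails(), StubToolUseClassifier()))
# Ticket(category='billing', urgency='high')
```

Because both implementations return the same dataclass, swapping providers or running an A/B test does not touch downstream code.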

Where this fits in a real product

Structured outputs is rarely the whole feature. In OmniAPI, every function spec generated for a user passes through a structured outputs call that validates the shape against the target API schema - no malformed payloads ever leave the model. In Xandidate, candidate rubric scoring is a structured outputs call that returns per-criterion scores plus a rationale, so the recruiter sees the same shape every time and the audit log stays clean. Both ship because the schema is the contract - not the prose around it.

If you are starting a new AI feature in 2026, structured outputs is the default for any model response your app consumes programmatically. JSON mode is legacy. Free-text-and-pray belongs only in human-facing chat. If you want help wiring this into a real pipeline, my AI integration work covers exactly this scope, and you can also hire an AI developer in Kosovo directly.

Frequently asked questions

What are OpenAI structured outputs?

Structured outputs is a feature on the OpenAI API that constrains the model to produce JSON that matches a JSON Schema you supply, with 100% adherence rather than the ~80% you get from JSON mode. It is enforced at the decoding layer via constrained sampling, so the model literally cannot emit a token that would break the schema. You pass a schema (or a Zod / Pydantic class), and the response is guaranteed to parse.

Structured outputs vs JSON mode vs function calling - what is the difference?

JSON mode forces valid JSON but not a specific shape, so you still get schema drift. Function calling lets the model decide whether to call a tool, with structured arguments. Structured outputs forces a specific schema on the response itself, with strict mode guaranteeing 100% adherence. In practice: use structured outputs for extraction and classification, function calling for tool use, JSON mode almost never in 2026.

What is a refusal in structured outputs?

When the model decides a request violates safety policy, it returns a refusal string in a separate refusal field instead of the structured response. Your code must check for message.refusal before trying to parse the structured payload. Treat refusals as a first-class error path, log them, and have a fallback (degraded response, escalation to a human, or a different prompt).

Can I stream structured outputs?

Yes. The OpenAI SDK and the Vercel AI SDK both support streaming partial JSON that progressively conforms to the schema. In Vercel AI SDK this is useObject on the client and streamObject on the server. The partial object updates field-by-field, which is great for UI but means downstream consumers must handle incomplete shapes gracefully.

Do structured outputs support recursive schemas?

Recursive schemas are technically supported via $ref but practically fragile, especially with deep recursion. OpenAI enforces a maximum nesting depth (5 levels) and a maximum total properties count. For tree-shaped data, prefer flattening to a list with parent IDs rather than nested children. It is more reliable, easier to validate, and easier to render.

Is there a cost or latency penalty?

First-call latency on a new schema is noticeably higher (200-400ms) because OpenAI compiles the schema into a constrained grammar. Subsequent calls with the same schema are cached and free of that penalty. Token costs are unchanged. For high-throughput endpoints, keep your schemas stable so the cache stays warm.

How are structured outputs different from Anthropic tool use?

Anthropic does not have a direct equivalent. Their tool use enforces the input schema for tool calls (similar to function calling) but the assistant's text reply itself is not schema-constrained. To get strict structured output from Claude, you wrap your schema as a tool and force tool use. It works but is less ergonomic. OpenAI's structured outputs is the cleanest way to get strict JSON from a major model today.

Should I migrate from JSON mode to structured outputs?

Yes, almost always. The migration is a few lines (swap response_format from json_object to a zodResponseFormat or json_schema), and the reliability lift is significant - JSON mode breaks ~20% of the time on complex schemas in my benchmarks, structured outputs breaks 0%. Keep your existing validation as a defense-in-depth layer, but the parse failures essentially go away.