
Stream OpenAI Responses in Next.js 15 (2026 Tutorial)

By Ergini, Software & AI Developer in Pristina, Kosovo

TL;DR

Most streaming tutorials are out of date. This is the 2026 pattern: Server Actions, useChat, structured outputs, and tool calls - in one minimal Next.js 15 app, with the full code.

Most tutorials on streaming OpenAI responses in Next.js were written against Pages Router and an OpenAI SDK that no longer exists. The copy-paste examples you find on the first page of Google still mention openai-edge, OpenAIStream, and the old Vercel AI SDK 2.x StreamingTextResponse helper. All of it is dead. The 2026 way is shorter, typed, cancellable, and works the same in a Server Action or a Route Handler. This tutorial walks through the whole thing in one minimal Next.js 15 app, with every piece of code you need to ship.

Why most tutorials are out of date

The Vercel AI SDK shipped v3 in early 2024, v4 mid-2024, v5 in 2025, and v6 in early 2026. Each major version rewrote the streaming primitives. The old OpenAIStream helper that wrapped a raw OpenAI response and emitted plain text is gone. The new world is streamText, streamObject, and a typed data-stream protocol that carries text, tool calls, tool results, annotations, and errors as discrete parts. If you copy a 2023 tutorial into a Next.js 15 app today, it will not even compile - and if you patch it just enough to compile, you will rebuild bad versions of features the SDK already ships.

Next.js itself changed too. Server Actions went from experimental to the default form-handling primitive. The App Router's Route Handlers now stream cleanly under Fluid Compute. The whole runtime story shifted: Edge is no longer the default answer for low-latency AI endpoints because Node + Fluid Compute now offers comparable cold-start behavior with full Node compatibility. The result is a 2026 stack that is simpler, faster, and cheaper than what most blog posts describe - if you know which pieces to use.

The 2026 way

Three pieces. Next.js 15 App Router for the framework, Vercel AI SDK 6 for the streaming primitives, and your provider of choice for the model. We will use OpenAI as the primary and configure Anthropic as a fallback to show the multi-provider pattern. The AI SDK abstracts the provider so swapping models is a one-line change - useful for cost control, useful when a provider has an outage, and useful for A/B testing model quality on real traffic.
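The fallback idea can be sketched without any SDK at all. The helper below is a hypothetical illustration, not an AI SDK API: withFallback and the Attempt type are names invented here, and each run function stands in for a provider call. It tries providers in order and reports which one answered. Note the limit of the pattern for streaming: you can only fall back before the first token has been sent to the client.

```typescript
// Hypothetical helper (not part of the AI SDK): try each provider in order.
type Attempt<T> = { name: string; run: () => Promise<T> };

async function withFallback<T>(
  attempts: Attempt<T>[]
): Promise<{ provider: string; value: T }> {
  const failures: string[] = [];
  for (const attempt of attempts) {
    try {
      // First provider that resolves wins.
      return { provider: attempt.name, value: await attempt.run() };
    } catch (err) {
      // Remember why this provider failed, then move on to the next one.
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${attempt.name}: ${reason}`);
    }
  }
  throw new Error(`All providers failed (${failures.join("; ")})`);
}
```

In a real route, each run function would wrap a non-streaming call to one provider, with the primary listed first.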

On the server we will use streamText from the AI SDK, return its response with toUIMessageStreamResponse, and expose it through both a Route Handler and a Server Action so you can see both patterns. On the client we will use the useChat hook for free-form chat and useObject for streaming structured outputs. Both hooks handle message state, deltas, abort signals, and error rendering. Tool calls show up in the same stream and render through the message-parts API without any extra plumbing.

Setup

Start from a fresh Next.js 15 app. The streaming features need nothing exotic - just the AI SDK, the OpenAI provider, and Zod for schemas. Add the Anthropic provider too if you want the fallback path.

npx create-next-app@latest my-streaming-app --typescript --app --tailwind
cd my-streaming-app
npm install ai @ai-sdk/openai @ai-sdk/anthropic @ai-sdk/react zod

Your package.json dependencies block should look like this. The ai package is the framework-agnostic core, @ai-sdk/react ships the React hooks, and the provider packages are thin wrappers that translate the SDK's unified message format into each vendor's API.

{
  "dependencies": {
    "next": "^15.0.0",
    "react": "^19.0.0",
    "react-dom": "^19.0.0",
    "ai": "^6.0.0",
    "@ai-sdk/openai": "^3.0.0",
    "@ai-sdk/anthropic": "^3.0.0",
    "@ai-sdk/react": "^3.0.0",
    "zod": "^3.24.0"
  }
}

Set up your environment variables. The provider SDKs read these by convention - no manual wiring needed. If you are running locally, put them in .env.local; on Vercel, add them through the project dashboard or use the AI Gateway, which proxies all providers through one base URL and gives you unified billing.

# .env.local
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

Basic text streaming with streamText

The smallest useful streaming endpoint is a POST that takes a prompt and streams back the model's response. In Next.js 15 with the AI SDK that is 20 lines of code, and it works the same in development and on Vercel production.

// app/api/chat/route.ts
import { openai } from "@ai-sdk/openai";
import { streamText, convertToModelMessages } from "ai";

export const runtime = "nodejs";
export const maxDuration = 300;

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai("gpt-5"),
    system: "You are a concise senior engineer. Answer in under 200 words.",
    // UI messages from useChat -> model messages. The await is harmless if
    // your SDK version returns the array synchronously.
    messages: await convertToModelMessages(messages),
    // Stop the provider call when the client cancels or disconnects
    abortSignal: req.signal,
  });

  return result.toUIMessageStreamResponse();
}

Four things to notice. The runtime export pins this to Node - required for the AI SDK's richer features and the right default in 2026. The maxDuration raises the function timeout to five minutes because long generations on big models can run well past the default 10 seconds. Passing req.signal as abortSignal lets a client-side cancel reach the provider. And toUIMessageStreamResponse (the replacement for the older toDataStreamResponse) emits the SDK's structured stream protocol - text deltas, tool calls, tool results, errors, and finish signals as discrete typed parts. The client hooks decode this automatically.

Server Actions vs Route Handlers

Both can stream. Pick by use case. Route Handlers give you a stable URL the browser hits with a normal POST. They are the right choice for chat-style UIs, public APIs, and anything you want to call from a non-React client. Server Actions are bound to a React tree - invoke them from a form or a component, get back a typed return value, and React handles the network call invisibly. They are the right choice when streaming is part of a form submission, when you want to read cookies and session state in the same call, or when the response will drive Server Components rendered on the same request.

Here is the same generation as a Server Action. Notice the return type - the AI SDK ships createStreamableValue and createUIMessageStream helpers that wrap the stream in a serializable value React can render with readStreamableValue on the client.

// app/actions.ts
"use server";

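import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
// The RSC helpers live in their own package since AI SDK 5: npm install @ai-sdk/rsc
import { createStreamableValue } from "@ai-sdk/rsc";

export async function generate(prompt: string) {
  const stream = createStreamableValue("");

  // Run the generation in the background so the action can return right away
  (async () => {
    try {
      const result = streamText({
        model: openai("gpt-5-mini"),
        prompt,
      });

      for await (const delta of result.textStream) {
        stream.update(delta);
      }

      stream.done();
    } catch (err) {
      // Surface failures to the client instead of leaving the stream open
      stream.error(err);
    }
  })();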

  return { output: stream.value };
}

My rule: chat surfaces use Route Handlers, one-shot form-driven generations use Server Actions. The form pattern is especially clean - submit a form, the action streams, and the result fills a Suspense boundary as it arrives. No client state to manage.

Client UI with useChat

The useChat hook is the reason most teams pick the AI SDK over rolling their own. Thirty lines of React get you a chat UI with message history, optimistic updates, streaming deltas, abort, error handling, and tool-call rendering. Point it at the Route Handler from earlier and the rest is plumbing.

// app/chat/page.tsx
"use client";

import { useState } from "react";
import { useChat } from "@ai-sdk/react";

export default function ChatPage() {
  // Since AI SDK 5 the hook no longer owns the input field, so keep it in local state
  const [input, setInput] = useState("");
  // useChat posts to /api/chat by default
  const { messages, sendMessage, status, stop, error } = useChat();
  const busy = status === "submitted" || status === "streaming";

  return (
    <div className="mx-auto max-w-2xl space-y-4 p-6">
      {messages.map((m) => (
        <div key={m.id} className={m.role === "user" ? "text-right" : ""}>
          <div className="text-xs text-gray-500">{m.role}</div>
          <div className="whitespace-pre-wrap">
            {m.parts.map((part, i) =>
              part.type === "text" ? <span key={i}>{part.text}</span> : null
            )}
          </div>
        </div>
      ))}

      {error && <div className="text-red-600">Error: {error.message}</div>}

      <form
        onSubmit={(e) => {
          e.preventDefault();
          if (!input.trim() || busy) return;
          sendMessage({ text: input });
          setInput("");
        }}
        className="flex gap-2"
      >
        <input
          value={input}
          onChange={(e) => setInput(e.target.value)}
          className="flex-1 rounded border px-3 py-2"
          placeholder="Ask anything"
          disabled={busy}
        />
        {busy ? (
          <button type="button" onClick={stop} className="rounded bg-gray-200 px-4">
            Stop
          </button>
        ) : (
          <button type="submit" className="rounded bg-black px-4 text-white">
            Send
          </button>
        )}
      </form>
    </div>
  );
}

The hook exposes status as a state machine - ready, submitted, streaming, error - which makes UI states trivial. The messages array contains structured parts, not raw strings, so tool calls, images, and reasoning blocks render in the same loop without special casing.
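One way to keep those UI states honest is to derive every control flag from the status in a single place. The sketch below is an illustration only: ChatControls and controlsFor are names made up for this example, and the four status values are the ones listed above.

```typescript
type ChatStatus = "ready" | "submitted" | "streaming" | "error";

// Hypothetical shape: the flags a chat form typically needs.
interface ChatControls {
  canSend: boolean;
  canStop: boolean;
  showSpinner: boolean;
  showRetry: boolean;
}

function controlsFor(status: ChatStatus): ChatControls {
  switch (status) {
    case "ready":
      return { canSend: true, canStop: false, showSpinner: false, showRetry: false };
    case "submitted":
      // Request is in flight but no tokens have arrived yet.
      return { canSend: false, canStop: true, showSpinner: true, showRetry: false };
    case "streaming":
      return { canSend: false, canStop: true, showSpinner: false, showRetry: false };
    case "error":
      // Let the user try again instead of locking the form.
      return { canSend: true, canStop: false, showSpinner: false, showRetry: true };
  }
}
```

Because the switch covers every member of the union, TypeScript flags the function if a future status value is added and not handled.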

Streaming structured outputs with useObject

Free-form text is one mode. The other mode every real product needs is structured extraction - give the model a schema, get back a typed object. useObject streams the object as it is generated, so the UI fills in field by field instead of blocking on the full response. I cover the deeper schema design patterns in my OpenAI structured outputs post - this is the streaming counterpart.

// app/api/extract/schema.ts
import { z } from "zod";

// Kept in its own module: a route.ts file may only export HTTP handlers and
// route config, and a client page should not import server route code.
export const recipeSchema = z.object({
  name: z.string(),
  ingredients: z.array(z.object({ item: z.string(), amount: z.string() })),
  steps: z.array(z.string()),
});

// app/api/extract/route.ts
import { openai } from "@ai-sdk/openai";
import { streamObject } from "ai";
import { recipeSchema } from "./schema";

export const runtime = "nodejs";

export async function POST(req: Request) {
  const { prompt } = await req.json();

  const result = streamObject({
    model: openai("gpt-5"),
    schema: recipeSchema,
    prompt: `Generate a recipe for: ${prompt}`,
  });

  return result.toTextStreamResponse();
}

// app/extract/page.tsx
"use client";

import { experimental_useObject as useObject } from "@ai-sdk/react";
import { recipeSchema } from "../api/extract/schema";

export default function ExtractPage() {
  const { object, submit, isLoading } = useObject({
    api: "/api/extract",
    schema: recipeSchema,
  });

  return (
    <div className="mx-auto max-w-2xl p-6">
      <button onClick={() => submit({ prompt: "carbonara" })} disabled={isLoading}>
        Generate
      </button>

      {object?.name && <h2 className="mt-4 text-xl font-bold">{object.name}</h2>}
      {object?.ingredients?.map((i, idx) => (
        <div key={idx}>{i?.amount} {i?.item}</div>
      ))}
      <ol>{object?.steps?.map((s, idx) => <li key={idx}>{s}</li>)}</ol>
    </div>
  );
}

The schema is shared between client and server - same Zod object, same validation, no drift. The streamed object is partial until the final chunk arrives, which is why every field access in the render uses optional chaining. That ergonomic cost buys you a UX that feels instant even on a 30-second generation.
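To make the "partial until the final chunk" point concrete, here is a dependency-free sketch. The DeepPartial type and the summarize function are written for this example and are not the SDK's own types. It shows why each access is guarded: any field, and any array element, may still be missing mid-stream.

```typescript
type Recipe = {
  name: string;
  ingredients: { item: string; amount: string }[];
  steps: string[];
};

// Illustrative helper type: every field and array element may be absent.
type DeepPartial<T> = T extends (infer U)[]
  ? (DeepPartial<U> | undefined)[]
  : T extends object
    ? { [K in keyof T]?: DeepPartial<T[K]> }
    : T;

// Safe to call on any intermediate snapshot of the streamed object.
function summarize(partial: DeepPartial<Recipe> | undefined): string {
  const name = partial?.name ?? "(untitled)";
  // Count only ingredients whose item name has started to arrive.
  const ingredients = partial?.ingredients?.filter((i) => i?.item).length ?? 0;
  const steps = partial?.steps?.length ?? 0;
  return `${name} | ingredients: ${ingredients} | steps: ${steps}`;
}
```

Calling summarize on each snapshot gives a progress line that never throws, whether the object is undefined, half-filled, or complete.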

Tool calling in the same stream

Tools turn a chat model into an agent. Define them with Zod, register them with streamText, and the model will emit tool-call parts that the client renders alongside text. The AI SDK handles the back-and-forth - model calls tool, server runs it, result goes back into the conversation, model continues - without any orchestration code on your side. For the deeper tool design patterns, see the Vercel AI SDK tool calling guide.

// app/api/chat/route.ts
import { openai } from "@ai-sdk/openai";
import { streamText, tool, convertToModelMessages, stepCountIs } from "ai";
import { z } from "zod";

export const runtime = "nodejs";
export const maxDuration = 300;

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai("gpt-5"),
    messages: await convertToModelMessages(messages),
    abortSignal: req.signal,
    stopWhen: stepCountIs(5),
    tools: {
      getWeather: tool({
        description: "Get the current weather for a city.",
        inputSchema: z.object({
          city: z.string().describe("The city name, e.g. Pristina"),
        }),
        execute: async ({ city }) => {
          // Placeholder endpoint: swap in your real weather API
          const r = await fetch(
            `https://api.weather.com/v1/${encodeURIComponent(city)}`
          );
          if (!r.ok) throw new Error(`Weather lookup failed: ${r.status}`);
          const data = await r.json();
          return { city, tempC: data.temp, conditions: data.summary };
        },
      }),
    },
  });

  return result.toUIMessageStreamResponse();
}

The stopWhen: stepCountIs(5) guard limits how many tool-use rounds the model can take before the stream closes - cheap insurance against a model that loops. On the client, each tool call appears as a part in the message with type: 'tool-getWeather', carrying the input the model sent and the result your handler returned. Render them however you want - a card, an inline pill, a debug log.
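As a rough sketch of that rendering step, the function below turns a part into a display string. The Part union is a simplified, hypothetical stand-in for the SDK's message-part types (the real ones carry more fields and more states), and describePart is a name chosen for this example.

```typescript
// Simplified, hypothetical part shapes for illustration only.
type Part =
  | { type: "text"; text: string }
  | {
      type: "tool-getWeather";
      state: "input-available" | "output-available";
      input: { city: string };
      output?: { tempC: number; conditions: string };
    };

function describePart(part: Part): string {
  switch (part.type) {
    case "text":
      return part.text;
    case "tool-getWeather":
      // Show a pending label until the tool result has arrived.
      return part.state === "output-available" && part.output
        ? `${part.input.city}: ${part.output.tempC}°C, ${part.output.conditions}`
        : `Checking the weather in ${part.input.city}...`;
  }
}
```

In a component, the same switch would return JSX (a card or a pill) instead of a string.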

Cancellation

Users cancel. The AI SDK plumbs an AbortSignal from the client through to the provider, provided the route hands req.signal to streamText as abortSignal. The useChat hook exposes a stop() function that aborts the in-flight fetch; the server-side streamText receives the abort and stops calling the provider; the provider stops generating tokens. The total round-trip from button click to billing-meter stopped is usually under 200ms.

// Inside any client component (useChat posts to /api/chat by default)
const { stop, status } = useChat();

return (
  <button onClick={stop} disabled={status !== "streaming"}>
    Cancel generation
  </button>
);

The trap is custom middleware. If you wrap streamText in your own auth or logging helper and forget to pass through the request's signal, abort silently does nothing - the model keeps generating, you keep paying. Always forward req.signal through every layer.
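Here is a minimal sketch of a wrapper that forwards the signal correctly. Everything in it is hypothetical: Handler, withLogging, and fakeModel are invented for the example, and fakeModel stands in for a provider call that honors an AbortSignal.

```typescript
type Handler = (input: string, signal: AbortSignal) => Promise<string>;

// Hypothetical logging wrapper. The line that matters is passing `signal` on.
function withLogging(handler: Handler, log: string[]): Handler {
  return async (input, signal) => {
    log.push(`start ${input}`);
    try {
      return await handler(input, signal); // forward the signal, never drop it
    } finally {
      log.push(signal.aborted ? "aborted" : "done");
    }
  };
}

// Stand-in for a model call that stops as soon as the signal fires.
const fakeModel: Handler = (input, signal) =>
  new Promise<string>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const timer = setTimeout(() => resolve(`echo: ${input}`), 1000);
    signal.addEventListener("abort", () => {
      clearTimeout(timer);
      reject(new Error("aborted"));
    });
  });
```

If withLogging called handler(input, new AbortController().signal) instead, the button would still look like it worked on the client while the generation kept running on the server.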

Error handling

Three error categories matter: provider errors (429, 500, 503), validation errors (the model returned something that did not match the schema), and your own errors (a tool handler threw). Each one has its own handling path in the SDK.

// app/api/chat/route.ts - with typed errors and retries
import { openai } from "@ai-sdk/openai";
import { streamText, convertToModelMessages, APICallError } from "ai";

export const runtime = "nodejs";

// Retry 429 and 5xx with exponential backoff; fail fast on everything else.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      const status = APICallError.isInstance(err) ? err.statusCode : undefined;
      if (status === 429 || (status !== undefined && status >= 500)) {
        await new Promise((r) => setTimeout(r, 2 ** i * 500));
        continue;
      }
      throw err;
    }
  }
  throw lastErr;
}

export async function POST(req: Request) {
  const { messages } = await req.json();

  try {
    // streamText returns immediately and reports provider failures through the
    // stream (see onError below), so withRetry only catches errors thrown while
    // the call is set up. maxRetries is the SDK's own budget for provider calls.
    const result = await withRetry(async () =>
      streamText({
        model: openai("gpt-5"),
        messages: await convertToModelMessages(messages),
        maxRetries: 3,
      })
    );

    return result.toUIMessageStreamResponse({
      onError: (err) => {
        console.error("Stream error", err);
        return err instanceof Error ? err.message : "Unknown error";
      },
    });
  } catch (err) {
    return new Response(JSON.stringify({ error: String(err) }), {
      status: 500,
      headers: { "content-type": "application/json" },
    });
  }
}

Exponential backoff for 5xx and 429, fail fast on other 4xx, and the onError callback on the stream response surfaces mid-stream errors as a first-class error part the client can render without breaking the message list. Retry budgets matter - three attempts with 500ms, 1s, 2s gives you about 3.5 seconds of added latency in the worst case, which is the right trade-off for most chat workloads.
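The retry budget is easy to check with a few lines of arithmetic. This sketch assumes the same policy described above (retry on 429 and 5xx, delays doubling from a 500ms base); isRetryable and backoffDelays are helper names made up for the example.

```typescript
// Which HTTP statuses are worth retrying under the policy described above.
function isRetryable(status: number | undefined): boolean {
  return status === 429 || (status !== undefined && status >= 500);
}

// Delay before each retry: base, 2x base, 4x base, ...
function backoffDelays(attempts: number, baseMs = 500): number[] {
  return Array.from({ length: attempts }, (_, i) => 2 ** i * baseMs);
}

// Worst-case latency added by waiting, in milliseconds.
function worstCaseDelay(attempts: number, baseMs = 500): number {
  return backoffDelays(attempts, baseMs).reduce((sum, d) => sum + d, 0);
}
```

With three attempts the delays are 500, 1000, and 2000 milliseconds, so the worst case adds 3500ms of waiting on top of the failed calls themselves.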

Caching for cost

Token costs add up fast. OpenAI does automatic prefix caching for prompts of 1024 tokens or more - the prefix has to be byte-identical, and a cached prefix typically stays warm for five to ten minutes of inactivity. Put your stable system prompt and few-shot examples at the very start of every request, never interpolate timestamps or per-user data into the prefix, and you get up to 50% off on the cached portion. The deeper economics of this are in my OpenAI API cost breakdown.
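The "stable prefix first, volatile data last" rule looks like this in practice. This is a sketch under assumptions: the Msg type, the prompt constants, and buildMessages are all invented for illustration, and the prompts are far shorter than the 1024-token threshold a real cache hit needs.

```typescript
type Msg = { role: "system" | "user"; content: string };

// Hypothetical stable content: identical bytes on every request.
const SYSTEM_PROMPT = "You are a concise senior engineer.";
const FEW_SHOT =
  "Q: What is a Route Handler?\nA: A server endpoint in the App Router.";

function buildMessages(userName: string, question: string, now: Date): Msg[] {
  return [
    // Cacheable prefix: no timestamps, no per-user data.
    { role: "system", content: `${SYSTEM_PROMPT}\n\n${FEW_SHOT}` },
    // Everything that changes per request goes after the prefix.
    {
      role: "user",
      content: `User: ${userName}\nTime: ${now.toISOString()}\n\n${question}`,
    },
  ];
}
```

Two requests from different users at different times share the same first message byte for byte, which is what makes the prefix eligible for caching.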

For Anthropic-backed routes, the cache is explicit. The AI SDK exposes it through providerOptions on each message. Mark the large stable blocks with cacheControl: { type: 'ephemeral' } and Claude deduplicates the prefix across requests at 90% off the input cost.

// Marking a system message for Anthropic prompt caching
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";

const result = streamText({
  model: anthropic("claude-opus-4-7"),
  messages: [
    {
      role: "system",
      content: longCorpusOfDocs,
      providerOptions: {
        anthropic: { cacheControl: { type: "ephemeral" } },
      },
    },
    { role: "user", content: userQuestion },
  ],
});

Resumable streams

Long generations create a UX problem: the user reloads the tab and loses the response in flight. The fix is to store stream chunks server-side and let the client reconnect with a stream ID. The AI SDK ships an experimental resume helper; the storage layer is yours to plug in. Redis or Vercel KV is the obvious choice.

// app/api/chat/route.ts - with resumable streams via Vercel KV
import { openai } from "@ai-sdk/openai";
import { streamText, convertToModelMessages } from "ai";
import { kv } from "@vercel/kv";

export const runtime = "nodejs";

export async function POST(req: Request) {
  const { messages, streamId } = await req.json();

  const result = streamText({
    model: openai("gpt-5"),
    messages: await convertToModelMessages(messages),
    onChunk: async ({ chunk }) => {
      if (chunk.type === "text-delta") {
        // AI SDK 5+ names the delta field `text` (it was `textDelta` in v4)
        await kv.rpush(`stream:${streamId}`, chunk.text);
        await kv.expire(`stream:${streamId}`, 3600);
      }
    },
    onFinish: async () => {
      await kv.set(`stream:${streamId}:done`, "1", { ex: 3600 });
    },
  });

  // Keep generating if the tab closes, so there is something left to resume.
  // Deliberately no abortSignal here: a disconnect must not stop the model.
  result.consumeStream();

  return result.toUIMessageStreamResponse();
}

// GET handler for resumption
export async function GET(req: Request) {
  const streamId = new URL(req.url).searchParams.get("id");
  if (!streamId) {
    return new Response("Missing id", { status: 400 });
  }
  const chunks = await kv.lrange(`stream:${streamId}`, 0, -1);
  return new Response(chunks.join(""), {
    headers: { "content-type": "text/plain" },
  });
}

On the client, store the stream ID in sessionStorage on submit, and on mount check whether an unfinished stream exists - if so, hit the GET endpoint and replay the chunks before connecting the live stream. This is overkill for most chat UIs. Reach for it only when generations regularly cross 30 seconds and users actually leave the page.
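The client half can be sketched as three small functions. This is an assumption-laden illustration: the KeyValue interface is the subset of sessionStorage the code needs (so sessionStorage can be passed in directly in the browser), the storage key is a made-up name, and the URL matches the GET handler shown in this section.

```typescript
// The subset of the Storage API this sketch relies on.
interface KeyValue {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

const ACTIVE_STREAM_KEY = "activeStreamId"; // hypothetical key name

// Call on submit, before the request starts.
function rememberStream(store: KeyValue, streamId: string): void {
  store.setItem(ACTIVE_STREAM_KEY, streamId);
}

// Call once the stream has finished normally.
function forgetStream(store: KeyValue): void {
  store.removeItem(ACTIVE_STREAM_KEY);
}

// Call on mount: returns the replay URL, or null if nothing is in flight.
function resumeUrl(store: KeyValue): string | null {
  const id = store.getItem(ACTIVE_STREAM_KEY);
  return id ? `/api/chat?id=${encodeURIComponent(id)}` : null;
}
```

On mount, a non-null resumeUrl means the previous generation was interrupted, so the component can fetch that URL and render the replayed text before doing anything else.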

Deploying to Vercel

Vercel is the cheapest path to a streaming AI endpoint that actually works under load. Three settings matter. The runtime - use Node, not Edge, unless you have a specific geo-latency requirement. The maxDuration - raise it to 300 seconds for any chat or long-form generation endpoint, since the default 10-second timeout will cut you off mid-stream. And Fluid Compute - turn it on at the project level so cold starts no longer dominate your p95 latency on Node functions.

// app/api/chat/route.ts - production runtime config
export const runtime = "nodejs";
export const maxDuration = 300;
export const dynamic = "force-dynamic";

The dynamic: force-dynamic export prevents Next.js from trying to statically render the route during build. Without it, builds occasionally try to evaluate the handler at build time and complain about missing environment variables. For deeper deployment patterns - geo routing, model failover, multi-region KV - the SaaS MVP tech stack post covers the supporting infrastructure that complements streaming endpoints.

If you are reaching for streaming because you are building a more agentic flow - multi-step reasoning, retrieval, branching tool use - the agentic RAG architecture post covers the orchestration patterns that sit on top of these primitives. The streaming layer is the same; the loop on top is what determines whether the system feels like a tool or a coworker.

The full official SDK docs live at ai-sdk.dev and the Next.js App Router reference at nextjs.org/docs. Both are kept current - bookmark them over any tutorial, including this one. The SDK changes fast enough that the version you ship against in six months will have new helpers worth adopting.

If you want a senior engineer who has shipped streaming AI features across multiple production Next.js apps, my AI integration practice covers exactly this scope. I work with teams worldwide, and you can also hire an AI developer in Kosovo directly. Same person behind OmniAPI and Caldra AI, both of which run on the exact streaming stack described in this post.

Frequently asked questions

Do I need the Vercel AI SDK to stream OpenAI in Next.js?

No, but you will reinvent it badly if you do not use it. The OpenAI SDK returns an async-iterable stream, and you can convert it to a ReadableStream and return it from a Route Handler in about 15 lines of code. The reason every serious Next.js app reaches for the AI SDK is the client-side useChat and useObject hooks - they handle message state, deltas, tool-call rendering, abort signals, and reconnection in a few props. If you are building a one-off endpoint, raw OpenAI is fine. If you are building a product, the SDK pays for itself the first afternoon.
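For a sense of what the raw approach involves, here is a sketch that uses only web-standard APIs. It assumes the token source is any async iterable of strings; fakeTokens is a hypothetical stand-in for a provider stream, and streamResponse is a name chosen for this example.

```typescript
// Turn any async iterable of text chunks into a streaming HTTP response.
function streamResponse(chunks: AsyncIterable<string>): Response {
  const encoder = new TextEncoder();
  const iterator = chunks[Symbol.asyncIterator]();

  const body = new ReadableStream<Uint8Array>({
    // Called whenever the consumer is ready for more data.
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(value));
      }
    },
    // Called when the client disconnects: let the source clean up.
    async cancel() {
      await iterator.return?.();
    },
  });

  return new Response(body, {
    headers: { "content-type": "text/plain; charset=utf-8" },
  });
}

// Hypothetical stand-in for a provider's token stream.
async function* fakeTokens(): AsyncGenerator<string> {
  yield "Hello";
  yield ", ";
  yield "world";
}
```

A Route Handler could return streamResponse(someProviderStream) directly. What this does not give you is everything on the client side: message state, tool parts, and error parts are still yours to build.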

Should I use Server Actions or Route Handlers for streaming?

Route Handlers if the client is the browser and you want a stable, cacheable, public-ish URL. Server Actions if the stream is part of a form submission or a server-driven UI flow, especially when you are also reading cookies, hitting your database, and rendering Server Components from the same response. Both can stream in 2026. Route Handlers are still the safer default for chat-style UIs because every client library on earth understands a POST that returns an event stream. Server Actions shine in tight integrations with React Server Components.

Edge runtime or Node runtime for streaming on Vercel?

Node, almost always, since Fluid Compute landed. Edge used to win on cold-start latency for chat endpoints, but Fluid Compute closed that gap and Node gives you full library compatibility - including any OpenAI SDK feature that depends on Node streams. The exception is geo-distributed personalization where TTFB matters more than feature breadth, in which case Edge still earns its complexity. For 90% of streaming AI endpoints I ship in 2026, the answer is the Node runtime with maxDuration raised to 300 seconds.

How do I cancel a stream mid-generation?

The AI SDK wires an AbortSignal end to end. On the client, useChat exposes a stop() function - call it from a button click. Under the hood it aborts the fetch, which closes the response stream, which signals streamText on the server, which calls abort() on the underlying OpenAI request. The model stops generating, you stop paying for tokens. The pattern that catches people is forgetting to wire the signal through any custom middleware you add - if your auth wrapper does not forward the signal, cancellation hangs.

How do I handle errors mid-stream?

Three layers. First, wrap streamText in a try/catch and use the onError callback to log and translate provider errors into typed errors the client can render. Second, use the AI SDK error part - the stream protocol has a first-class error chunk that useChat surfaces as an error state without breaking the message list. Third, add exponential backoff inside a thin retry wrapper around streamText for the recoverable error classes (429, 500, 503, network reset). The non-recoverable ones (auth, invalid request) should fail loudly the first time.

What is the simplest way to add prompt caching?

Anthropic-style cache markers on the system prompt and any large stable context blocks. The AI SDK exposes provider-specific cache control through the providerOptions field on each message - set type: ephemeral on the message you want cached, and the provider deduplicates the prefix across requests within the cache TTL. OpenAI does prefix caching automatically for prompts over 1024 tokens with no API change, so the trick there is to put your stable system prompt and few-shot examples at the very start of every request and keep them byte-identical across calls.

Can I resume a stream if the user reloads the page?

Yes, but it requires server-side state. The pattern is: write each stream chunk to Redis or Vercel KV keyed by a stream ID, return the stream ID to the client immediately, and expose a second endpoint that replays the cached chunks if the client reconnects with the same ID. The AI SDK ships an experimental_resume helper that does the wiring if you supply the storage adapter. For most chat UIs this is overkill - you only need it when generations run long enough that users actually leave and come back.

Why is my stream buffered instead of streamed in production?

Almost always a proxy. Vercel itself streams fine, but Cloudflare in front of Vercel, an Nginx reverse proxy, or a corporate firewall will buffer event streams by default. Set the Content-Type to text/event-stream and add the X-Accel-Buffering: no header - the AI SDK's toUIMessageStreamResponse helper does this for you. The other common culprit is compression middleware that waits for the full body before flushing. Disable compression on streaming routes or use Content-Encoding: identity.
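If you hand-roll a streaming response instead of using the SDK helper, the headers and framing look roughly like this. Both function names are invented for the example; the header values are the standard server-sent-events ones plus the Nginx buffering hint mentioned above.

```typescript
// Headers that ask intermediaries not to buffer or transform the stream.
function streamingHeaders(): Record<string, string> {
  return {
    "content-type": "text/event-stream; charset=utf-8",
    // no-transform asks proxies not to recompress or rewrite the body
    "cache-control": "no-cache, no-transform",
    // Nginx-specific hint to disable response buffering
    "x-accel-buffering": "no",
  };
}

// One server-sent event: a data line followed by a blank line.
function sseEvent(data: unknown): string {
  return `data: ${JSON.stringify(data)}\n\n`;
}
```

The blank line after each event matters: a client only dispatches an event once it sees the terminating empty line.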