AI Content Moderation: OpenAI Moderation API in Production
By Ergini, Software & AI Developer in Pristina, Kosovo
TL;DR
The OpenAI Moderation API is free and good. Then you hit your edge cases. This post shows the input and output moderation pattern, threshold tuning, custom classifier layering, and the human queue for ambiguous cases.
Every platform that lets users post anything or that generates anything with an LLM is now a content moderation problem. The good news is that the baseline tooling got dramatically better in 2026 - the OpenAI Moderation API is free, fast, and covers thirteen categories, with image support for a subset of them. The bad news is that the baseline is not enough on its own. This post is the three-layer architecture I ship for clients who need moderation that holds up at production traffic without becoming the single point of failure for their platform.
Why moderation matters in 2026
Three forces converged this year and turned content moderation from a nice-to-have into table stakes. First, user-generated content is bigger than it has ever been - comments, DMs, profiles, marketplace listings, livestream chat. Every one of those surfaces is a vector for hate speech, harassment, spam, and worse. The volume alone makes full human review economically impossible above a few thousand items a day.
Second, AI-generated content created an entirely new failure mode. Any product that ships an LLM feature - chatbots, draft assistants, image generation, voice agents - now has to worry about the model itself producing problematic output. Jailbreaks slip through. Models hallucinate instructions for things they should refuse. A fine-tuned model drifts in production and starts generating content the pre-launch eval never caught. Output moderation is no longer optional when your product is the one generating the content.
Third, regulation arrived. The EU Digital Services Act is now fully enforced with meaningful fines for platforms that fail their notice-and-action obligations. KOSA is moving through US Congress with bipartisan support and most legal teams are treating it as inevitable. Age-gate requirements in the UK, France, and several US states require platforms to demonstrate that minors are not exposed to specific category sets. None of this is solved by a single moderation call - it requires a documented pipeline with an audit trail and a human in the loop on edge cases.
The three-layer architecture
The shape that works in production is three layers stacked in sequence. Layer one is input moderation - every inbound user message or every prompt heading to your LLM gets a moderation check before it reaches the model. Layer two is output moderation - every LLM response gets a moderation check before it reaches the user. Layer three is the human review queue - items that score in the ambiguous middle, get flagged by the custom classifier, or are reported by users land in a queue that human reviewers work through with a defined SLA.
Each layer has a budget and a fallback. Input moderation runs in under 200ms or the request proceeds with a flag for retroactive review. Output moderation runs in parallel with the LLM stream and kills the stream if a high-severity category fires. The human queue has a 4-hour SLA for ambiguous items and a 30-minute SLA for high-severity escalations. The architecture is designed to fail toward a human, not toward a confident wrong block, which is the same design principle that drives my human-in-the-loop AI work across every client deployment.
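The budget-and-fallback rule can be sketched as a small wrapper. This is a minimal sketch, not a library API: `withBudget` and `BudgetResult` are hypothetical names, and treating a provider error the same way as a timeout (proceed, but flag for retroactive review) is an assumption layered on the rule above.

```typescript
// Hypothetical helper: race any moderation call against a latency budget.
type BudgetResult<T> =
  | { status: "ok"; value: T }
  | { status: "fallback"; reason: "timeout" | "error" };

function withBudget<T>(
  call: () => Promise<T>,
  budgetMs: number,
): Promise<BudgetResult<T>> {
  return new Promise<BudgetResult<T>>((resolve) => {
    // If the budget elapses first, resolve with a fallback instead of waiting.
    const timer = setTimeout(
      () => resolve({ status: "fallback", reason: "timeout" }),
      budgetMs,
    );
    call().then(
      (value) => {
        clearTimeout(timer);
        resolve({ status: "ok", value });
      },
      () => {
        // Assumption: a provider error is handled like a timeout.
        clearTimeout(timer);
        resolve({ status: "fallback", reason: "error" });
      },
    );
  });
}
```

A caller would wrap the moderation call, for example `withBudget(() => checkInput(text, userId), 200)`, let the request proceed on a `fallback` result, and write the item to the review queue for a retroactive check.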
OpenAI Moderation API - what it actually covers
The omni-moderation-latest model is the 2026 default for layer one and layer two. It is free, returns category scores between 0 and 1 for thirteen categories, and the response includes both per-category probabilities and OpenAI's default boolean flags. A typical call looks like this.
// src/moderate.ts
import OpenAI from "openai";
const openai = new OpenAI();
export async function moderate(input: string | string[]) {
  const res = await openai.moderations.create({
    model: "omni-moderation-latest",
    input,
  });
  return res.results.map((r) => ({
    flagged: r.flagged,
    categories: r.categories,
    scores: r.category_scores,
  }));
}
The response gives you everything you need to make a routing decision. The boolean flagged field is OpenAI's default threshold call, which is conservative and tuned for a generic platform. The category_scores object is where the real signal lives - every category gets a raw probability score that you compare against your own thresholds. In practice nobody ships on the default booleans; you ship on tuned per-category thresholds.
Categories and thresholds
The thirteen categories the omni model returns are: hate, hate/threatening, harassment, harassment/threatening, illicit, illicit/violent, self-harm, self-harm/intent, self-harm/instructions, sexual, sexual/minors, violence, and violence/graphic. The right threshold for each category is a function of your platform's tolerance for false positives versus false negatives. There is no universal answer.
The process I run with clients: sample 500 real items from your production traffic, label them by hand against each category, then build a confusion matrix at thresholds from 0.1 to 0.9 in 0.05 increments. Plot the precision-recall curve per category and pick the threshold where the curve sits at your team's tolerance. Categories with high cost-of-miss (sexual/minors on any consumer platform, self-harm/intent on a teen-facing product) get aggressive thresholds around 0.2 to 0.3. Categories with high cost-of-false-positive (harassment on a debate forum where heated exchange is part of the product) sit closer to 0.6 or 0.7.
A reasonable starting point for a generic consumer product is: sexual/minors at 0.2, self-harm/intent at 0.3, hate/threatening at 0.4, violence/graphic at 0.5, harassment at 0.6, sexual at 0.5, hate at 0.6, violence at 0.7. Then tune from real traffic. Re-run the analysis every quarter because your traffic distribution will shift as your user base grows.
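The tuning process can be sketched as a sweep over one category's hand-labelled sample. This is a minimal sketch under stated assumptions: `LabelledItem`, `SweepRow`, and `sweepThresholds` are hypothetical names, and reporting precision and recall as 1 when their denominator is zero is a convention chosen here, not something the API defines.

```typescript
// One hand-labelled item for a single category.
interface LabelledItem {
  score: number; // category score returned by the moderation call
  violates: boolean; // human label for this category
}

interface SweepRow {
  threshold: number;
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
  precision: number;
  recall: number;
}

function sweepThresholds(items: LabelledItem[]): SweepRow[] {
  const rows: SweepRow[] = [];
  // Steps 2..18 over 20 give 0.10, 0.15, ..., 0.90 without accumulating float drift.
  for (let step = 2; step <= 18; step++) {
    const threshold = step / 20;
    let tp = 0;
    let fp = 0;
    let fn = 0;
    for (const item of items) {
      const flagged = item.score >= threshold;
      if (flagged && item.violates) tp++;
      else if (flagged && !item.violates) fp++;
      else if (!flagged && item.violates) fn++;
    }
    rows.push({
      threshold,
      truePositives: tp,
      falsePositives: fp,
      falseNegatives: fn,
      // Assumption: report 1 when nothing was flagged or nothing violates.
      precision: tp + fp === 0 ? 1 : tp / (tp + fp),
      recall: tp + fn === 0 ? 1 : tp / (tp + fn),
    });
  }
  return rows;
}
```

Run it once per category and read the rows as the precision-recall curve: the threshold you ship is the row where both numbers sit inside your team's tolerance.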
Layer 1: input moderation
Input moderation runs before any LLM call or content publish. The pattern is a simple wrapper that checks the moderation result against your per-category thresholds and either passes, blocks, or flags for review.
// src/input-mod.ts
import { moderate } from "./moderate.js";
import { logFlag } from "./audit.js";
// Per-category thresholds; tune these from your own labelled traffic.
const THRESHOLDS = {
  "sexual/minors": 0.2,
  "self-harm/intent": 0.3,
  "hate/threatening": 0.4,
  "violence/graphic": 0.5,
  harassment: 0.6,
  sexual: 0.5,
  hate: 0.6,
  violence: 0.7,
} as const;
export async function checkInput(text: string, userId: string) {
  const [result] = await moderate(text);
  // Collect every category whose score meets or exceeds its threshold.
  const triggered: string[] = [];
  for (const [cat, threshold] of Object.entries(THRESHOLDS)) {
    const score = result.scores[cat as keyof typeof result.scores] ?? 0;
    if (score >= threshold) triggered.push(cat);
  }
  if (triggered.length === 0) return { action: "allow" as const };
  // Every flag is logged, whether it ends in a block or a review.
  await logFlag({ userId, text, triggered, scores: result.scores });
  const hardBlock = triggered.some((c) =>
    ["sexual/minors", "self-harm/intent"].includes(c)
  );
  return hardBlock
    ? { action: "block" as const, triggered }
    : { action: "review" as const, triggered };
}
Note the three-way return: allow, block, or review. Hard-block categories (sexual/minors, self-harm/intent) are blocked immediately instead of waiting on a reviewer, because the cost of a false negative is too high; the logged flag still feeds the high-severity escalation queue. Everything else gets flagged for human review without blocking the user, because blocking aggressively on the soft categories trains users to find workarounds and silently kills your funnel.
Layer 2: output moderation
Output moderation runs on every LLM response. The pattern depends on whether you are streaming or not. For non-streamed responses, moderate after generation and before delivery. For streamed responses, buffer the stream chunks, moderate roughly every 200 characters of buffered output, check the remaining tail when the stream ends, and kill the stream if a category fires above its threshold.
// src/output-mod.ts
import { moderate } from "./moderate.js";
// Stricter than the input thresholds: a hit here kills the response.
const STREAM_THRESHOLDS = {
  "sexual/minors": 0.15,
  "self-harm/instructions": 0.2,
  "hate/threatening": 0.3,
  "violence/graphic": 0.4,
} as const;
export async function checkOutput(text: string) {
  const [result] = await moderate(text);
  for (const [cat, threshold] of Object.entries(STREAM_THRESHOLDS)) {
    const score = result.scores[cat as keyof typeof result.scores] ?? 0;
    if (score >= threshold) {
      return { ok: false as const, category: cat, score };
    }
  }
  return { ok: true as const };
}
// Chunks are yielded before they are checked, so the consumer must treat a
// thrown error as "cut the stream and retract what was already shown".
export async function* moderateStream(stream: AsyncIterable<string>) {
  let buffer = "";
  for await (const chunk of stream) {
    buffer += chunk;
    yield chunk;
    // buffer.length counts characters, not tokens.
    if (buffer.length >= 200) {
      const check = await checkOutput(buffer);
      if (!check.ok) {
        throw new Error(`Stream killed: ${check.category} at ${check.score}`);
      }
      // Keep a 50-character overlap so text split across checks is still seen.
      buffer = buffer.slice(-50);
    }
  }
  // Check the tail: a response that ends before the next 200-character mark
  // would otherwise never be moderated.
  if (buffer.length > 0) {
    const check = await checkOutput(buffer);
    if (!check.ok) {
      throw new Error(`Stream killed: ${check.category} at ${check.score}`);
    }
  }
}
Output moderation is also the layer that catches jailbreaks the input filter missed. A jailbreak by definition is a prompt the model engages with that it should have refused; output moderation on the response is your second line of defence. This is the same defence-in-depth thinking I cover in detail in my prompt injection prevention post - input and output moderation are two of the eight patterns in that defence stack.
When OpenAI Moderation is not enough
The free baseline handles roughly 80% of moderation needs on a generic consumer platform. The remaining 20% breaks into three categories that force you into a custom classifier on top.
Brand-specific policy. Competitor mentions in your marketplace, off-topic content on a focused community, referral spam patterns specific to your product. The OpenAI model has no idea any of these are policy violations because they are not policy violations on a generic platform.
Regulated industries. Financial services need to catch unauthorized investment advice. Healthcare needs to catch unverified medical claims. Crypto exchanges need to catch KYC bypass discussion. None of these map to the generic categories and all of them carry real regulatory risk if missed.
Multilingual content. The omni model is strong in English and major European languages but weaker in low-resource languages, regional dialects, and code-switched content. If your platform serves Indonesian, Vietnamese, Albanian, Tagalog, or similar markets, baseline coverage is materially lower and a custom layer becomes mandatory.
Custom classifier with structured outputs
The cheapest custom layer is a structured-output call to a small model with a tight enum of your platform-specific categories. This runs in parallel with the OpenAI Moderation call so the additional latency is hidden, and the cost is roughly $0.0005 per item on GPT-5-mini.
// src/custom-classify.ts
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
const schema = z.object({
  categories: z.array(z.enum([
    "competitor_mention",
    "off_topic",
    "referral_spam",
    "financial_advice",
    "medical_claim",
    "kyc_bypass",
    "clean",
  ])),
  confidence: z.number().min(0).max(1),
  reason: z.string(),
});
const SYSTEM = `You are a content moderation classifier for Acme.
Return the categories that apply to this content from the enum.
Use "clean" only if no other category applies.
Be precise - do not flag content that is merely adjacent to a category.`;
export async function customClassify(text: string) {
  const { object } = await generateObject({
    model: openai("gpt-5-mini"),
    schema,
    system: SYSTEM,
    prompt: text,
  });
  return object;
}
For higher volume, distill this classifier into a fine-tuned smaller model - either a fine-tuned GPT-4.1-mini or an open-source BERT variant served on your own infrastructure. The distillation path drops cost per item from $0.0005 to roughly $0.00005 at the cost of a one-time training run. The pattern is the same one I use in my OpenAI structured outputs guide - strict schema, single enum field, refusal-aware system prompt.
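Running the custom classifier in parallel with the baseline check might look like the sketch below. Everything here is an assumption for illustration: `moderateInParallel` is a hypothetical name, the two calls are injected as plain functions so the sketch does not depend on any SDK, and the 0.7 confidence cut-off and the rule that any custom hit routes to review are placeholder policy, not a recommendation from the sections above.

```typescript
type BaselineAction = "allow" | "review" | "block";

interface CustomResult {
  categories: string[];
  confidence: number;
}

async function moderateInParallel(
  text: string,
  baseline: (text: string) => Promise<BaselineAction>,
  custom: (text: string) => Promise<CustomResult>,
  minConfidence = 0.7, // hypothetical cut-off
): Promise<{ action: BaselineAction; customCategories: string[] }> {
  // Both calls start together, so total latency is the slower of the two.
  const [base, extra] = await Promise.all([baseline(text), custom(text)]);
  // Ignore low-confidence results and the "clean" marker.
  const customCategories =
    extra.confidence >= minConfidence
      ? extra.categories.filter((c) => c !== "clean")
      : [];
  // A baseline hard block always wins.
  if (base === "block") return { action: "block", customCategories };
  // Assumption: any custom category routes the item to human review.
  const action =
    base === "review" || customCategories.length > 0 ? "review" : "allow";
  return { action, customCategories };
}
```

In the pipeline above, `baseline` would wrap `checkInput` and `custom` would wrap `customClassify`.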
The human review queue
The queue is a database table of flagged items plus a thin internal app for reviewers. The schema looks like: item ID, content snapshot, triggered categories, scores, user ID, severity, SLA timestamp, reviewer ID, decision, decision reason, decision timestamp. That is the entire data model and it is also the audit trail you need for DSA and KOSA compliance.
SLA matters more than throughput. Set a 4-hour SLA for ambiguous items in soft categories (harassment, off-topic, off-brand). Set a 30-minute SLA for high-severity escalations (sexual/minors, self-harm/intent, hate/threatening). Prioritize the queue by severity multiplied by user impact - a flagged comment with 10 views can wait, a flagged livestream message in front of 10K viewers jumps to the front of the queue.
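The queue row, SLA, and prioritization rule can be sketched in a few lines. This is a minimal sketch: the field names, the numeric `severity` scale, and using `views` as the user-impact proxy are all assumptions made for illustration.

```typescript
// Sketch of the queue row described above. Field names are assumptions.
interface ReviewItem {
  itemId: string;
  contentSnapshot: string;
  triggered: string[];
  scores: Record<string, number>;
  userId: string;
  severity: number; // hypothetical scale: 1 = soft category, 3 = high severity
  views: number; // hypothetical proxy for user impact
  slaDeadline: Date;
  // Filled in once a reviewer acts on the item.
  reviewerId?: string;
  decision?: "approve" | "reject" | "escalate";
  decisionReason?: string;
  decidedAt?: Date;
}

const HIGH_SEVERITY = new Set([
  "sexual/minors",
  "self-harm/intent",
  "hate/threatening",
]);

// 30 minutes for high-severity escalations, 4 hours for everything else.
function slaDeadline(triggered: string[], flaggedAt: Date): Date {
  const minutes = triggered.some((c) => HIGH_SEVERITY.has(c)) ? 30 : 240;
  return new Date(flaggedAt.getTime() + minutes * 60 * 1000);
}

// Severity multiplied by user impact; higher scores are reviewed first.
function priority(item: Pick<ReviewItem, "severity" | "views">): number {
  return item.severity * Math.max(item.views, 1);
}

// Returns a new array ordered from most to least urgent.
function sortQueue<T extends Pick<ReviewItem, "severity" | "views">>(
  items: T[],
): T[] {
  return [...items].sort((a, b) => priority(b) - priority(a));
}
```

With this scoring a soft-category message in front of 10K viewers outranks a high-severity item with 10 views; if that is not the behaviour you want, cap or dampen the impact term.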
Build the reviewer UI for speed. Keyboard shortcuts for approve/reject/escalate, one item at a time, no modal dialogs, no confirmation steps for low-severity decisions. The reviewer is burning 5 to 15 seconds per item at peak; every UI friction point gets multiplied by tens of thousands of decisions per week. Log every decision because reviewer decisions are training data for the custom classifier in the next active-learning cycle.
Image and video moderation
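OpenAI's omni-moderation model handles images alongside text in a single call, which is the simplest path for product surfaces that mix both. Pass the image as an image_url input item (a hosted URL or a base64 data URL) next to the text item and you get back the same category score object. Only some categories are evaluated on images - the self-harm, sexual, and violence families, excluding sexual/minors - and the rest are scored from text only. For most products this is the default and you do not need a second image-specific pipeline.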
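A mixed text-and-image request can be assembled as an array of typed input items. The sketch below only builds the payload so it runs without network access; `buildModerationInput` and `toDataUrl` are hypothetical helper names, while the `text` and `image_url` item shapes follow the multimodal input format of the Moderation API.

```typescript
type ModerationInputItem =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

// Build the input array for a post that may carry text, an image, or both.
function buildModerationInput(
  text?: string,
  imageUrl?: string,
): ModerationInputItem[] {
  const input: ModerationInputItem[] = [];
  if (text) input.push({ type: "text", text });
  if (imageUrl) input.push({ type: "image_url", image_url: { url: imageUrl } });
  return input;
}

// Wrap raw base64 bytes as a data URL so they fit the image_url item.
function toDataUrl(base64: string, mimeType: string): string {
  return `data:${mimeType};base64,${base64}`;
}

// Usage (not executed here):
// const res = await openai.moderations.create({
//   model: "omni-moderation-latest",
//   input: buildModerationInput(caption, imageUrl),
// });
```

The scores come back in the same `category_scores` object, so the threshold logic from the text pipeline can be reused unchanged.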
When you need deeper image features - region-specific NSFW heuristics, OCR of in-image text for hate symbols, object detection for weapons or drugs - the mature alternatives are Google Cloud Vision SafeSearch and AWS Rekognition Content Moderation. Both are battle-tested, both run around $1.50 per 1K images, and both return richer category trees than OpenAI's generic set.
Video moderation usually means sampling frames at 1 to 3 frames per second and passing each frame through the image pipeline. The cost math scales linearly with sampling rate; the harder problem is the human review UX for video because reviewers need timestamp jumping, clip context, and the ability to scrub. Build that UI deliberately or queue review time grinds to a halt.
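The linear cost scaling is easy to make concrete. This is a small sketch with a hypothetical function name; the price per thousand images is a parameter because it depends on which image pipeline the frames go through.

```typescript
// Frames sampled from a clip and the resulting per-clip image moderation cost.
function videoModerationCost(
  durationSeconds: number,
  framesPerSecond: number,
  pricePerThousandImages: number,
): { frames: number; cost: number } {
  // Round up so a partial final interval still gets a frame.
  const frames = Math.ceil(durationSeconds * framesPerSecond);
  return { frames, cost: (frames / 1000) * pricePerThousandImages };
}
```

At the $1.50 per 1K images figure quoted above, a 10-minute clip is 600 frames and about $0.90 at 1 fps, or 1,800 frames and about $2.70 at 3 fps.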
For livestream workloads you also need an audio path - speech-to-text on the live feed with the transcript moderated chunk by chunk. Whisper handles the transcription side cheaply, the text moderation reuses your existing layer-one pipeline, and the decision loop has to be tight enough to mute or cut the stream within seconds. This is also where you most need a soft warning state, because false positives that cut a creator mid-stream destroy trust faster than almost any other product failure.
Cost math
OpenAI Moderation is free, which means layer one and layer two cost nothing at any volume. The custom classifier runs roughly $0.0005 per item on GPT-5-mini or $0.00005 per item after distillation. The human review layer is the only meaningful operating cost - a reviewer at $25 per hour processing 250 items per hour costs $0.10 per reviewed item, and you typically route 2 to 5% of all content to review.
Worked example for a platform doing 1M user posts per month: 1M OpenAI Moderation calls at $0, 1M custom classifier calls at $500, 30K items routed to review at $3K reviewer time, totalling $3.5K per month for fully-moderated UGC at 1M items. Compare to a single regulatory fine, a single PR incident, or the unit economics of the deflection you get from blocking obvious spam upstream, and the moderation pipeline pays for itself before month two. The full per-call economics for the LLM portion lives in my OpenAI API cost post.
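The worked example reduces to a short function you can rerun with your own numbers. This is a sketch with hypothetical names; the inputs are the figures from the paragraphs above, with a 3% review rate standing in for the 30K routed items.

```typescript
interface CostInputs {
  itemsPerMonth: number;
  classifierCostPerItem: number; // USD per custom classifier call
  reviewRate: number; // fraction of items routed to human review
  reviewerHourlyRate: number; // USD per hour
  itemsReviewedPerHour: number;
}

function monthlyModerationCost(i: CostInputs) {
  // The OpenAI Moderation calls are free, so they add nothing here.
  const classifier = i.itemsPerMonth * i.classifierCostPerItem;
  const reviewedItems = i.itemsPerMonth * i.reviewRate;
  const review = reviewedItems * (i.reviewerHourlyRate / i.itemsReviewedPerHour);
  return { classifier, reviewedItems, review, total: classifier + review };
}

const example = monthlyModerationCost({
  itemsPerMonth: 1_000_000,
  classifierCostPerItem: 0.0005,
  reviewRate: 0.03,
  reviewerHourlyRate: 25,
  itemsReviewedPerHour: 250,
});
// classifier is about 500, review about 3000, total about 3500 (USD per month)
```

Swapping in the distilled classifier price of $0.00005 per item drops the classifier line to about $50, which shows that reviewer time dominates the bill.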
Compliance - DSA, KOSA, age gates
The EU Digital Services Act applies to any platform serving EU users above a low threshold and imposes notice-and-action obligations, transparency reporting, and risk assessment requirements. Practically, this means you need a documented moderation pipeline, an audit log of every decision, a user-facing reporting mechanism, and a yearly transparency report. The three-layer architecture above gives you the first three for free if you build the queue with proper logging from day one.
KOSA (the US Kids Online Safety Act) is moving through Congress and is widely expected to pass in some form. The relevant provisions for moderation are the duty-of-care obligations for platforms with under-17 users on specific category sets: self-harm/intent, sexual/minors, harassment, and content promoting eating disorders. Tune your thresholds aggressively on those categories if your user base skews young, and document the rationale because the legal team will need it.
Age gate requirements in the UK Online Safety Act, the French SREN law, and several US state laws all require that platforms demonstrate minors are not exposed to specific categories. The moderation pipeline is one piece; the other piece is age verification at signup or at first exposure to the gated content. That is a separate product surface but it lives in the same risk domain and should be scoped together.
Anti-patterns
Three anti-patterns show up on almost every moderation audit I run. None of them are novel; all of them keep happening because the failure mode is invisible until it is not.
AI-only with no human in the loop. The team trusts the classifier scores and skips the human review queue entirely. Works until the day the model misclassifies something serious in public and there is no audit trail showing a human reviewed the edge case. The fix is structural - even at 99% precision you need a human path for the 1% because that 1% is where regulators and journalists look.
Thresholds set too strict. The team panics about false negatives and sets every threshold at 0.3. Legitimate users get blocked, the support team is buried in appeals, and the funnel quietly bleeds out. The fix is the confusion-matrix tuning process above - set thresholds from data, not from fear.
No audit trail. The team runs moderation but never logs which items triggered which categories at which thresholds. When a regulator or a journalist asks how a specific item was handled, the answer is a shrug. The fix is to log every flag with full context from day one - item snapshot, scores, threshold, decision, reviewer, timestamp - because retrofitting an audit trail after the fact is significantly harder than building it in.
Two more honourable mentions show up often enough to call out. Skipping output moderation entirely is the most common gap I find on AI feature audits - the team protects the input pipeline and forgets the model itself can produce problematic content. And running moderation synchronously in the request path with no timeout or fallback means a moderation provider outage takes down the entire publish flow. The fix is a 200ms budget with a flag-for-review fallback if the call times out; never let an external dependency block the user experience completely.
If you are scoping a moderation pipeline and want a senior engineer who has shipped this exact three-layer architecture, my AI integration and AI workflow automation practices cover exactly this scope. I work with teams worldwide and you can also hire an AI developer in Kosovo directly. Same person who built Caldra AI and Lindi AI.
Frequently asked questions
What is AI content moderation?
AI content moderation is the automated classification of user-generated or AI-generated content into policy categories - hate, sexual, violence, self-harm, harassment, and so on - so a platform can block, flag, or queue the content for human review. The 2026 reference stack is a three-layer pipeline: the free OpenAI Moderation API as a fast baseline, a custom classifier for brand-specific or regulated categories, and a human queue for the ambiguous middle. Naive single-layer moderation misses roughly 20% of edge cases; the three-layer pattern brings false-negative rate under 3% on the workloads I have shipped.
Is the OpenAI Moderation API free?
Yes. The OpenAI Moderation API has been free since launch and remains free in 2026, including the omni-moderation model that classifies text and images in a single call. There is no per-token cost and no rate-limit tier you have to pay for; the only operational cost is your own latency budget and infrastructure. That is the main reason it works as a default first layer - every inbound and outbound message can be moderated without a budget conversation, and you spend custom-classifier money only on the items that pass or score in the ambiguous middle.
What categories does the OpenAI Moderation API cover?
The omni-moderation-latest model returns scores between 0 and 1 for thirteen categories: hate, hate/threatening, harassment, harassment/threatening, illicit, illicit/violent, self-harm, self-harm/intent, self-harm/instructions, sexual, sexual/minors, violence, and violence/graphic. Each category also has a boolean flag based on OpenAI's default thresholds. Image inputs are scored on a subset of those categories (the self-harm, sexual, and violence families, excluding sexual/minors); the rest are text-only. What it does not cover is brand-specific policy (competitor mentions, off-topic content), regulated-industry terms (financial advice, medical claims), or non-English nuance for low-resource languages - those are exactly the gaps a custom classifier layer fills.
How do I tune moderation thresholds?
Label 500 real items from your own traffic across each category - the OpenAI defaults are tuned for a generic platform and almost never match your tolerance. For each category, build a confusion matrix at thresholds from 0.1 to 0.9 in 0.05 increments and pick the threshold where the precision-recall curve sits at your team's tolerance for false positives versus false negatives. Categories with high cost-of-miss (self-harm/intent on a teen platform) get aggressive thresholds around 0.3; categories with high cost-of-false-positive (harassment on a debate forum) sit closer to 0.7. Re-evaluate every quarter because your traffic distribution shifts.
Should I moderate LLM outputs as well as inputs?
Yes. Input moderation blocks bad prompts before they reach the model; output moderation catches the cases where the model itself produces problematic content - jailbreaks that slipped past, hallucinated harmful instructions, or unexpected drift on a fine-tuned model. Output moderation is the layer that protects you when input moderation fails, and the cost is the same free call to OpenAI Moderation. Skipping output moderation is one of the most common production gaps I find on AI feature audits because the input pipeline gets attention and the output pipeline gets forgotten.
When is the OpenAI Moderation API not enough?
Three cases force you into a custom classifier on top. First, brand-specific policy - competitor mentions, off-topic content, spam patterns specific to your product. Second, regulated industries where the policy categories do not match the generic ones - financial-advice claims, unverified medical claims, KYC bypass discussion. Third, multilingual content in low-resource languages where OpenAI's coverage is weaker than in English. The pattern is to keep the free baseline as the first layer and add a small custom classifier - a structured-output call to a small model like GPT-5-mini, or a distilled fine-tuned BERT at higher volume - for the categories the baseline misses.
How do I build the human review queue?
The queue is just a database table of flagged items plus a thin internal app that lets reviewers approve, reject, or escalate. The non-obvious pieces are SLA, prioritization, and audit trail. Set a 4-hour SLA for ambiguous items and 30 minutes for high-severity escalations like self-harm/intent or sexual/minors. Prioritize by category severity multiplied by user impact (a flagged post seen by 10K people jumps the queue). Log every decision with reviewer ID, timestamp, item snapshot, and reason - both for compliance (DSA, KOSA) and for the active-learning loop where reviewer decisions feed back into the custom classifier as training data.
What about image and video moderation?
OpenAI's omni-moderation model handles images alongside text in a single call, which is the simplest path for product surfaces that mix both. For deeper image features (object detection, OCR of in-image text, NSFW heuristics by region), Google Cloud Vision SafeSearch and AWS Rekognition Content Moderation are mature and cheap at $1.50 per 1K images. Video moderation usually means sampling frames at 1 to 3 fps and passing them through the same pipeline. The cost math is straightforward once you decide how aggressively to sample; the harder problem is the human-review UX for video, which needs timestamp jumping and clip context.