AI Email Automation: Build Your Own Triage Agent (2026)
By Ergini, Software & AI Developer in Pristina, Kosovo
TL;DR
Off-the-shelf AI email tools either over-send or under-help. This is the architecture for a triage-only agent that drafts but never sends, learns your tone, and runs on your own keys. Gmail API code is included; the Outlook path follows the same shape through Microsoft Graph.
I have been living inside an AI email triage agent for the better part of a year. It is the second half of Caldra AI, the calendar-and-email assistant I built for myself and a small group of operators who refused to ship their inbox through one more SaaS. Most days it saves me 45 to 70 minutes. It has never sent an email I did not approve. That second property is the whole point of this post.
Why off-the-shelf email AI is broken
The current crop of consumer email AI gets two things wrong, both consistently. The first is autonomy bias: the marketing demo shows the agent sending replies on its own. In practice, one wrong autonomous send - to the wrong recipient, with the wrong tone, at the wrong moment in a deal - destroys the trust you spent months building. The arithmetic is not symmetric. A hundred correctly sent replies do not erase one catastrophic one.
The second is the data leak. Every major email AI vendor pipes your message bodies through their own infrastructure and, depending on the contract, retains them for indeterminate periods. Some train on anonymized samples. Some log full bodies to their observability layer. Read the privacy policy of any consumer email AI carefully and the same pattern repeats: aggressive permission grants, vague retention, no real opt-out for sensitive content.
Add a third, quieter problem: these tools are uncustomizable. You cannot teach Superhuman's AI that emails from your three biggest clients always need a 30-minute response window. You cannot tell Shortwave that anything containing a contract attachment must be flagged for legal review before drafting. The categories are fixed, the tone is the model's, and the workflow is whatever the vendor decided. For real productivity gains you need a system you control.
The triage-only architecture
Every email agent I have shipped to a client lands on the same shape: triage-only, drafts but never sends, human approves from the regular inbox. The agent is invisible until you open the draft. You see a ready-to-send reply with the cursor in the right place. You tap send, edit, or discard. That is the entire interaction loop.
This shape gets two things right that automated-send agents do not. First, trust compounds. After 200 approved drafts you trust the agent to handle the next category of mail; after 2000, you let it draft outside its original scope. The trust ladder is gradual and reversible. Compare that to an autosend agent: one bad send and trust resets to zero, possibly permanently.
Second, the productivity unlock is mostly classification and pre-writing, not the send action itself. The bottleneck in inbox work is reading, categorizing, and deciding the first words of a reply. The agent eliminates all three. The actual send takes a quarter of a second and is the only step where a human catch matters. It is the right place to keep the human, full stop. The pattern slots cleanly into the broader human-in-the-loop AI playbook - approval at the moment of irreversible action, autonomy everywhere upstream.
Stack
The 2026 default stack is narrow and proven. Five pieces, all with free or near-free starting tiers, all of which I have shipped to production.
| Component | Choice | Why |
|---|---|---|
| Mail provider API | Gmail API or Microsoft Graph | Native drafts, native labels, push notifications |
| LLM brain | Claude Sonnet 4.6 via Vercel AI SDK | Best tone-matching and structured-output reliability |
| Embeddings | OpenAI text-embedding-3-large or Voyage-3 | Cheap, accurate, multilingual for past-mail retrieval |
| Storage | Postgres with pgvector | One DB for metadata, audit log, and embeddings |
| Workflow runner | Vercel AI SDK + Next.js API routes | Server actions, streaming, structured outputs |
The Gmail-versus-Outlook decision is the only fork. The two APIs are structurally similar but the OAuth flows, scope strings, and push notification mechanics differ. Most production deployments end up supporting both - the abstractions below assume Gmail for clarity but the Outlook path is a 200-line adapter, not a rewrite.
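One way to keep that fork small is to put both providers behind a single interface. The sketch below is an assumption about how such an adapter boundary could look, not the Caldra AI code: `MailProvider`, `InMemoryProvider`, and `labelAndDraft` are hypothetical names, and the in-memory class stands in for a real Gmail or Graph client so the example runs without credentials.

```typescript
// Hypothetical adapter boundary: everything downstream talks to MailProvider,
// never to the Gmail or Microsoft Graph SDK directly.
interface MailMessage {
  id: string;
  threadId: string;
  from: string;
  subject: string;
  body: string;
}

interface MailProvider {
  fetchMessage(id: string): Promise<MailMessage>;
  applyLabels(id: string, labels: string[]): Promise<void>;
  createDraft(replyToId: string, body: string): Promise<string>; // resolves to a draft id
  // Deliberately no send(): the pipeline cannot call what the interface lacks.
}

// In-memory stand-in for tests; a Gmail or Graph adapter would implement
// the same three methods against the real API.
class InMemoryProvider implements MailProvider {
  inbox: MailMessage[];
  labels = new Map<string, string[]>();
  drafts: { replyToId: string; body: string }[] = [];

  constructor(inbox: MailMessage[]) {
    this.inbox = inbox;
  }

  async fetchMessage(id: string): Promise<MailMessage> {
    const msg = this.inbox.find((m) => m.id === id);
    if (!msg) throw new Error(`unknown message ${id}`);
    return msg;
  }

  async applyLabels(id: string, labels: string[]): Promise<void> {
    this.labels.set(id, [...(this.labels.get(id) ?? []), ...labels]);
  }

  async createDraft(replyToId: string, body: string): Promise<string> {
    this.drafts.push({ replyToId, body });
    return `draft-${this.drafts.length}`;
  }
}

// Provider-agnostic step: label a message and leave a draft under it.
async function labelAndDraft(
  provider: MailProvider,
  id: string,
  label: string,
  reply: string
): Promise<string> {
  const msg = await provider.fetchMessage(id);
  await provider.applyLabels(msg.id, [label]);
  return provider.createDraft(msg.id, reply);
}
```

Because the interface has no send method, the drafts-only rule is visible in the type system as well as in the prose.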
Categories the agent should produce
Categorization is where most email agents are tuned wrong. Too few categories and the bucket is useless (everything ends up in "needs reply"). Too many and the model's accuracy collapses. The seven-category set below has been my default across every client deployment. Tune it to your domain but resist the urge to add categories - most pain comes from the wrong category split, not too few.
- Action - a concrete thing the user must do (sign a document, approve an invoice, review a PR).
- Reply needed - a real human reply is expected, with a deadline implied by context.
- FYI - informational, no reply needed but the user should be aware (status updates, copies on threads).
- Newsletter - automated content, marketing, subscriptions. Archive by default; surface for explicit interest.
- Spam - unsolicited, irrelevant, or malicious. Goes straight to the spam label.
- Receipts - transactional confirmations, invoices, order updates. Filed but searchable.
- Calendar - meeting requests, reschedules, cancellations. Routes to the scheduling pipeline.
The schema below is what I feed to the LLM as the structured-output contract. Strict types let the model express uncertainty (the confidence field), force an urgency value on every message, and produce a one-sentence rationale that becomes the audit log entry. The rationale is the single most useful field in production - when a categorization looks wrong, the rationale tells you whether the model misread the message or whether your category definitions are vague.
// src/schema.ts
import { z } from "zod";
export const EmailTriageSchema = z.object({
category: z.enum([
"action",
"reply_needed",
"fyi",
"newsletter",
"spam",
"receipt",
"calendar",
]),
urgency: z.enum(["now", "today", "this_week", "whenever"]),
confidence: z.number().min(0).max(1),
rationale: z.string().max(200),
suggestedLabels: z.array(z.string()).max(3),
draftReply: z.string().optional(),
});
export type EmailTriage = z.infer<typeof EmailTriageSchema>;
Step-by-step build (TypeScript)
What follows is the production architecture I use, broken into the six pieces that matter. Code is shortened for clarity but runnable. The full repo I keep for client baselines is around 2,500 lines including tests; what you see below is the spine.
Gmail API setup
Two pieces have to be right before any of the downstream code runs. OAuth scopes - request the minimum needed for what you do, no more. And Pub/Sub push notifications, which is how Gmail tells your server a new message arrived without you polling.
// OAuth scopes - request only what you use
const SCOPES = [
  "https://www.googleapis.com/auth/gmail.readonly",
  "https://www.googleapis.com/auth/gmail.compose", // create and update drafts (this scope also authorizes send)
  "https://www.googleapis.com/auth/gmail.labels",
  "https://www.googleapis.com/auth/gmail.modify", // for labels and archive (also authorizes send)
];
// One-time: set up a Pub/Sub topic and watch the inbox
import { google } from "googleapis";
export async function startWatch(oauth: any, topicName: string) {
  const gmail = google.gmail({ version: "v1", auth: oauth });
  return gmail.users.watch({
    userId: "me",
    requestBody: {
      labelIds: ["INBOX"],
      topicName, // projects/<project>/topics/<topic>
      labelFilterBehavior: "INCLUDE",
    },
  });
}
Note the absence of gmail.send, but do not treat that as a structural guarantee on its own. Per Google's scope documentation, gmail.compose and gmail.modify also authorize messages.send and drafts.send, and the draft and label calls in this pipeline need those scopes. The guarantee therefore has to live in the code: the triage pipeline contains no send call at all, and the only send path is the one an explicit user click reaches. That discipline, enforced in review and tests, survives prompt injection and rogue model output because the model never holds a handle that can send.
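One operational detail the watch call above implies: Gmail watches expire. Google documents that watch must be re-called at least every seven days and recommends doing so daily, and the watch response carries an `expiration` timestamp in epoch milliseconds. A minimal sketch of a renewal check, assuming you persist that value per user; `shouldRenewWatch` and the 24-hour margin are illustrative choices, not part of the Gmail API.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Decide whether a stored Gmail watch should be renewed.
// `expiration` is the epoch-millisecond string returned by users.watch.
function shouldRenewWatch(
  expiration: string | null | undefined,
  nowMs: number,
  marginMs: number = DAY_MS // assumed safety margin; tune to your cron cadence
): boolean {
  if (!expiration) return true; // never watched, or the value was lost
  const expiresAt = Number(expiration);
  if (!Number.isFinite(expiresAt)) return true; // unparseable: renew to be safe
  return expiresAt - nowMs <= marginMs;
}
```

A daily cron that calls `startWatch` whenever this returns true keeps notifications flowing without tracking exact expiry times.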
Webhook handler
Gmail pushes a Pub/Sub message every time the watched mailbox changes. The payload is tiny - just an email address and a history ID. Your handler decodes it, fetches the messages added since the last history ID, and queues each one for triage.
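The decode-and-collect step can be isolated into two pure functions, which makes it testable without Pub/Sub. This is a sketch under assumptions: `decodeGmailPush` and `collectMessageIds` are hypothetical helper names, the envelope shape follows the Pub/Sub push format (`message.data` as base64-encoded JSON), and duplicate IDs are dropped defensively.

```typescript
interface GmailPush {
  emailAddress: string;
  historyId: string;
}

// Decode the Pub/Sub push envelope; return null instead of throwing on bad input.
function decodeGmailPush(body: unknown): GmailPush | null {
  const data = (body as { message?: { data?: unknown } } | null)?.message?.data;
  if (typeof data !== "string") return null;
  try {
    const parsed = JSON.parse(Buffer.from(data, "base64").toString("utf8"));
    if (typeof parsed?.emailAddress !== "string" || parsed.historyId == null) return null;
    return { emailAddress: parsed.emailAddress, historyId: String(parsed.historyId) };
  } catch {
    return null; // payload was not valid JSON
  }
}

interface HistoryRecord {
  messagesAdded?: { message?: { id?: string | null } }[];
}

// Flatten history records into unique message IDs, skipping entries without an id.
function collectMessageIds(history: HistoryRecord[]): string[] {
  const seen = new Set<string>();
  for (const record of history) {
    for (const added of record.messagesAdded ?? []) {
      const id = added.message?.id;
      if (id) seen.add(id);
    }
  }
  return Array.from(seen);
}
```

Returning null for malformed input lets the route acknowledge the push and move on, rather than throwing and inviting Pub/Sub to redeliver a payload that will never parse.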
// app/api/gmail/webhook/route.ts
import { NextRequest, NextResponse } from "next/server";
import { google } from "googleapis";
import { getUserByEmail, updateHistoryId } from "@/lib/db";
import { triageEmail } from "@/lib/triage";
export async function POST(req: NextRequest) {
const body = await req.json();
const data = JSON.parse(Buffer.from(body.message.data, "base64").toString());
const user = await getUserByEmail(data.emailAddress);
if (!user) return NextResponse.json({ ok: true });
const gmail = google.gmail({ version: "v1", auth: user.oauthClient });
const history = await gmail.users.history.list({
userId: "me",
startHistoryId: user.lastHistoryId,
historyTypes: ["messageAdded"],
});
const newIds = (history.data.history ?? [])
.flatMap((h) => h.messagesAdded ?? [])
.map((m) => m.message!.id!);
for (const id of newIds) await triageEmail(user, id);
await updateHistoryId(user.id, data.historyId);
return NextResponse.json({ ok: true });
}
Classification with structured outputs
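The triage call uses structured outputs via the Vercel AI SDK's generateObject. Schema enforcement means you never parse free-text JSON yourself and never persist a malformed category: the result is validated against the Zod schema, and a response that fails validation surfaces as a thrown error rather than a half-valid object, so give that error the same retry-or-skip handling you use for network failures. The prompt is short on purpose - long prompts hurt latency and rarely help classification accuracy.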
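The gate between a validated triage object and a written draft is worth isolating so it can be unit-tested on its own. A sketch, assuming the schema fields shown earlier; `shouldCreateDraft`, the category allow-list, and the 0.7 default threshold are illustrative choices rather than the production rule.

```typescript
type Category =
  | "action"
  | "reply_needed"
  | "fyi"
  | "newsletter"
  | "spam"
  | "receipt"
  | "calendar";

interface TriageResult {
  category: Category;
  confidence: number; // 0..1, as reported by the model
  draftReply?: string;
}

// Categories where a reply draft makes sense at all (an assumption; tune per deployment).
const DRAFTABLE: ReadonlySet<Category> = new Set<Category>(["reply_needed", "action"]);

function shouldCreateDraft(t: TriageResult, minConfidence = 0.7): boolean {
  if (!DRAFTABLE.has(t.category)) return false; // never draft replies to spam or newsletters
  if (t.confidence < minConfidence) return false; // low confidence: label only
  return (t.draftReply ?? "").trim().length > 0; // ignore empty or whitespace-only drafts
}
```

Keeping the threshold as a parameter makes it easy to start strict and loosen it per category as approved drafts accumulate.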
// lib/triage.ts
import { generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { EmailTriageSchema } from "@/schema";
import { fetchMessage, applyLabels, createDraft } from "./gmail";
import { matchTone } from "./tone";
import { writeAuditLog, isSensitive } from "./db";
export async function triageEmail(user: any, messageId: string) {
const msg = await fetchMessage(user, messageId);
if (await isSensitive(user, msg)) return; // label-aware filter
const fewShot = await matchTone(user, msg);
const { object } = await generateObject({
model: anthropic("claude-sonnet-4-6"),
schema: EmailTriageSchema,
system: `You triage email for ${user.name}. Categorize, score urgency, and if a reply is needed, draft one matching the user's tone from the examples provided. Never invent facts. If unsure, set confidence < 0.7 and skip the draft.`,
prompt: `Past replies in this user's voice:\n${fewShot}\n\nNew message:\nFrom: ${msg.from}\nSubject: ${msg.subject}\n\n${msg.body}`,
});
await applyLabels(user, messageId, [object.category, ...object.suggestedLabels]);
if (object.draftReply && object.confidence > 0.7) {
await createDraft(user, messageId, object.draftReply);
}
await writeAuditLog(user.id, messageId, object);
}
Tone matching with embeddings
The trick that takes a generic email AI to one that sounds like you is embedding-based retrieval over your own sent mail. The first time a user connects, backfill the embedding index from their Sent folder. From then on, every reply the user actually sends updates the index; discarded drafts never enter it. At draft time, embed the incoming message and pull the three most-similar past replies. Those become few-shot examples for the prompt.
// lib/tone.ts
import { embed } from "ai";
import { openai } from "@ai-sdk/openai";
import { sql } from "@/lib/db";
const MODEL = openai.embedding("text-embedding-3-large");
export async function indexSentMessage(
userId: string,
body: string,
context: string
) {
const { embedding } = await embed({ model: MODEL, value: context });
await sql`
insert into tone_index (user_id, body, context_embedding)
values (${userId}, ${body}, ${JSON.stringify(embedding)}::vector)
`;
}
export async function matchTone(user: any, msg: any) {
const ctx = `From: ${msg.from}\nSubject: ${msg.subject}\n\n${msg.body.slice(0, 800)}`;
const { embedding } = await embed({ model: MODEL, value: ctx });
const rows = await sql`
select body
from tone_index
where user_id = ${user.id}
order by context_embedding <=> ${JSON.stringify(embedding)}::vector
limit 3
`;
return rows.map((r: any, i: number) => `Example ${i + 1}:\n${r.body}`).join("\n\n");
}
Embedding-based tone matching beats fine-tuning for this use case for three reasons: it updates instantly as your tone evolves, it is roughly one tenth the engineering and ongoing cost, and it gives you per-context tone (your tone with a client is different from your tone with your CTO - embeddings capture that automatically). The full pattern is what I cover in the RAG architecture tutorial - this is a small, focused variant of the same shape.
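For intuition about the retrieval query: pgvector defines `<=>` as cosine distance (1 minus cosine similarity), so `order by ... limit 3` returns the three rows with the smallest distance to the query embedding. The same ranking can be sketched in memory. `cosineDistance` and `topK` are illustrative helpers, and the three-dimensional vectors in any test are toy data, not real embeddings.

```typescript
// Cosine distance, matching the definition behind pgvector's <=> operator.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; this sketch treats it as maximally dissimilar.
  if (normA === 0 || normB === 0) return 1;
  return 1 - dot / Math.sqrt(normA * normB);
}

interface ToneRow {
  body: string;
  embedding: number[];
}

// Return the k rows closest to the query, nearest first.
function topK(query: number[], rows: ToneRow[], k = 3): ToneRow[] {
  return rows
    .map((row) => ({ row, distance: cosineDistance(query, row.embedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k)
    .map((x) => x.row);
}
```

In production the database does this ranking; the in-memory version is mainly useful for unit tests and for sanity-checking what the few-shot examples will look like.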
Draft generation (never send)
The draft creation call uses the Gmail drafts.create endpoint, which writes into the user's normal Drafts folder threaded against the original message. From the user's perspective, a draft just shows up under the original email - same as if they had clicked Reply and walked away. No new UI to learn, no new place to check.
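// lib/gmail.ts
import { google } from "googleapis";
export async function createDraft(user: any, replyToId: string, body: string) {
  const gmail = google.gmail({ version: "v1", auth: user.oauthClient });
  const original = await gmail.users.messages.get({
    userId: "me",
    id: replyToId,
    format: "metadata",
    metadataHeaders: ["From", "Reply-To", "Subject", "Message-ID", "References"],
  });
  const headers = original.data.payload?.headers ?? [];
  // Header names are case-insensitive (Message-ID vs Message-Id), so compare lowercased.
  const get = (n: string) =>
    headers.find((h) => h.name?.toLowerCase() === n.toLowerCase())?.value ?? "";
  // Avoid "Re: Re:" when the thread subject already carries the prefix.
  const subject = /^re:/i.test(get("Subject")) ? get("Subject") : `Re: ${get("Subject")}`;
  const references = `${get("References")} ${get("Message-ID")}`.trim();
  const raw = Buffer.from(
    [
      `To: ${get("Reply-To") || get("From")}`, // honor Reply-To when the sender set one
      `Subject: ${subject}`,
      `In-Reply-To: ${get("Message-ID")}`,
      `References: ${references}`,
      "MIME-Version: 1.0",
      'Content-Type: text/plain; charset="UTF-8"', // keep non-ASCII draft text intact
      "",
      body,
    ].join("\r\n")
  ).toString("base64url");
  return gmail.users.drafts.create({
    userId: "me",
    requestBody: { message: { raw, threadId: original.data.threadId } },
  });
}
Approval UI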
The simplest approval UI is the user's existing inbox - that is the whole point of drafting natively. For users who want a triage-stream view (especially helpful when first calibrating the agent), a thin queue UI on top of the audit log works well. A minimal React shell:
// app/(app)/queue/page.tsx
"use client";
import { useState, useEffect } from "react";
export default function ApprovalQueue() {
const [items, setItems] = useState<any[]>([]);
useEffect(() => {
fetch("/api/queue").then((r) => r.json()).then(setItems);
}, []);
return (
<div className="mx-auto max-w-3xl divide-y">
{items.map((it) => (
<article key={it.id} className="py-6">
<div className="text-xs uppercase opacity-60">
{it.category} · {it.urgency} · {Math.round(it.confidence * 100)}%
</div>
<h3 className="text-lg font-medium">{it.subject}</h3>
<p className="text-sm opacity-70">{it.from}</p>
<p className="mt-2 text-sm">{it.rationale}</p>
{it.draft && (
<pre className="mt-3 whitespace-pre-wrap rounded bg-black/5 p-3 text-sm">
{it.draft}
</pre>
)}
<div className="mt-3 flex gap-2">
<button onClick={() => fetch(`/api/queue/${it.id}/send`, { method: "POST" })}>
Send
</button>
<button onClick={() => fetch(`/api/queue/${it.id}/edit`)}>Edit</button>
<button onClick={() => fetch(`/api/queue/${it.id}/discard`, { method: "POST" })}>
Discard
</button>
</div>
</article>
))}
</div>
);
}
Note that "Send" in this UI is the only thing in the entire system that triggers an outbound message - and it requires an explicit click. That single chokepoint is the architectural guarantee that no autonomous send ever happens.
Privacy and data residency
Email is the most sensitive corpus most people have. The three non-negotiables for a defensible architecture:
- Run on the user's OAuth keys, not a shared service account. Each user's Gmail or Graph credentials live encrypted in your database, scoped to their account only. A breach of one credential never touches another user's mail.
- BYO LLM. Offer a setting where the user supplies their own Anthropic or OpenAI key. Their email bodies then flow directly to their LLM contract, never touching your billing or your logs. For consumer pricing, default to a shared key with a zero-data-retention contract - Anthropic and OpenAI both offer ZDR for paying customers if you ask.
- Label-aware filtering. Users mark labels like "legal", "medical", "personal", "board" as sensitive. Any message touched by those labels bypasses the LLM entirely - no classification, no draft, no embedding, no audit log beyond "skipped: sensitive".
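A minimal sketch of the label-aware filter, assuming the message arrives with its label names already resolved and the user's sensitive set is stored lowercase. The `isSensitive` helper used in the triage code takes `(user, msg)` and is not shown in this post; `hasSensitiveLabel` and `skippedAuditEntry` below are hypothetical names for a pure core it could delegate to. Matching any segment of a nested label such as "Work/Legal" is a deliberately conservative assumption.

```typescript
interface LabeledMessage {
  id: string;
  labelNames: string[]; // human-readable label names, not provider label IDs
}

// True when any label, or any segment of a nested label, is in the sensitive set.
function hasSensitiveLabel(msg: LabeledMessage, sensitive: ReadonlySet<string>): boolean {
  return msg.labelNames.some((name) =>
    name
      .toLowerCase()
      .split("/")
      .some((segment) => sensitive.has(segment.trim()))
  );
}

// The only trace a skipped message leaves: no body, no rationale, no embedding.
function skippedAuditEntry(messageId: string) {
  return {
    messageId,
    outcome: "skipped: sensitive" as const,
    at: new Date().toISOString(),
  };
}
```

Running this check before the message body is ever loaded into a prompt or an embedding call is what makes the "bypasses the LLM entirely" promise hold.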
The 5 rules for trust
Trust in an email agent is built by adherence to a small number of explicit rules, each surfaced to the user during onboarding. These are the five I write into every client contract:
- Never send without explicit approval. The agent drafts. The user sends. The only code path that hits Gmail's send endpoint is the user clicking Send.
- Never train on user data. No shared fine-tunes, no aggregated embeddings across users, no log shipping of bodies to observability. Per-user data stays per-user.
- Label-aware filtering. Sensitive labels skip the pipeline entirely, by default. The user can override per-label, but the default is private.
- Full audit log. Every classification, every draft, every label change, every skipped message is recorded with timestamp and rationale. The user can review the full history at any time.
- Instant revoke. One click revokes OAuth, deletes embeddings, deletes audit logs, deletes drafts the agent created that the user did not send. Sub-five-minute total deletion.
Productivity metrics
The metrics worth tracking, and the rough numbers from a mid-volume knowledge worker (about 120 emails per day) after six weeks of agent use:
| Metric | Baseline | With agent | Delta |
|---|---|---|---|
| Time spent in inbox per day | ~110 min | ~55 min | 50% reduction |
| Draft acceptance rate (send as-is) | - | 62% | +1 min saved per accepted draft |
| Draft edit rate (small edits) | - | 28% | ~30s saved per edited draft |
| Draft discard rate | - | 10% | No value lost (user replies normally) |
| False categorization rate | - | 4.5% | Mostly newsletters miscoded as FYI |
| Median time-to-reply on action items | 3.2 hr | 1.1 hr | Pre-drafted replies surface faster |
Two numbers matter more than the rest. Draft acceptance - the percentage of drafts the user sends as-is - is the headline tone-match metric. Anything below 50% means your tone matching is off; over 70% and the model has started writing in your voice well enough that it becomes hard to tell which threads you replied to and which the agent drafted. False categorization rate is the trust metric: above 10% and users lose confidence; below 5% and they stop checking the categories and just trust the labels.
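Both headline metrics fall out of the audit log with a few lines of arithmetic. The sketch below assumes each draft is recorded with one of three outcomes; `DraftOutcome`, `draftRates`, `toneHealth`, and `categoryTrust` are illustrative names, and the cut-offs are the ones quoted in the paragraph above.

```typescript
type DraftOutcome = "sent_as_is" | "sent_edited" | "discarded";

interface DraftRates {
  acceptance: number; // sent without edits
  edit: number; // sent after small edits
  discard: number; // thrown away
}

function draftRates(outcomes: DraftOutcome[]): DraftRates {
  const total = outcomes.length;
  if (total === 0) return { acceptance: 0, edit: 0, discard: 0 };
  const share = (o: DraftOutcome) => outcomes.filter((x) => x === o).length / total;
  return {
    acceptance: share("sent_as_is"),
    edit: share("sent_edited"),
    discard: share("discarded"),
  };
}

// Below 50% acceptance the tone matching is off; above 70% it is strong.
function toneHealth(acceptance: number): "off" | "ok" | "strong" {
  if (acceptance < 0.5) return "off";
  return acceptance > 0.7 ? "strong" : "ok";
}

// Above 10% false categorization users lose confidence; below 5% they trust the labels.
function categoryTrust(falseRate: number): "losing_trust" | "watch" | "trusted" {
  if (falseRate > 0.1) return "losing_trust";
  return falseRate < 0.05 ? "trusted" : "watch";
}
```

With the table's numbers, 62% acceptance lands in the middle band and a 4.5% false categorization rate lands in the trusted band.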
Multi-account setup
Most users connect both work and personal email. The mistake to avoid is sharing the tone index across them. Your tone with your spouse and your tone with your enterprise customer are different and the agent should never bleed one into the other.
The right model is per-account contexts: each connected account has its own OAuth credential, its own embedding namespace for tone matching, its own category configuration (work email needs project-related and vendor categories; personal email needs friends, family, and receipts). The LLM, the infrastructure, and the approval UI are shared. The data is strictly partitioned by account_id. Database queries always join on account_id - a missing join is the kind of bug that ends with a vendor reply quoting your wife's last text.
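One way to make a missing account filter hard to write is to give the tone index an API that cannot be called without an account. The sketch below is an in-memory illustration of that idea, not the Postgres implementation: `PartitionedToneIndex` is a hypothetical name, and similarity ranking is left out so the partitioning stays in focus.

```typescript
interface ToneExample {
  accountId: string;
  body: string;
}

class PartitionedToneIndex {
  private rows: ToneExample[] = [];

  add(accountId: string, body: string): void {
    this.rows.push({ accountId, body });
  }

  // accountId is a required argument: no method on this class reads across accounts.
  examplesFor(accountId: string, limit = 3): string[] {
    return this.rows
      .filter((r) => r.accountId === accountId)
      .slice(-limit) // most recent entries; similarity ranking omitted for brevity
      .map((r) => r.body);
  }
}
```

The SQL equivalent is a data-access function that always takes the account identifier and always puts it in the where clause, so call sites never assemble the query themselves.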
For users with three or more accounts, a quick toggle in the UI to switch active context helps. The agent runs the same pipeline for each, but the user sees one approval queue at a time. Cross-context signals (someone emailed your work address about a personal calendar item) get flagged for explicit user review rather than auto-routed.
Caldra AI case study
Caldra AI is what I built for myself before clients asked for it. Calendar plus email, both running the same triage-only architecture. Honest notes from running it as my primary inbox for ten months:
What worked. The drafts-only constraint compounded trust faster than I expected. Within three weeks I stopped editing most drafts. The tone matching via past-sent retrieval was the single highest-leverage decision - clients regularly cannot tell which replies I wrote and which started as agent drafts. The audit log ended up being more valuable than projected, both for debugging misclassifications and as a record I could share with a security-cautious client.
What did not work. The first version had ten categories. Accuracy was 78%. Collapsing to seven took it to 95%. I spent weeks trying to fine-tune draft style on my own corpus before embedding-based retrieval shipped - fine-tuning was wasted effort, the embedding approach beat it on every metric and updated in real time. Pub/Sub debugging is genuinely painful and I rewrote the webhook handler three times before it stopped silently dropping messages during history-ID gaps.
What I changed for clients. Stricter sensitive-label defaults (legal, medical, financial, board all pre-enabled). Per-client tone profiles instead of just per-account. A heavier audit log view because enterprise security teams want one. BYO-key support because two of the first three clients required it. None of this changed the core architecture - it is still the same six-piece pipeline.
Build vs Superhuman, Shortwave, Spike
The SaaS email AI space is mature enough that "just build it" is no longer the obvious answer for individuals. The honest comparison:
| Path | Time to ship | Cost | Best for |
|---|---|---|---|
| Superhuman | 5 min | $30/user/mo | Solo users, polished UX, accepts vendor data flow |
| Shortwave | 5 min | $25/user/mo | Heavy AI-first inbox use, Gmail only |
| Spike | 5 min | $10 to $20/user/mo | Chat-style inbox, team mailbox use |
| Custom triage agent | 2 to 4 weeks | $1 to $8/user/mo in LLM and infra | Privacy-first, custom categories, deep CRM ties |
The decision is rarely about productivity ceiling - all four can save the same minutes per day for an individual. It is about who owns the data and how custom the workflow needs to be. For a sales team with a specific lead-scoring flow, or a healthcare clinic with HIPAA needs, or a founder who treats inbox triage as a competitive moat, the custom build wins. For everyone else, Superhuman or Shortwave are excellent and you should stop reading and go install one. Some of the same architectural patterns appear in my AI scheduling assistant review - the build-vs-buy crossover lands in roughly the same place.
Cost per user per month
The LLM cost math is straightforward and the number is much lower than people expect. The breakdown for a mid-volume user (120 incoming emails per day, of which 60 generate drafts):
| Line item | Unit | Per user / month |
|---|---|---|
| Embeddings (incoming + sent) | ~5K embeddings, $0.13/M tokens | $0.20 |
| Classification + draft (Claude Sonnet 4.6) | ~3,600 calls, ~1.5K in / 0.4K out each | $3.80 |
| Postgres + pgvector (managed) | Shared tier, ~50MB / user | $0.50 |
| Pub/Sub + webhook infra | ~3,600 events | $0.05 |
| Total | - | ~$4.55 |
For light users (50 emails per day) the same math lands at $1.50 to $2.50. For heavy users (300+ emails per day) it climbs to $8 to $12. Prompt caching on the system prompt and the per-user tone-profile section cuts the LLM line item by another 40% if you have steady traffic - the caching tricks I cover in the OpenAI API cost post apply directly to Anthropic too.
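The table's line items are simple products, so they are easy to recompute for your own volumes. The sketch below is a cost model under stated assumptions: `llmMonthlyCost` and `embeddingMonthlyCost` are illustrative names, the per-million-token rates are inputs you take from your provider's current price sheet, and the 300-token average per embedding is an assumption chosen to show the arithmetic. Because the table does not state the per-token rates behind the $3.80 figure, it is worth recomputing that line rather than reusing it; it scales linearly with both the rates and the call count (3,600 calls is 120 emails per day over 30 days).

```typescript
interface Usage {
  callsPerMonth: number;
  inputTokensPerCall: number;
  outputTokensPerCall: number;
}

interface TokenRates {
  inputPerMTok: number; // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

// cachingDiscount is the fraction saved on the LLM line (0.4 for the 40% quoted above).
function llmMonthlyCost(u: Usage, r: TokenRates, cachingDiscount = 0): number {
  const inputCost = ((u.callsPerMonth * u.inputTokensPerCall) / 1e6) * r.inputPerMTok;
  const outputCost = ((u.callsPerMonth * u.outputTokensPerCall) / 1e6) * r.outputPerMTok;
  return (inputCost + outputCost) * (1 - cachingDiscount);
}

function embeddingMonthlyCost(count: number, avgTokens: number, pricePerMTok: number): number {
  return ((count * avgTokens) / 1e6) * pricePerMTok;
}

// Embedding line from the table: ~5K embeddings at $0.13 per million tokens.
// 5,000 * 300 tokens = 1.5M tokens, so about $0.195, which rounds to the table's $0.20.
const embeddingLine = embeddingMonthlyCost(5000, 300, 0.13);
```

For the LLM line, pass `{ callsPerMonth: 3600, inputTokensPerCall: 1500, outputTokensPerCall: 400 }` together with whatever rates your contract specifies.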
At those numbers, the unit economics flip the usual SaaS math: Shortwave's $25/user/mo (Superhuman is $30) sits roughly $20/user/mo above the LLM and infrastructure cost of a custom build. For a team of 50 that gap is about $12K/year, set against a build that takes 2 to 4 weeks. For a solo user or a team of 5, the SaaS path still wins on time-to-value.
If you are scoping an email agent build and want a senior engineer who has shipped one to real users, my AI integration and AI agent development practices cover this exact scope. I work with teams worldwide, and you can also hire an AI developer in Kosovo directly. Same person who built Caldra AI, runs it every day, and has the audit logs to prove the drafts-only-never-sends discipline holds in production.
Frequently asked questions
What is AI email automation?
AI email automation uses a language model to read, classify, label, summarize, and draft replies to incoming email. In 2026 the useful variant is triage-only: the agent never sends on its own. It writes Gmail or Outlook drafts that you approve from your normal inbox. The pieces under the hood are a mail-provider API (Gmail or Microsoft Graph), a webhook that fires on every new message, an LLM that produces structured classifications and drafts, an embedding store of your past sent mail for tone matching, and a queue UI for approvals.
Is AI email automation safe to give my inbox access?
It can be, if the architecture is right. The three rules: the agent runs on your own OAuth keys and never sends without explicit approval; nothing trains on your data (no shared fine-tunes, no logging of message bodies to third parties); and sensitive labels (legal, medical, finance, anything you flag) are filtered out before the LLM ever sees them. If a vendor cannot give you a clear yes on all three, do not connect your inbox. If you build it yourself, you control all three by construction.
Does an email agent work with both Gmail and Outlook?
Yes - Google Workspace via the Gmail API and Microsoft 365 via Microsoft Graph. The architecture is identical: OAuth into the provider, subscribe to push notifications (Gmail Pub/Sub or Graph subscriptions), receive a webhook per change, fetch new messages, classify, draft. The provider-specific code is about 200 lines per side. Everything downstream - the LLM, the embedding store, the approval UI - is shared.
What does an AI email agent cost per user per month?
On a custom build with Claude Sonnet for drafts and OpenAI for embeddings, a busy user (about 200 emails per day) costs roughly $4 to $8 in LLM spend. Light users (50 per day) land at $1 to $2. Add $0.50 to $1 for Postgres and webhook infrastructure. SaaS tools like Superhuman, Shortwave, and Spike charge roughly $10 to $30 per user per month, so the build-it-yourself unit economics are strong - but you pay for the build in engineering time. Crossover is around 50 to 100 users.
Will the agent learn my tone?
If you build it right, yes - using retrieval rather than fine-tuning. The pattern: embed every message you have ever sent, store the embeddings in pgvector, and on each draft retrieve the three most-similar past replies as few-shot examples for the prompt. The model picks up your voice, your sign-off, your length, your formality. Fine-tuning sounds better in theory but is overkill - embeddings give you the same effect with one tenth the engineering cost and instant updates as your tone evolves.
Can the agent send replies automatically?
It should not. Every production email agent I have built lands on the same rule: drafts only, no autosend. The cost of one wrong autonomous send to the wrong recipient - leaked deal, fired customer, lawsuit risk - outweighs the time saved across hundreds of correct ones. Drafts that sit in your normal Gmail or Outlook draft list let you approve from any device with one tap. The autonomy unlock is faster classification and pre-written drafts, not removing the human.
How is this different from Superhuman or Shortwave?
Superhuman, Shortwave, and Spike are polished consumer products with strong AI features bolted onto their own inbox UI. They are great if you want a turnkey experience and do not mind running mail through a vendor. A custom triage agent wins when you need to keep mail on your own infrastructure, integrate deeply with internal tools (CRM, ticketing, calendar), build domain-specific classifiers (sales-specific or healthcare-specific categories), or run on your own LLM keys for cost and privacy reasons. The build path is 2 to 4 weeks; the buy path is 5 minutes.
How do I handle multiple email accounts?
Treat each account as a separate context with its own OAuth credential, its own embedding namespace for tone matching, and its own classifier configuration. Work email gets a different category set (project-related, internal, vendor, customer) than personal email (friends, family, receipts, newsletters). Share the LLM, the infrastructure, and the approval UI across accounts. Keep the data strictly partitioned - a sales draft must never retrieve few-shot examples from personal email or vice versa.