Build an AI Voice Agent with Twilio and ElevenLabs (2026)
By Ergini, Software & AI Developer in Pristina, Kosovo
TL;DR
A 24/7 AI voice agent on a real phone number is now a weekend project. This walks through the full build - Twilio media streams, ElevenLabs streaming TTS, sub-800ms latency, interruption handling, and the graceful human handoff that keeps it useful.
A 24/7 AI voice agent on a real phone number used to take a quarter of engineering and a six-figure budget. In 2026 it is a focused weekend project, and the same stack that powers VC Automation, my outbound voice agent for venture capital outreach, is the one this post walks through. By the end you will have a working architecture, real code for each piece, a latency budget that holds at sub-800ms, and the production gotchas that bite every team on their first ship.
What "voice agent" means in 2026
A voice agent is a real-time, bidirectional audio system that holds a structured conversation over a phone line or a WebRTC stream. The bar is no longer "answers questions like a chatbot but with audio." It is sub-1-second response latency, mid-sentence interruption handling, durable conversation state, tool calls that touch real systems (calendars, CRMs, payment APIs), and a graceful escalation path to a human. Anything less and callers hang up inside 30 seconds.
The use cases that actually justify the build are narrow and worth a lot. Inbound: appointment booking for healthcare, dental, and trades; tier-one support deflection for SaaS; order status for ecommerce. Outbound: lead qualification, no-show recovery, payment reminders, survey campaigns. The unifying property is high call volume against a structured workflow where the marginal call is cheap to handle if it does not need a human. Anything open-ended (therapy, complex sales, anything legal) still belongs with a person on the line.
The stack - 4 components
Every production voice agent decomposes into the same four moving parts. The provider choice for each component shifts pricing, latency, and the corner cases you will hit, but the shape of the system is the same:
| Component | What it does | Top picks in 2026 | Typical latency |
|---|---|---|---|
| Telephony | PSTN connectivity, media streaming, DTMF, call control | Twilio, Vonage, Telnyx | Network only (50 to 150 ms) |
| Speech-to-text | Streaming transcription with partial hypotheses | Deepgram Nova-3, OpenAI Whisper, Azure | 100 to 250 ms (final word to transcript) |
| LLM brain | Decides the next utterance, calls tools, manages conversation state | Claude Sonnet 4.6, GPT-5-mini, Llama 3.3 70B | 200 to 500 ms time-to-first-token |
| Text-to-speech | Streams synthesized audio back to the caller | ElevenLabs Conversational v2, Cartesia, OpenAI TTS | 200 to 400 ms to first audio chunk |
The default 2026 stack I reach for is Twilio + Deepgram + Claude Sonnet 4.6 + ElevenLabs. Twilio because PSTN coverage is unmatched and the Media Streams API is stable. Deepgram because Nova-3 holds the best accuracy-versus-latency tradeoff for streaming. Claude Sonnet 4.6 for instruction-following and tool-use fidelity (see my Claude vs ChatGPT breakdown). ElevenLabs because the voice quality matters more than any other surface - callers tolerate slow agents, they hang up on robotic ones.
Architecture - how audio flows
The shape of the system is a duplex pipe. Audio flows from the caller through telephony into your server, gets transcribed, fed to the LLM, and the model's output gets synthesized back into audio that flows the reverse path to the caller. Every stage streams. Nothing waits for a turn to complete.
- Inbound call hits a Twilio phone number. Twilio requests TwiML from your server.
- TwiML response opens a Media Stream (WebSocket) to your server. Twilio starts forwarding 20ms audio frames in mulaw/8kHz.
- Your server resamples mulaw 8kHz to linear16 16kHz and forwards to Deepgram's streaming endpoint.
- Deepgram emits partial transcripts and a final transcript per utterance.
- LLM call fires on each final transcript with the conversation history. Streams tokens back as they arrive.
- Sentence buffer watches the token stream and emits full sentences as soon as a terminal punctuation lands.
- ElevenLabs streaming TTS receives each sentence and streams PCM audio chunks back.
- Your server converts the PCM output (24kHz in the TTS code later in this post) to mulaw 8kHz and base64-encodes 20ms chunks into Twilio Media Stream messages.
- Twilio plays the audio to the caller. The loop runs until the call ends or escalates.
Two things in this loop are non-negotiable. First, every stage uses streaming. Batch APIs add 500ms to 2s of buffering and break the latency budget. Second, the LLM stream feeds the TTS stream sentence-by-sentence - never wait for the full LLM completion before starting TTS, or you lose 1 to 2 seconds on every turn.
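The sentence-by-sentence handoff can be sketched as a small buffer that sits between the token stream and TTS. This is a minimal sketch under stated assumptions, not the exact implementation used later in the post: the SentenceBuffer class and its push and flush methods are hypothetical names, and the boundary rule (terminal punctuation followed by whitespace) is an assumption chosen so that a price like $3.50 is not cut in half while tokens are still arriving.

```typescript
// Hypothetical helper: accumulates streamed LLM deltas and releases whole sentences.
class SentenceBuffer {
  private buf = "";

  // Feed one streamed delta; returns any sentences that it completed.
  push(delta: string): string[] {
    this.buf += delta;
    const out: string[] = [];
    // Assumed rule: a sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let m: RegExpExecArray | null;
    while ((m = boundary.exec(this.buf)) !== null) {
      const end = m.index + m[0].length;
      const sentence = this.buf.slice(start, end).trim();
      if (sentence) out.push(sentence);
      start = end;
    }
    this.buf = this.buf.slice(start); // keep the unfinished tail
    return out;
  }

  // Call once the LLM stream ends to release whatever is left.
  flush(): string[] {
    const rest = this.buf.trim();
    this.buf = "";
    return rest ? [rest] : [];
  }
}

const sentenceBuffer = new SentenceBuffer();
const spoken: string[] = [];
for (const delta of ["Sure", ". The fee is $3", ".50 today. Any", "thing else?"]) {
  spoken.push(...sentenceBuffer.push(delta));
}
spoken.push(...sentenceBuffer.flush());
console.log(spoken);
```

The trade-off in this sketch is that the last sentence of a turn is only released by flush, and abbreviations such as "Dr. Smith" still split early. Both are tolerable for short receptionist-style replies but worth testing against your own prompts.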
Latency math - why sub-800ms matters
Conversation analysis from natural human dialogue puts comfortable turn-taking at 200ms to 500ms of silence. Past about 700ms callers start to feel hesitation. Past 1.5s they assume the line dropped and start talking - which collides with your TTS playback. The target for a voice agent that feels "real" is total response latency (last word from user to first audio out) under 800ms.
| Stage | Budget | Notes |
|---|---|---|
| End-of-utterance detection | 100 to 200 ms | Deepgram endpointing config, tune for your domain |
| STT final transcript | 50 to 100 ms | Already streamed during speech, finalization is fast |
| LLM time-to-first-token | 200 to 500 ms | Claude Sonnet 4.6 / GPT-5-mini, short prompt |
| First sentence buffered | 100 to 200 ms | Few tokens until terminal punctuation |
| TTS first audio chunk | 200 to 400 ms | ElevenLabs streaming endpoint, low-latency model |
| Total perceptible | 650 to 900 ms | Audible response in caller's ear |
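
Adding up the rows is a quick sanity check on the budget. The sketch below sums the lower and upper bounds from the table; the hotspots helper is a hypothetical addition that flags any stage whose worst case alone eats a large share of the target.

```typescript
// Stage budgets copied from the table above, in milliseconds.
type Stage = { name: string; min: number; max: number };

const stages: Stage[] = [
  { name: "End-of-utterance detection", min: 100, max: 200 },
  { name: "STT final transcript", min: 50, max: 100 },
  { name: "LLM time-to-first-token", min: 200, max: 500 },
  { name: "First sentence buffered", min: 100, max: 200 },
  { name: "TTS first audio chunk", min: 200, max: 400 },
];

const bestCaseMs = stages.reduce((sum, s) => sum + s.min, 0);
const worstCaseMs = stages.reduce((sum, s) => sum + s.max, 0);

// Hypothetical helper: stages whose upper bound exceeds `share` of the target.
function hotspots(targetMs: number, share: number): string[] {
  return stages.filter((s) => s.max > targetMs * share).map((s) => s.name);
}

console.log({ bestCaseMs, worstCaseMs, hotspots: hotspots(800, 0.4) });
```

The lower bounds sum to 650 ms, matching the table. The upper bounds sum to 1,400 ms, so the 900 ms figure in the total row only holds if the worst cases do not all land on the same turn. The LLM and TTS stages are the two worth instrumenting first.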
Two tricks bring the total down further. Speculative TTS: when the LLM's first 20 tokens look like a sentence committed to a direction (greeting, acknowledgement, confirmation), start TTS before the sentence is fully streamed. Tool-call hiding: if a tool call is going to take 500ms+, emit a filler utterance ("Let me check that for you") before the tool fires so the caller never hears silence.
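Tool-call hiding can be sketched as a timer wrapped around the tool promise. This is a minimal sketch: withFiller and its parameters are hypothetical names, and in the real agent the sayFiller callback would hand a short line to TTS.

```typescript
// Hypothetical helper: await a slow tool call, but trigger a filler line if it
// has not settled within `thresholdMs`, so the caller never sits in silence.
async function withFiller<T>(
  work: Promise<T>,
  sayFiller: () => void,
  thresholdMs = 500,
): Promise<T> {
  const timer = setTimeout(sayFiller, thresholdMs);
  try {
    return await work;
  } finally {
    // If the tool settled first (or failed), the filler is never spoken.
    clearTimeout(timer);
  }
}

// Demo with short timings: one fast "tool" and one slow one.
const delay = <T>(ms: number, value: T) =>
  new Promise<T>((resolve) => setTimeout(() => resolve(value), ms));

const said: string[] = [];
const demo = Promise.all([
  withFiller(delay(5, "fast"), () => { said.push("filler for fast call"); }, 50),
  withFiller(delay(120, "slow"), () => { said.push("Let me check that for you."); }, 50),
]);
demo.then((results) => console.log(results, said));
```

Only the slow call triggers the filler. Keep filler lines short, since the real answer has to queue behind them.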
Step-by-step build (Node.js + Twilio)
The rest of the post walks through the actual code. Stack: Node 20, Twilio, Deepgram, the Vercel AI SDK with Anthropic, and ElevenLabs. Everything below is shortened for clarity but it is real, runnable code. Wire the pieces together and you have a working voice agent.
Project setup
Spin up a Node project with Express and a WebSocket server. The WebSocket handles the Twilio Media Stream, and Express serves the TwiML webhook Twilio hits when a call comes in.
// package.json
{
  "name": "voice-agent",
  "type": "module",
  "dependencies": {
    "@ai-sdk/anthropic": "^1.0.5",
    "ai": "^4.0.20",
    "@deepgram/sdk": "^3.9.0",
    "elevenlabs": "^1.50.0",
    "express": "^4.21.2",
    "ws": "^8.18.0",
    "twilio": "^5.4.0",
    "zod": "^3.24.1"
  }
}
// .env
TWILIO_ACCOUNT_SID=AC...
TWILIO_AUTH_TOKEN=...
DEEPGRAM_API_KEY=...
ANTHROPIC_API_KEY=sk-ant-...
ELEVENLABS_API_KEY=...
PUBLIC_HOST=voice.yourdomain.com
Twilio webhook handler
When a call hits your Twilio number, Twilio POSTs to your webhook and expects TwiML in return. The TwiML below opens a bidirectional Media Stream to your WebSocket server. The <Connect> verb keeps the call alive for the duration of the stream.
// src/server.ts
import express from "express";
import { WebSocketServer } from "ws";
import { handleMediaStream } from "./media-stream.js";
const app = express();
app.use(express.urlencoded({ extended: false }));
app.post("/twilio/voice", (req, res) => {
  const host = process.env.PUBLIC_HOST!;
  const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say voice="Polly.Joanna">Connecting you to your AI assistant. This call is recorded.</Say>
  <Connect>
    <Stream url="wss://${host}/media" />
  </Connect>
</Response>`;
  res.type("text/xml").send(twiml);
});
const server = app.listen(3000);
const wss = new WebSocketServer({ server, path: "/media" });
wss.on("connection", handleMediaStream);
Three things to note. The <Say> block discloses recording - many jurisdictions require this and it is cheaper than a regulatory complaint. The <Stream> URL must be WSS over a public hostname (ngrok in dev, your domain in prod). And the WebSocket path must match what you return in the TwiML - Twilio will not retry on mismatch.
Media stream proxy
Twilio sends 20ms frames of mulaw-encoded 8kHz audio over the WebSocket as base64. Deepgram and most STT providers want linear16 PCM at 16kHz. The proxy below decodes inbound, resamples up, and forwards to the STT session. Outbound runs the same in reverse.
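The proxy imports mulawToLinear16 and upsample8to16 from ./audio.js, and the TTS code later imports downsample24to8 and linear16ToMulaw from the same module, but the post never shows that file. Here is a minimal sketch of what it could contain. The function names mirror those imports; the implementations are assumptions: standard G.711 mu-law conversion plus deliberately naive resamplers (linear interpolation up, three-sample averaging down) working on typed arrays instead of Buffers.

```typescript
// Sketch of src/audio.ts (export these four functions in a real module).
const MULAW_BIAS = 0x84; // 132, the standard mu-law bias
const MULAW_CLIP = 32635; // largest magnitude that still fits after adding the bias

function decodeMulawByte(byte: number): number {
  const u = ~byte & 0xff; // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS;
  return sign ? -magnitude : magnitude;
}

function encodeMulawSample(sample: number): number {
  const sign = sample < 0 ? 0x80 : 0;
  const magnitude = Math.min(Math.abs(sample), MULAW_CLIP) + MULAW_BIAS;
  // The exponent is the position of the highest set bit, counted from bit 7.
  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; mask >>= 1) exponent--;
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

function mulawToLinear16(mulaw: Uint8Array): Int16Array {
  const out = new Int16Array(mulaw.length);
  for (let i = 0; i < mulaw.length; i++) out[i] = decodeMulawByte(mulaw[i]);
  return out;
}

function linear16ToMulaw(pcm: Int16Array): Uint8Array {
  const out = new Uint8Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) out[i] = encodeMulawSample(pcm[i]);
  return out;
}

// 8kHz -> 16kHz: keep each sample and insert the midpoint to its neighbour.
function upsample8to16(pcm: Int16Array): Int16Array {
  const out = new Int16Array(pcm.length * 2);
  for (let i = 0; i < pcm.length; i++) {
    const next = i + 1 < pcm.length ? pcm[i + 1] : pcm[i];
    out[2 * i] = pcm[i];
    out[2 * i + 1] = (pcm[i] + next) >> 1;
  }
  return out;
}

// 24kHz -> 8kHz: average each group of three samples (a crude low-pass filter).
function downsample24to8(pcm: Int16Array): Int16Array {
  const out = new Int16Array(Math.floor(pcm.length / 3));
  for (let i = 0; i < out.length; i++) {
    out[i] = Math.round((pcm[3 * i] + pcm[3 * i + 1] + pcm[3 * i + 2]) / 3);
  }
  return out;
}

console.log(Array.from(upsample8to16(mulawToLinear16(Uint8Array.from([0xff, 0x00, 0x80])))));
```

To plug this into the post's code you would still need to convert between Node Buffers and Int16Array views at the edges, and carry leftover bytes between TTS chunks when a chunk does not end on a three-sample boundary. A production build would likely swap the resamplers for a proper filter.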
// src/media-stream.ts
import { WebSocket } from "ws";
import { mulawToLinear16, upsample8to16 } from "./audio.js";
import { startDeepgram } from "./stt.js";
import { startAgent } from "./agent.js";
export async function handleMediaStream(ws: WebSocket) {
  let streamSid: string | null = null;
  const agent = await startAgent(ws);
  // Final transcripts drive the agent; partials are passed along for barge-in logic.
  const deepgram = startDeepgram((transcript, isFinal) => {
    if (isFinal) agent.onUserUtterance(transcript);
    else agent.onPartial(transcript);
  });
  ws.on("message", (raw) => {
    const msg = JSON.parse(raw.toString());
    if (msg.event === "start") {
      // "start" arrives before any audio and carries the stream identifier.
      streamSid = msg.start.streamSid;
      agent.setStreamSid(msg.start.streamSid);
    } else if (msg.event === "media") {
      // Inbound audio: base64 mulaw 8kHz -> linear16 8kHz -> linear16 16kHz.
      const mulaw = Buffer.from(msg.media.payload, "base64");
      const pcm8k = mulawToLinear16(mulaw);
      const pcm16k = upsample8to16(pcm8k);
      deepgram.send(pcm16k);
    } else if (msg.event === "stop") {
      deepgram.close();
      agent.close();
    }
  });
}
Deepgram STT streaming
Deepgram's Nova-3 model with streaming endpointing returns partial hypotheses every 100 to 200ms and a final transcript when speech ends. Tune endpointing to your domain - shorter values mean snappier turn-taking but more accidental cut-offs on slow speakers. 300ms is a sane default for English.
// src/stt.ts
import { createClient, LiveTranscriptionEvents } from "@deepgram/sdk";
const dg = createClient(process.env.DEEPGRAM_API_KEY!);
export function startDeepgram(
  onTranscript: (text: string, isFinal: boolean) => void
) {
  const connection = dg.listen.live({
    model: "nova-3",
    language: "en-US",
    encoding: "linear16",
    sample_rate: 16000,
    interim_results: true,
    endpointing: 300,
    smart_format: true,
    vad_events: true,
  });
  connection.on(LiveTranscriptionEvents.Transcript, (data) => {
    const transcript = data.channel.alternatives[0]?.transcript ?? "";
    if (!transcript) return;
    onTranscript(transcript, data.is_final ?? false);
  });
  return {
    send: (pcm: Buffer) => connection.send(pcm),
    close: () => connection.finish(),
  };
}
LLM brain with tool calling
The agent loop holds the conversation state, fires the LLM on each final user utterance, and streams tokens out to a sentence buffer that triggers TTS. Tools are declared with Zod schemas - the model can call lookupAvailability, bookMeeting, or transferToHuman. The system prompt is short on purpose: long prompts slow first-token latency.
// src/agent.ts
import { streamText, tool } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";
import { speak, stopSpeaking } from "./tts.js";
import { transferCall } from "./transfer.js";
// Assumption: db is your own data layer exposing findSlots() and book(). It is not shown in this post.
import { db } from "./db.js";
const SYSTEM = `You are Ava, the AI receptionist for Acme Dental.
You book appointments, answer basic questions, and transfer to a human when needed.
Be concise - one or two sentences per turn. Never invent dates, prices, or hours.
If you do not know, say so and offer to transfer.`;
export async function startAgent(ws: any) {
  const history: { role: "user" | "assistant"; content: string }[] = [];
  let streamSid: string | null = null;
  let speaking = false;
  async function onUserUtterance(text: string) {
    // Barge-in: a new utterance while the agent is talking cuts playback first.
    if (speaking) await stopSpeaking(ws, streamSid!);
    history.push({ role: "user", content: text });
    const result = streamText({
      model: anthropic("claude-sonnet-4-6"),
      system: SYSTEM,
      messages: history,
      maxTokens: 200,
      // Allow a follow-up generation after a tool result. With the default of one step,
      // the turn ends at the tool call and the caller hears nothing.
      maxSteps: 3,
      tools: {
        lookupAvailability: tool({
          description: "Find open appointment slots in the next 14 days.",
          parameters: z.object({
            dayPreference: z.enum(["morning", "afternoon", "any"]),
          }),
          execute: async ({ dayPreference }) => {
            return { slots: await db.findSlots(dayPreference) };
          },
        }),
        bookMeeting: tool({
          description: "Book an appointment at the given ISO timestamp.",
          parameters: z.object({
            isoTime: z.string(),
            patientName: z.string(),
            phone: z.string(),
          }),
          execute: async (input) => db.book(input),
        }),
        transferToHuman: tool({
          description: "Warm-transfer to the front desk.",
          parameters: z.object({ reason: z.string() }),
          execute: async ({ reason }) => transferCall(streamSid!, reason),
        }),
      },
    });
    // Hand each complete sentence to TTS as soon as its terminal punctuation arrives.
    let buf = "";
    speaking = true;
    for await (const delta of result.textStream) {
      buf += delta;
      const sentences = buf.match(/[^.!?]+[.!?]+/g);
      if (sentences) {
        for (const s of sentences) await speak(ws, streamSid!, s.trim());
        buf = buf.replace(sentences.join(""), "");
      }
    }
    // Speak whatever is left once the stream ends.
    if (buf.trim()) await speak(ws, streamSid!, buf.trim());
    speaking = false;
    const full = await result.text;
    history.push({ role: "assistant", content: full });
  }
  return {
    setStreamSid: (s: string) => (streamSid = s),
    onUserUtterance,
    onPartial: (_partial: string) => {},
    close: () => {},
  };
}
ElevenLabs streaming TTS
ElevenLabs' streaming endpoint returns PCM audio chunks as soon as synthesis starts. The function below ships each sentence to ElevenLabs, downsamples PCM 24kHz to mulaw 8kHz, and base64-encodes into Twilio Media Stream media messages. The mark event lets you track when each sentence finishes playback - useful for interruption.
// src/tts.ts
import { ElevenLabsClient } from "elevenlabs";
import { downsample24to8, linear16ToMulaw } from "./audio.js";
const eleven = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY! });
const VOICE_ID = "21m00Tcm4TlvDq8ikWAM";
// Caution: these two are module-level, so every call handled by this process shares them.
// Move them into per-call state before taking concurrent calls.
let currentMarkId = 0;
let cancelled = false;
export async function speak(ws: any, streamSid: string, text: string) {
  cancelled = false;
  const markId = `m-${++currentMarkId}`;
  const stream = await eleven.textToSpeech.convertAsStream(VOICE_ID, {
    text,
    model_id: "eleven_turbo_v2_5",
    output_format: "pcm_24000",
    optimize_streaming_latency: 3,
  });
  for await (const chunk of stream) {
    // stopSpeaking() flips this flag; stop forwarding audio immediately.
    if (cancelled) return;
    const pcm8k = downsample24to8(chunk);
    const mulaw = linear16ToMulaw(pcm8k);
    ws.send(JSON.stringify({
      event: "media",
      streamSid,
      media: { payload: mulaw.toString("base64") },
    }));
  }
  // Twilio echoes this mark back once the audio queued before it has played.
  ws.send(JSON.stringify({
    event: "mark",
    streamSid,
    mark: { name: markId },
  }));
}
export async function stopSpeaking(ws: any, streamSid: string) {
  cancelled = true;
  // "clear" drops audio Twilio has buffered but not yet played.
  ws.send(JSON.stringify({ event: "clear", streamSid }));
}
Interruption handling
The single most jarring failure mode of a voice agent is talking over the caller. Humans take turns. If the caller starts speaking while the agent is mid-sentence, the agent must stop talking inside ~150ms and listen. The mechanic is straightforward: Deepgram's VAD events fire on speech onset, and the agent watches those events while TTS is playing. When one fires, do three things in this order:
- Stop TTS streaming locally. Set the cancelled flag so the active speak() loop exits on the next iteration.
- Clear Twilio's playback buffer. Send the clear Media Stream event, which drops any audio Twilio has buffered but not played.
- Abort the in-flight LLM stream. The model is probably mid-generation. Cancel the request and discard the partial response from history.
Skipping the third step is the most common bug. If you keep the partial assistant turn in history, the next turn references a sentence the user never heard, and the agent talks past itself. Drop partial turns the moment an interruption fires.
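One way to make the third step hard to forget is to give every agent turn a token that an interruption invalidates. The sketch below is an assumption about how you might structure it, not code from the agent above: TurnController and its methods are hypothetical names. The signal it hands out is a standard AbortSignal, which the AI SDK's streamText accepts through its abortSignal option.

```typescript
// Hypothetical per-call turn tracker. An interruption invalidates the current
// turn so late LLM tokens and TTS sentences from it can be ignored.
class TurnController {
  private current = 0;
  private abort: AbortController | null = null;

  // Start a new agent turn. Pass `signal` to the LLM request.
  begin(): { id: number; signal: AbortSignal } {
    this.interrupt(); // a new turn always supersedes the previous one
    this.abort = new AbortController();
    this.current += 1;
    return { id: this.current, signal: this.abort.signal };
  }

  // Call on speech onset while the agent is talking.
  interrupt(): void {
    if (this.abort) this.abort.abort(); // cancels the in-flight LLM request
    this.abort = null;
    this.current += 1; // any id handed out earlier is now stale
  }

  // Check before each speak() call and before writing the turn to history.
  isLive(id: number): boolean {
    return this.abort !== null && id === this.current;
  }
}

const turns = new TurnController();
const firstTurn = turns.begin();
turns.interrupt(); // caller barged in
console.log(turns.isLive(firstTurn.id), firstTurn.signal.aborted);
```

In the sentence loop you would check isLive(turn.id) before every speak() call and again before pushing the assistant message to history, so a cancelled turn never reaches either.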
Conversation state - keep it simple
Resist the urge to add RAG architecture to your voice agent on day one. The conversation state for a call is short, and long context hurts latency. Hold the last 20 turns in memory. For longer context (caller history, past appointments, account data), fetch it once at call start, summarize it into a single system prompt block, and inject it before the conversation history.
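As a rough illustration, the windowing and the one-time context injection fit in two small functions. Both names are hypothetical, and the sketch treats each message as one turn. It also assumes the window should start on a user message, since some chat APIs expect the first message to come from the user.

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Hypothetical helper: keep only the most recent `maxTurns` messages and make
// sure the window does not open with an orphaned assistant message.
function trimHistory(history: Turn[], maxTurns = 20): Turn[] {
  const recent = history.slice(-maxTurns);
  while (recent.length > 0 && recent[0].role !== "user") recent.shift();
  return recent;
}

// Hypothetical helper: fold the summary fetched at call start into the system prompt.
function buildSystemPrompt(base: string, callerSummary?: string): string {
  return callerSummary ? `${base}\n\nCaller context:\n${callerSummary}` : base;
}

// 25 alternating messages, starting with the user.
const longHistory: Turn[] = Array.from({ length: 25 }, (_, i) => ({
  role: i % 2 === 0 ? ("user" as const) : ("assistant" as const),
  content: `m${i}`,
}));
const windowed = trimHistory(longHistory);
console.log(windowed.length, windowed[0].content);
```

Trim on a copy right before each LLM call and leave the stored history intact, so the full transcript is still available for logging after the call.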
Tool result memory matters too. When a tool returns a list (available slots, order line items, FAQ entries), keep the full result in memory keyed by the tool call ID and reference it on follow-up turns. Re-querying the same tool every turn doubles your latency budget and makes the agent feel forgetful.
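A per-call store for tool results can be as small as a Map. The sketch below is hypothetical (ToolResultMemory and its methods are not part of any SDK) and assumes tool arguments are plain JSON-serializable objects.

```typescript
// Hypothetical per-call store for tool results, keyed by tool call ID.
class ToolResultMemory {
  private results = new Map<string, { tool: string; args: string; result: unknown }>();

  remember(callId: string, tool: string, args: unknown, result: unknown): void {
    this.results.set(callId, { tool, args: JSON.stringify(args), result });
  }

  // Look up by the ID the model used for the original call.
  get(callId: string): unknown {
    const entry = this.results.get(callId);
    return entry ? entry.result : undefined;
  }

  // Reuse an earlier result when the same tool is asked the same question.
  findSame(tool: string, args: unknown): unknown {
    const key = JSON.stringify(args);
    const hit = Array.from(this.results.values()).find(
      (entry) => entry.tool === tool && entry.args === key,
    );
    return hit ? hit.result : undefined;
  }
}

const memory = new ToolResultMemory();
memory.remember("call_1", "lookupAvailability", { dayPreference: "morning" }, { slots: ["09:00", "10:30"] });
console.log(memory.findSame("lookupAvailability", { dayPreference: "morning" }));
```

Results such as open slots go stale, so always re-check availability at the moment of booking rather than trusting the cached list.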
Warm transfer to human
Every production agent needs a graceful escape hatch. The pattern is a warm transfer: the agent recognizes it cannot help, summarizes the situation, and bridges a human into the call. Twilio's <Dial> verb is the mechanic. The trigger lives in the agent's transferToHuman tool - when the LLM calls it, you update the live call with new TwiML that dials the human.
// src/transfer.ts
import twilio from "twilio";
const client = twilio(
  process.env.TWILIO_ACCOUNT_SID!,
  process.env.TWILIO_AUTH_TOKEN!
);
export async function transferCall(streamSid: string, reason: string) {
  // Caution: this picks one in-progress call on the account, which is only correct
  // while a single call is active. With concurrent calls, use the CallSid that Twilio
  // sends in the Media Stream "start" message (msg.start.callSid) for this stream.
  const call = await client.calls.list({ status: "in-progress", limit: 1 });
  if (call.length === 0) return { transferred: false, reason };
  const callSid = call[0].sid;
  const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say voice="Polly.Joanna">Transferring you to the front desk now.</Say>
  <Dial answerOnBridge="true">
    <Number>+15555550100</Number>
  </Dial>
</Response>`;
  // Replacing the live call's TwiML ends the <Connect><Stream> and starts the dial.
  await client.calls(callSid).update({ twiml });
  return { transferred: true, reason };
}
The escalation prompt in the system message matters as much as the tool. Be explicit about when to escalate: any billing dispute, any request to speak to a human, any topic outside the agent's capability list, any two consecutive turns where the caller seems frustrated. Erring toward escalation is always cheaper than a bad call that ends with a refund request.
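Because transferCall receives a streamSid but Twilio's call update needs a CallSid, something has to map one to the other. Twilio's Media Stream start message carries both, so one option is a small registry filled when the stream starts. This is a minimal sketch: the function names are hypothetical and the SIDs in the demo are placeholders, not real identifiers.

```typescript
// Hypothetical registry mapping each Media Stream to the call it belongs to.
const callSidByStream = new Map<string, string>();

// Call from the "start" event handler with msg.start.
function registerStream(start: { streamSid: string; callSid: string }): void {
  callSidByStream.set(start.streamSid, start.callSid);
}

// Call from the "stop" event handler so the map does not grow without bound.
function releaseStream(streamSid: string): void {
  callSidByStream.delete(streamSid);
}

// Use inside transferCall instead of listing calls on the account.
function callSidFor(streamSid: string): string {
  const callSid = callSidByStream.get(streamSid);
  if (!callSid) throw new Error(`No call registered for stream ${streamSid}`);
  return callSid;
}

// Placeholder SIDs for illustration only.
registerStream({ streamSid: "MZ-example-1", callSid: "CA-example-1" });
registerStream({ streamSid: "MZ-example-2", callSid: "CA-example-2" });
console.log(callSidFor("MZ-example-2"));
```

With two calls in flight, each transfer now targets its own call instead of whichever call the account listing happens to return first.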
Cost per call
A fully loaded 3-minute call on this stack runs $0.40 to $1.20. The breakdown is predictable enough to model precisely - the only variable that swings wildly is TTS voice quality. Premium voice clones cost 3 to 10x the standard ElevenLabs voices.
| Component | Unit price | 3-min call cost | Notes |
|---|---|---|---|
| Twilio inbound voice (US) | $0.0085 per min | $0.026 | + $1.15/mo per number |
| Twilio Media Streams | $0.004 per min | $0.012 | Bidirectional audio |
| Deepgram Nova-3 streaming | $0.0043 per min | $0.013 | ~50% utilization (only caller speaks) |
| Claude Sonnet 4.6 | $3 / M in, $15 / M out | $0.02 to $0.08 | ~6K input, ~1K output across 3 min |
| ElevenLabs Turbo v2.5 | $0.10 to $0.30 per 1K chars | $0.30 to $1.00 | ~3K chars TTS in a 3-min call |
| Total per 3-min call | - | $0.40 to $1.20 | TTS dominates; voice cloning pushes upper bound |
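
The table can be turned into a small cost model so you can plug in your own call profile. The unit prices below are copied from the table; the function and type names are hypothetical, and the token and character counts are the table's rough estimates for a 3-minute call.

```typescript
// Unit prices copied from the table above (USD).
const PRICES = {
  twilioVoicePerMin: 0.0085,
  mediaStreamsPerMin: 0.004,
  deepgramPerMin: 0.0043,
  llmInputPerMTok: 3,
  llmOutputPerMTok: 15,
};

type CallProfile = {
  minutes: number;
  inputTokens: number;
  outputTokens: number;
  ttsChars: number;
  ttsPricePer1kChars: number; // $0.10 to $0.30 in the table
};

function costPerCall(p: CallProfile) {
  const telephony = p.minutes * (PRICES.twilioVoicePerMin + PRICES.mediaStreamsPerMin);
  const stt = p.minutes * PRICES.deepgramPerMin;
  const llm =
    (p.inputTokens * PRICES.llmInputPerMTok + p.outputTokens * PRICES.llmOutputPerMTok) / 1e6;
  const tts = (p.ttsChars / 1000) * p.ttsPricePer1kChars;
  return { telephony, stt, llm, tts, total: telephony + stt + llm + tts };
}

const threeMinCall = { minutes: 3, inputTokens: 6000, outputTokens: 1000, ttsChars: 3000 };
const lowEstimate = costPerCall({ ...threeMinCall, ttsPricePer1kChars: 0.1 });
const highEstimate = costPerCall({ ...threeMinCall, ttsPricePer1kChars: 0.3 });
console.log(lowEstimate.total.toFixed(2), highEstimate.total.toFixed(2));
```

With these inputs the model lands at about $0.38 and $0.98, so the table's $0.40 to $1.20 range reads as a rounded figure with headroom for premium voices and longer LLM contexts. In both cases TTS is more than three quarters of the total.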
For high-volume use cases - outbound campaigns at 50K+ minutes per month - the LLM token bill becomes the dominant variable. Caching the system prompt (Anthropic prompt caching, similar trick for OpenAI) cuts input cost by roughly 90%. My OpenAI API cost breakdown covers the caching tricks across providers - they apply directly to voice workloads.
Production gotchas
The five issues below are the ones that always bite on first ship. None of them appear in tutorials, all of them ruin demos in front of real customers.
- Bad mobile networks degrade STT. Cell handoffs and weak signal inject 80ms of jitter and 2 to 5% packet loss. Deepgram is robust to this but you should still monitor word-error rate per call and route long-tail calls to a human.
- Accents and code-switching. Default Nova-3 is tuned for North American English. For Indian English, Australian English, or Spanish-English code-switching, switch the model variant and rerun your eval set. Accuracy gaps of 10 percentage points are common.
- Voicemail detection on outbound. Twilio AMD adds 2 to 4 seconds of latency on call answer. Without it, you waste minutes talking to voicemail. With it, you must handle the small window where AMD misclassifies a human as a machine - usually by speaking a neutral opener and letting the caller respond before committing to the campaign script.
- Background noise and music on hold. Restaurants, construction sites, and cars trigger constant VAD false positives. Raise Deepgram's VAD threshold for outbound campaigns and add a silence-detection timeout (the agent gives up after 6 to 8 seconds of unintelligible audio).
- Regulatory plumbing. TCPA in the US (express written consent for AI-generated voice calls to mobiles), GDPR in the EU (lawful basis, data retention, right to erasure on transcripts), two-party consent recording laws in California / Florida / Illinois, and the EU AI Act's AI-system-disclosure requirements. Build opt-out into the first turn ("press 9 to speak with a person"), retain transcripts no longer than you need, and honor DNC lists.
Build vs buy
The SaaS voice agent space is now mature. Vapi, Retell, Bland, Synthflow, and ElevenLabs Conversational all let you ship a working agent in a day. The build-vs-buy question turns on three variables: call volume, customization needs, and how much of the call you need to own.
| Path | Time to ship | Cost | Best for |
|---|---|---|---|
| Vapi | 1 day | $0.05/min platform + provider costs | Fastest path to working agent, good defaults |
| Retell | 1 day | $0.07/min all-in (US) | Phone-first SaaS, strong call analytics |
| Bland | Hours | $0.09/min | Outbound campaigns, ready-made workflows |
| Synthflow | Hours (no-code) | $0.13/min | Non-technical teams, drag-and-drop flows |
| Custom (this post) | 2 to 4 weeks | $0.40 to $1.20 per 3-min call | Voice cloning, deep integration, high volume |
The crossover where custom wins economically is around 20,000 to 50,000 minutes per month - the SaaS per-minute markup at that volume exceeds the engineering and ops cost of owning the stack. Below that, the SaaS path almost always wins. Above 100K minutes, custom wins decisively. The exception is anything that needs a specific cloned voice or deep integration into existing infrastructure (a custom CRM, an EHR, a proprietary scheduling system) - those land in custom from day one.
The privacy and compliance angle
Voice agents touch personally identifiable information by default - names, phone numbers, often birthdates and account numbers. The compliance surface is real and most teams treat it as an afterthought. Three things to bake in from day one:
- Recording consent and disclosure. Open every call with a single-sentence disclosure ("You are speaking with an AI assistant. This call is recorded."). In two-party consent jurisdictions (CA, FL, IL, MA, MD, MT, NH, PA, WA in the US), this is required for recording. The EU AI Act requires AI-system disclosure regardless. Skip recording entirely for any caller who declines.
- PII redaction in transcripts. Anything stored long-term (transcripts for QA, training data) should have credit card numbers, SSNs, and account numbers redacted at ingest. Deepgram offers a built-in PII redaction layer; bolt on a regex pass for any custom identifiers your domain has.
- Encryption at rest and short retention. Store transcripts encrypted (AES-256 at minimum, your cloud provider handles this if you enable it). Set a retention policy that matches your stated privacy policy - 30 to 90 days is typical, and shorter is better for both compliance and storage cost. Honor erasure requests within the one month GDPR allows for a response.
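The regex pass mentioned under PII redaction might look like the sketch below. It is an illustration under assumptions, not a complete redactor: it assumes numbers appear as digits in the transcript (the Deepgram config earlier enables smart_format), it only covers US-style SSNs and card numbers, and the ACCT- pattern is a made-up stand-in for whatever custom identifier your domain uses.

```typescript
// Luhn checksum: filters out digit runs that merely look like card numbers.
function passesLuhn(digits: string): boolean {
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

function redactTranscript(text: string): string {
  return text
    // US SSN written as 123-45-6789
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]")
    // 13 to 19 digits, optionally separated by single spaces or dashes
    .replace(/\b\d(?:[ -]?\d){12,18}\b/g, (match) =>
      passesLuhn(match.replace(/[ -]/g, "")) ? "[CARD]" : match,
    )
    // Hypothetical custom identifier: "ACCT-" followed by 8 digits
    .replace(/\bACCT-\d{8}\b/g, "[ACCOUNT]");
}

console.log(
  redactTranscript("My card is 4242 4242 4242 4242 and SSN 123-45-6789, account ACCT-12345678."),
);
```

Run a pass like this at ingest, before the transcript is written anywhere durable, and treat it as a backstop behind the provider's redaction instead of a replacement for it.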
For higher-stakes domains (healthcare, finance, anything regulated), layer in human-in-the-loop review for any action with monetary or medical consequence. The agent proposes; a human approves before the booking is confirmed or the prescription is renewed. Costs a few seconds of latency, saves a compliance disaster.
If you are scoping a voice agent build and want a senior engineer who has actually shipped one, my AI agent development practice covers exactly this scope, and AI integration when the agent needs to wire deep into existing systems. I work with teams worldwide and you can also hire an AI developer in Kosovo directly. Same person who built Caldra AI and the outbound voice stack behind VC Automation.
Frequently asked questions
What is an AI voice agent?
An AI voice agent is a system that answers or places phone calls, listens to a caller in real time, generates a response with a language model, and speaks the response back through a synthesized voice. In 2026 the bar is sub-1-second response latency, mid-sentence interruption handling, and tool calls that can book meetings, look up orders, or hand off to a human. The four pieces under the hood are telephony, speech-to-text, an LLM, and text-to-speech, wired together over a streaming WebSocket so audio flows in both directions continuously.
How much does an AI voice agent cost per call?
A 3-minute call on a custom stack runs roughly $0.40 to $1.20 fully loaded: about $0.04 in telephony, $0.01 in speech-to-text, $0.02 to $0.08 in LLM tokens, and $0.30 to $1.00 in text-to-speech depending on the voice. Voice cloning and premium ElevenLabs voices push TTS to the top of that range. SaaS platforms like Vapi or Retell charge $0.05 to $0.15 per minute on top of the underlying provider costs, which adds $0.15 to $0.45 to a 3-minute call and puts it at roughly $0.55 to $1.65.
Should I build a voice agent or use Vapi, Retell, or Bland?
Use a SaaS like Vapi, Retell, or Bland when you want a working agent in a day, do not need a custom cloned voice, and your call volume is under roughly 10,000 minutes per month. Build custom when you need a specific voice clone, deep integration into existing infrastructure, regulated data residency, or your monthly volume is high enough that the SaaS per-minute markup exceeds the engineering cost of owning the stack. The crossover is usually between 20K and 50K minutes per month.
What latency do I need to hit for a voice agent to feel natural?
End-to-end response latency under 800ms feels conversational. Between 800ms and 1.2s feels slightly off but still acceptable. Past 1.5s callers think the line dropped and start talking over the agent. The budget breakdown that hits sub-800ms: 100 to 250ms for streaming speech-to-text, 200 to 500ms for time-to-first-token from the LLM, and 200 to 400ms for first-audio from TTS. Streaming every stage in parallel is mandatory.
Do I need to disclose that a caller is talking to an AI?
In many jurisdictions, yes. California (SB 1001) requires disclosure for commercial bots in some contexts. The FCC has ruled AI-generated voices in robocalls are covered by TCPA. The EU AI Act requires that callers know they are interacting with an AI system. The practical rule is to disclose at the start of every call (a single sentence is enough) and to obtain explicit consent before recording. Cheaper than a regulatory complaint.
Can a voice agent transfer the call to a human?
Yes, and it should. The standard pattern is a warm transfer: the agent recognizes it cannot help, summarizes the conversation, uses Twilio's Dial verb to bridge a human, and optionally stays on the line until the human picks up. The agent prompt should include an explicit escalation tool that triggers when the caller asks for a person, when the LLM's confidence drops, or when the conversation hits a sensitive topic (billing disputes, cancellations, medical or legal matters).
How do I handle interruptions when the agent is speaking?
Run voice activity detection (VAD) on the inbound audio stream and watch for speech while you are mid-TTS playback. When speech is detected, immediately stop sending TTS audio to Twilio, send a clear message so Twilio drops whatever it has queued but not yet played, and cancel the in-flight TTS and LLM generation. Twilio's mark messages tell you which sentences actually finished playing, and Deepgram emits speech-start events when vad_events is enabled. The user should never feel they have to wait for the agent to finish a sentence before they can speak.
What about voicemail detection and answering machines?
Twilio offers machine detection (AMD) that returns whether the call was answered by a human, a machine, or a fax. For outbound campaigns this is mandatory: hanging up on a voicemail without leaving a message is wasted spend, and leaving an awkward dead-air message hurts deliverability. The better pattern is to let AMD complete (it adds 2 to 4 seconds of latency on answer), then either start the agent for a human or play a prerecorded voicemail message for an answering machine.