If your voice agent sounds great but answers a beat too late, users notice immediately. The hard part isn't just making speech sound human; it's getting the audio to start fast, keep pace with streaming model output, and stay natural when interruptions happen mid-turn. I've spent years covering speech and voice tooling, and I dug through the current docs and product updates to compare the most relevant approaches to text to speech for voice agents.
In this guide, I'll show you where latency really comes from, how streaming changes the experience, and which platforms are strongest when you need production-ready turn-taking instead of demo-friendly synthesis. By the end, you'll know which TTS path best fits your agent's speed, quality, and deployment needs.
How I Evaluated TTS For Voice Agents
I looked at TTS the way a user hears it: how long it takes before the first sound comes out, whether the system keeps up when text arrives in pieces, and how often the audio gets clipped, buffered, or forced into awkward pauses. That meant paying attention to four things: time to first audio, streaming behavior, voice quality, and control over interruptions and playback.
For a voice agent, those are the parts that matter in practice. A clean waveform is useful, but a fast one is usually more useful. VoiceRun's latency docs break this down in a way that matches what I see when testing conversational systems: the user feels end-to-end turn-taking, not a vendor's raw synth benchmark.[1]
I also checked whether each option is built for incremental output or still assumes full sentences. That difference shows up fast in real calls. If the LLM streams tokens and the TTS layer waits for a complete paragraph, the agent feels late no matter how good the voice is.
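To make the incremental-output point concrete, here is a minimal sketch of a chunker that sits between a streaming LLM and a TTS call. It flushes text at sentence boundaries instead of waiting for the whole reply. The function name `speakable_chunks` and the `min_chars` threshold are my own illustration, not any vendor's API, and the boundary rule is deliberately naive.

```python
import re
from typing import Iterable, Iterator, Optional

# Sentence-ending punctuation followed by whitespace marks a safe flush point.
_BOUNDARY = re.compile(r"([.!?])\s+")


def _first_boundary(buffer: str, min_chars: int) -> Optional[re.Match]:
    """Return the first sentence boundary that leaves a chunk of at least min_chars."""
    for match in _BOUNDARY.finditer(buffer):
        if match.end(1) >= min_chars:
            return match
    return None


def speakable_chunks(tokens: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield sentence-sized chunks as soon as they are complete (hypothetical helper)."""
    buffer = ""
    for token in tokens:
        buffer += token
        match = _first_boundary(buffer, min_chars)
        while match is not None:
            yield buffer[:match.end(1)].strip()
            buffer = buffer[match.end():]
            match = _first_boundary(buffer, min_chars)
    # Flush whatever is left when the model stops generating.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded chunk can be handed to synthesis while the model is still generating the next sentence. A production version would need to handle abbreviations such as "Dr." and decimal numbers, which this rule would split incorrectly.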
What To Look For In A Voice-Agent TTS Stack
A useful TTS stack for voice agents usually has five traits.
- It starts audio quickly, ideally before the model has finished the whole response.
- It handles partial text without forcing a restart.
- It gives enough control for pacing, emphasis, and punctuation.
- It supports interruption cleanly, so users can cut in.
- It fits the rest of the call path, whether that is web audio, SIP, or phone infrastructure.
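The interruption trait in the list above is the one teams most often leave until late, so here is a rough sketch of barge-in handling with `asyncio`. Playback is simulated with a sleep per chunk, and `speak_with_barge_in` is a hypothetical name; a real agent would cancel an audio stream and usually tell the TTS provider to stop generating as well.

```python
import asyncio
import contextlib
from typing import List


async def play_audio(chunks: List[str], played: List[str], chunk_seconds: float = 0.01) -> None:
    """Stand-in for audio playback: 'plays' one chunk per chunk_seconds."""
    for chunk in chunks:
        await asyncio.sleep(chunk_seconds)
        played.append(chunk)


async def speak_with_barge_in(chunks: List[str], interrupt: asyncio.Event) -> List[str]:
    """Play chunks until done or until the interrupt event fires; return what was played."""
    played: List[str] = []
    playback = asyncio.create_task(play_audio(chunks, played))
    barge_in = asyncio.create_task(interrupt.wait())
    await asyncio.wait({playback, barge_in}, return_when=asyncio.FIRST_COMPLETED)
    # Whichever task is still pending is no longer needed, so cancel it cleanly.
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await task
    return played
```

The returned list matters in practice: knowing which chunks the user actually heard lets the agent trim its conversation history to match, instead of assuming the full reply was delivered.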
The most common mistake is choosing a provider for voice quality alone. Naturalness matters, but only after the response feels timely. Google's streaming docs are blunt about the use case: voice assistants and interactive systems need synthesis that begins before the full input is sent.[2]
The other thing I check is how much orchestration the vendor leaves to the developer. Some platforms give you a single synth endpoint and nothing else. Others, like VoiceRun, build TTS into an event-driven agent loop and expose the timing metrics around it, which makes debugging easier when a turn feels slow.[3]
VoiceRun
VoiceRun is code-first, modular infrastructure for building production voice agents, and its TTS layer sits inside the agent loop rather than off to the side. In the docs, an agent handler emits a TextToSpeechEvent, and the platform handles the audio generation as part of the turn. That is a practical shape for voice work because it keeps speech output tied to the rest of the conversation state.
The part I keep coming back to is its latency tooling. VoiceRun exposes per-turn metrics such as Time to First Speech Event and Time to First Audio, which is the right level of detail when a call feels slow but the issue is not obvious. It also recommends streaming partial LLM output, which is the same pattern I use when I want a response to start before the whole answer is finished.
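If your stack does not expose metrics like these, you can approximate them yourself. The sketch below records the first time each stage of a turn happens and reports the gap between any two. The class `TurnTimer` and the stage labels are my own hypothetical names, not VoiceRun's API.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class TurnTimer:
    """Records when each stage of one agent turn first happens (illustrative only)."""
    clock: Callable[[], float] = time.monotonic
    marks: Dict[str, float] = field(default_factory=dict)

    def mark(self, stage: str) -> None:
        # Keep only the first occurrence; later audio chunks must not move the mark.
        if stage not in self.marks:
            self.marks[stage] = self.clock()

    def elapsed_ms(self, start: str, end: str) -> Optional[float]:
        """Milliseconds between two recorded stages, or None if either is missing."""
        if start not in self.marks or end not in self.marks:
            return None
        return (self.marks[end] - self.marks[start]) * 1000.0
```

A typical use is to call `mark("user_stopped")` when end of speech is detected, `mark("first_speech_event")` when the handler emits its first text for synthesis, and `mark("first_audio")` when the first audio bytes arrive. The difference between the last two isolates the TTS share of the delay.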
Pros
- Built around voice-agent flow instead of standalone synth calls.
- Exposes timing metrics that help isolate where latency appears.
- Orchestrates 13 TTS providers — including ElevenLabs, Inworld, and Cartesia from this guide — behind one integration, so a provider swap is a configuration change, not a code change.
- Includes both connectors — Direct WebSocket and SIP Trunking — with the Audio Runtime, so audio delivery is not limited to browser playback.
Cons
- It is more of an orchestration layer than a pure TTS vendor.
- Teams that only want a single synth API may find the platform broader than needed.
- Full value depends on wiring the agent loop correctly, which takes more setup than a simple text-to-speech endpoint.
Pricing is public and itemized rather than bundled. VoiceRun charges per platform layer, pay as you go: Audio Runtime is $0.015 per minute (both connectors included) and Agent Runtime is another $0.015 per minute, so the full platform runs $0.030 per minute. STT, TTS, and LLM costs are not rolled into that rate — they pass through at published provider rates with an explicit surcharge when VoiceRun manages the keys, or you bring your own keys and pay providers directly. Volume discounts run down to a $0.015 per minute full-platform floor, and enterprise customers get a managed service at an all-in $0.05–$0.07 per minute on annual commits, with dedicated deployments and in-VPC options.[4]
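Because the pricing is itemized, a monthly estimate is simple arithmetic. This is a small sketch using the two per-minute rates quoted above; the `provider_cost_per_min` argument is a placeholder for whatever your STT, TTS, and LLM pass-through adds up to, and it ignores volume discounts and surcharges.

```python
AUDIO_RUNTIME_PER_MIN = 0.015  # USD, rate quoted in the pricing paragraph
AGENT_RUNTIME_PER_MIN = 0.015  # USD, rate quoted in the pricing paragraph


def monthly_platform_cost(minutes: float, provider_cost_per_min: float = 0.0) -> float:
    """Estimate monthly spend in USD: platform layers plus assumed provider pass-through."""
    platform = (AUDIO_RUNTIME_PER_MIN + AGENT_RUNTIME_PER_MIN) * minutes
    providers = provider_cost_per_min * minutes
    return round(platform + providers, 2)
```

At 10,000 minutes a month, the platform layers alone come to $300 before any provider costs.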
ElevenLabs
ElevenLabs is a good fit when the voice itself matters and the agent still has to keep pace with live conversation. As of mid-2026, its lineup splits into a quality flagship and a real-time tier. Eleven v3, generally available since February 2026, is the most expressive model, with 70+ languages and inline audio tags like [whispers] and [laughs] — but the docs are explicit that v3 is not for real-time conversational use, and a real-time version is still in development.[5][6] For voice agents, ElevenLabs recommends Eleven Flash v2.5, which targets about 75 ms of model latency across 32 languages, with Multilingual v2 as the higher-quality, higher-latency alternative and Turbo v2.5 kept around as a functionally equivalent but slower sibling of Flash.[5] That model split is useful because voice-agent teams usually need to choose between speed and expression, not pretend both come free.
The streaming setup is straightforward. ElevenLabs supports HTTP streaming so clients can play audio as it is generated, plus a WebSocket stream-input endpoint built for chunked or LLM-generated text with word-to-audio alignment.[7] In practice, that means the first audio can arrive before the response is complete, which is what you want for a turn-taking agent. One caveat: the WebSocket path buffers incoming text in configurable chunks, trading a little latency for prosody consistency, so when the full text is available upfront, plain HTTP streaming can actually be faster.[7] Note too that the 75 ms figure covers the model only and excludes application and network latency.[5]
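The buffering tradeoff is easier to see in code. The sketch below imitates the general idea of threshold-based text buffering: hold incoming text until the buffer reaches a size from a schedule, then release it for synthesis. It is a simulation only. The function name and the schedule values are invented for illustration and are not ElevenLabs defaults.

```python
from typing import Iterable, Iterator, Sequence


def buffered_flush(pieces: Iterable[str], schedule: Sequence[int] = (50, 80, 120)) -> Iterator[str]:
    """Release buffered text each time it reaches the next threshold in the schedule."""
    buffer = ""
    step = 0
    for piece in pieces:
        buffer += piece
        # After the schedule runs out, keep using its last threshold.
        threshold = schedule[min(step, len(schedule) - 1)]
        if len(buffer) >= threshold:
            yield buffer
            buffer = ""
            step += 1
    # Flush the remainder when the text stream ends.
    if buffer:
        yield buffer
```

A small first threshold gets audio started sooner, while larger later thresholds give the model more context for prosody. That is why sending the full text over plain HTTP streaming can win when nothing needs to be buffered at all.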
On cost, developer API rates run about $0.05 per 1,000 characters for Flash and Turbo and $0.10 per 1,000 for Multilingual v2 and Eleven v3.[8] If you build on its Agents Platform instead, conversation minutes bill at roughly $0.08 per minute with per-tier concurrency caps from 4 to 40 concurrent calls, and LLM and telephony costs pass through separately.[9]
Pros
- Fast streaming model options, with Flash v2.5 at roughly 75 ms model latency.
- Clear separation between speed-focused and quality-focused models.
- HTTP and WebSocket streaming paths suited to real-time playback and LLM-chunked input.
- Strong documentation around streaming behavior.
Cons
- The best voice quality and the lowest latency are not the same model: v3's expressiveness and audio tags are unavailable at real-time latency, and Flash is noticeably less expressive.
- Flash v2.5 does not normalize numbers by default, so phone numbers and dates can be misread without preprocessing.
- Character limits vary by model (5,000 on v3, 10,000 on Multilingual v2, 40,000 on Flash v2.5), so long responses may need splitting, and concurrency caps constrain call volume below Enterprise plans.
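For the number-normalization gap noted in the list above, a small preprocessing pass before synthesis goes a long way. Here is a minimal sketch that spells out North American style phone numbers digit by digit. The pattern and function names are my own, and it covers only this one format, not dates or currency.

```python
import re

_DIGIT_WORDS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]

# Matches numbers such as 415-555-0132, 415.555.0132, or (415) 555-0132.
_PHONE = re.compile(r"\(?\b(\d{3})\)?[-.\s](\d{3})[-.\s](\d{4})\b")


def _spell(digits: str) -> str:
    """Turn '415' into 'four one five'."""
    return " ".join(_DIGIT_WORDS[int(d)] for d in digits)


def normalize_phone_numbers(text: str) -> str:
    """Replace each phone number with spoken digits, grouped by commas for pacing."""
    return _PHONE.sub(lambda m: ", ".join(_spell(g) for g in m.groups()), text)
```

For example, `normalize_phone_numbers("Call 415-555-0132 today.")` returns `"Call four one five, five five five, zero one three two today."` The commas nudge the voice into pausing between groups, which is how people read phone numbers aloud.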
Inworld
Inworld has turned itself into the price-performance story in realtime TTS, and for high-volume voice agents that is often the deciding factor. The current lineup is Realtime TTS 1.5 Mini and Max, released in January 2026, plus Realtime TTS-2, a May 2026 research preview that adds natural-language steering, 200+ languages and locales, and a “closed-loop” trick: it conditions on the actual audio of prior conversation turns, not just the text.[10][11] The vendor-reported numbers are quick: roughly 120 ms median time-to-first-audio for Mini and under 250 ms at P90 for Max — though Inworld is upfront that those figures cover the model layer only and exclude network round-trip.[10]
The realtime plumbing is in place. There are three delivery modes — plain REST, HTTP chunked streaming, and a WebSocket API that Inworld documents as the lowest-latency option — and JWT auth lets you stream audio directly to client devices without proxying through your own server.[12] Instant voice cloning from 5–15 seconds of audio is free on every tier, and there are ready-made Pipecat, LiveKit, and Twilio integrations, so it drops into existing agent stacks without much glue.[15]
Pricing is where it gets interesting. After a more-than-50-percent across-the-board price cut announced in June 2026, on-demand rates run $15 per 1M characters for Mini up to $35 for Max, with TTS-2 at $25 — and those fall through subscription tiers to roughly $7–17.50 per 1M, reaching about $5 per 1M only at enterprise and at-scale rates.[13][14] The much-quoted $5-per-million figure is real, but it is a committed-spend floor, not the list price.
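Per-character prices are hard to compare with per-minute platform rates, so I convert them. The sketch below does that under one loudly stated assumption: about 1,000 characters per minute of speech, which is a rough figure of mine that varies with language, voice, and speaking rate.

```python
def per_minute_cost(price_per_million_chars: float, chars_per_minute: float = 1000.0) -> float:
    """Convert a USD price per 1M characters into an approximate USD cost per spoken minute.

    chars_per_minute is an assumption; measure your own agent's output to refine it.
    """
    return price_per_million_chars / 1_000_000 * chars_per_minute
```

Under that assumption, $5 per 1M characters is about $0.005 per minute, and the $15 and $35 on-demand rates come to about $0.015 and $0.035 per minute.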
Pros
- The strongest per-character economics at scale: roughly $5–12.50 per 1M characters with committed spend (about $0.005–0.0125 per minute, assuming roughly 1,000 characters per spoken minute).
- WebSocket streaming with word, character, phoneme, and viseme timestamps, plus JWT streaming straight to client devices.
- Free instant voice cloning from 5–15 seconds of audio on all tiers, including the free plan.
- First-party Pipecat, LiveKit, and Twilio integrations, plus its own Realtime API and STT if you want a fuller stack from one vendor.
Cons
- Realtime TTS-2 is still a research preview, and the production-grade 1.5 models cover only 15 languages — the 200+ figure is mostly experimental TTS-2 locales with reduced feature support.[16]
- Professional voice cloning is sales-gated rather than self-serve, and compliance features like zero data retention and HIPAA/BAA are paid add-ons starting at the $1,500/month Growth tier.[13]
- The cheapest rates require monthly committed spend, and the free tier caps at 5 concurrent generations with no audio downloads.[13]
Cartesia
Cartesia is the speed play. Its Sonic models are built on a state-space-model architecture rather than a transformer, and the company claims the lowest time-to-first-audio in the market. For the current flagship, Sonic 3.5 (released May 2026), Cartesia advertises sub-90 ms latency, citing 82 ms end-to-end time-to-first-audio and a #1 ranking on the Artificial Analysis leaderboard as of May 2026.[17][18] The honest footnote: one third-party production benchmark from May 2026 measured the previous model, Sonic-3, at about 188 ms P50 under production conditions. That is roughly double the headline claim, though it is not a like-for-like comparison with Sonic 3.5, and the source citing it is a competing TTS vendor.[22]
Even with that caveat, the API is clearly purpose-built for voice agents. The WebSocket endpoint supports context-based input streaming with a continue flag, designed for piping LLM tokens into synthesis as they generate, with configurable buffering and both word- and phoneme-level timestamps.[19] Output formats include 8 kHz mu-law and A-law, which makes the audio directly telephony-ready, and there is a first-party LiveKit Agents plugin plus Cartesia's own Line voice-agent platform. The other differentiator is deployment: cloud, self-hosted in your own VPC, or on-device at the edge, with SOC 2, HIPAA, GDPR, and PCI referenced — a meaningful option set for regulated environments.[21]
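To show what context-based input streaming looks like from the client side, here is a sketch that turns a stream of text chunks into JSON messages sharing one context, with the continue flag set on every message except the last. Treat the field names, the encoding value, and the placeholder IDs as assumptions modeled on the description above, and check them against the API reference before use.[19]

```python
import json
from typing import Any, Dict, Iterable, Iterator


def _message(text: str, context_id: str, model_id: str, voice_id: str, more: bool) -> str:
    """Build one request payload (field names are assumptions, not a verified schema)."""
    payload: Dict[str, Any] = {
        "context_id": context_id,
        "model_id": model_id,
        "transcript": text,
        "voice": {"mode": "id", "id": voice_id},
        # 8 kHz mu-law is the telephony-ready format mentioned above.
        "output_format": {"container": "raw", "encoding": "pcm_mulaw", "sample_rate": 8000},
        "continue": more,
    }
    return json.dumps(payload)


def continuation_messages(chunks: Iterable[str], context_id: str,
                          model_id: str = "MODEL_ID_PLACEHOLDER",
                          voice_id: str = "VOICE_ID_PLACEHOLDER") -> Iterator[str]:
    """Yield one message per chunk, holding one chunk back so the last can close the context."""
    iterator = iter(chunks)
    try:
        pending = next(iterator)
    except StopIteration:
        return
    for upcoming in iterator:
        yield _message(pending, context_id, model_id, voice_id, more=True)
        pending = upcoming
    yield _message(pending, context_id, model_id, voice_id, more=False)
```

Holding one chunk back costs a little latency on each message. A variant that sends every chunk immediately and then closes the context with a final empty message would avoid that, if the API accepts one.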
Pricing is credit-based, with one credit per character of synthesis: a free tier with about 27 minutes of TTS (no commercial use), then Pro at $5/month, Startup at $49/month (which unlocks professional voice cloning), and Scale at $299/month — working out to roughly $37–50 per 1M characters depending on tier.[20] Sonic 3.5 supports 42 languages natively, and instant cloning needs only about 10 seconds of audio.[18]
Pros
- The lowest claimed time-to-first-audio in the market, with a WebSocket API designed for streaming LLM output as it generates.
- Telephony-ready output formats plus word- and phoneme-level timestamps.
- Cloud, on-prem/VPC, and on-device deployment options — rare flexibility for regulated or latency-critical environments.
- First-party LiveKit plugin and its own Line platform for full agent builds.
Cons
- The one third-party production benchmark cited here (~188 ms P50, measured on the earlier Sonic-3) came in at roughly twice the vendor headline, though the citing source is a competitor.[22]
- Emotion control over the WebSocket API is a small fixed tag set, not free-form style prompting.
- Self-serve concurrency caps are low (2–15 concurrent requests depending on plan), which pushes production call volumes toward Enterprise, and the free tier has no commercial license.[20]
OpenAI Realtime API
OpenAI's Realtime API changes the shape of the stack by moving toward speech-to-speech rather than a classic STT-plus-text-plus-TTS chain. OpenAI says gpt-realtime is its most advanced speech-to-speech model and now supports tool calling, image input, remote MCP servers, and SIP calling. For voice agents that live on the phone or need tools in the loop, that reduces the number of moving parts.
The tradeoff is that this is no longer just a TTS choice. The pricing is based on audio input and output tokens, with gpt-realtime priced at $32 per 1M audio input tokens and $64 per 1M audio output tokens. That is useful context when comparing it with classic TTS stacks, because the cost model tracks the whole speech interaction rather than a single synth step.
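Token pricing is easiest to reason about with a quick calculation. The sketch below applies the two audio rates quoted above to token counts you supply. It deliberately does not guess how many tokens a minute of audio consumes, since that depends on the session, and it ignores text tokens and any cached-input discounts.

```python
AUDIO_INPUT_USD_PER_M = 32.0   # per 1M audio input tokens, as quoted above
AUDIO_OUTPUT_USD_PER_M = 64.0  # per 1M audio output tokens, as quoted above


def realtime_audio_cost(input_tokens: int, output_tokens: int) -> float:
    """Audio-token cost in USD for one session or one billing period."""
    return (input_tokens / 1_000_000 * AUDIO_INPUT_USD_PER_M
            + output_tokens / 1_000_000 * AUDIO_OUTPUT_USD_PER_M)
```

A workload with 100,000 audio input tokens and 50,000 audio output tokens costs $3.20 for each side, or $6.40 in total. Pull real token counts from your usage logs before comparing against a per-character TTS bill.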
Pros
- Removes some of the classic TTS pipeline overhead.
- Built for speech-first interactions and telephony.
- Useful when tool use and voice need to happen in one loop.
Cons
- Less of a drop-in TTS layer than the other options here.
- Pricing is structured around audio tokens, which makes direct comparisons harder.
- Teams that only need speech synthesis may be buying more than they need.
Google Cloud Text-to-Speech
Google Cloud's streaming TTS docs are clearly written for conversational use. The bidirectional streaming API lets synthesis begin before the full text is sent, which matters when the LLM is still producing words. Google also points to conversational agents as a target use case, and its newer voice tiers include Chirp 3 HD and Gemini-TTS options aimed at low-latency generation.
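The value of bidirectional streaming is that sending and receiving overlap. The sketch below simulates that shape with `asyncio` queues: one task sends text pieces as a slow LLM would, a fake synthesizer turns each into an audio placeholder, and a receiver collects audio concurrently. Nothing here calls Google's client library; every name is hypothetical.

```python
import asyncio
from typing import List


async def fake_synthesizer(text_in: asyncio.Queue, audio_out: asyncio.Queue) -> None:
    """Stand-in for the service: emits one audio placeholder per text piece."""
    while True:
        text = await text_in.get()
        if text is None:  # None marks the end of the text stream
            await audio_out.put(None)
            return
        await audio_out.put(f"<audio for {text!r}>")


async def run_turn(pieces: List[str], llm_gap_seconds: float = 0.01) -> List[str]:
    """Send text and receive audio at the same time; return the order of events."""
    text_in: asyncio.Queue = asyncio.Queue()
    audio_out: asyncio.Queue = asyncio.Queue()
    events: List[str] = []

    async def sender() -> None:
        for piece in pieces:
            await text_in.put(piece)
            events.append(f"sent:{piece}")
            await asyncio.sleep(llm_gap_seconds)  # simulated gap between LLM chunks
        await text_in.put(None)

    async def receiver() -> None:
        while (await audio_out.get()) is not None:
            events.append("audio")

    await asyncio.gather(sender(), receiver(), fake_synthesizer(text_in, audio_out))
    return events
```

In the resulting event log, the first `audio` entry appears before the last `sent:` entry, which is the whole point: playback can begin while the model is still writing.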
In the sections I reviewed, Google's setup felt closest to a conventional cloud TTS path that has been adapted for live agents. That makes it easier to fit into systems that already use Google Cloud for the rest of the stack. The main appeal is a familiar architecture paired with a streaming mode that supports incremental playback instead of batch synthesis.
Pros
- Bidirectional streaming is built into the offering.
- Clear positioning for voice assistants and conversational agents.
- Fits well for teams already in Google Cloud.
- Supports newer low-latency voice tiers.
Cons
- Teams still need to manage the surrounding agent logic.
- Voice selection and control may feel less tailored than more specialized voice-agent tools.
- Public docs focus more on capability than on end-user conversational design.
AWS Polly
AWS Polly has been around long enough that many teams treat it as the default TTS layer, but the 2026 bidirectional streaming update matters for voice agents. AWS says the new streaming path lets the app send text and receive audio over one connection, and its benchmark compared a sequential approach at 115 seconds with streaming at 70 seconds, a 39 percent reduction in total time. It also reduced 27 API calls to 1 in that test.
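The benchmark figures above are worth checking by hand, because "percent faster" can mean two different things. This small sketch computes both the share of time saved and the speedup factor from the two durations quoted.

```python
def time_reduction(before_seconds: float, after_seconds: float) -> float:
    """Fraction of total time saved, e.g. 0.39 means 39 percent less time."""
    return (before_seconds - after_seconds) / before_seconds


def speedup_factor(before_seconds: float, after_seconds: float) -> float:
    """How many times faster the new path is, e.g. 1.64 means 1.64x."""
    return before_seconds / after_seconds
```

For 115 seconds versus 70 seconds, the time saved is 45/115, or about 39 percent, and the speedup factor is 115/70, or about 1.64x. Both describe the same result.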
That kind of change matters more in a live call than it does in a demo. Polly also gives access to speech marks, which are useful when the UI needs word timing, subtitles, or avatar sync. The model is still a traditional TTS service at heart, but the newer streaming path makes it easier to use in a conversational loop.
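Speech marks arrive as one JSON object per line, each with a `time` offset in milliseconds, a `type`, and a `value`. Here is a minimal sketch that turns word marks into caption cues. The sample data is invented for illustration, and `final_end_ms` is a parameter I added because the last word has no following mark to end it.

```python
import json
from typing import List, Tuple

# Illustrative sample only; real marks come back from the synthesis request.
SAMPLE_MARKS = """\
{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}
{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}
{"time":373,"type":"word","start":5,"end":8,"value":"had"}
"""


def word_cues(speech_marks_jsonl: str, final_end_ms: int) -> List[Tuple[int, int, str]]:
    """Return (start_ms, end_ms, word) cues; each word ends where the next one starts."""
    marks = [json.loads(line) for line in speech_marks_jsonl.splitlines() if line.strip()]
    words = [m for m in marks if m.get("type") == "word"]
    cues = []
    for index, mark in enumerate(words):
        end = words[index + 1]["time"] if index + 1 < len(words) else final_end_ms
        cues.append((mark["time"], end, mark["value"]))
    return cues
```

The same cue list can drive subtitle highlighting or avatar lip timing, as long as the audio playback clock is the reference for both.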
Pros
- Mature TTS service with broad cloud familiarity.
- New bidirectional streaming option for lower-latency delivery.
- Speech marks help with captions and synchronized UI.
- Works well for teams already using AWS infrastructure.
Cons
- Standard synchronous requests still cap the input text length, so long responses need to be split across calls.
- It remains more of a synth service than a voice-agent system.
- Some voice teams may want more expressive controls than the base offering.
Azure AI Speech
Azure AI Speech is a strong fit for teams that want classic speech controls, enterprise integration, and custom voice options. Microsoft documents text to speech alongside SSML and custom voice support, which gives teams more control over phrasing and delivery than plain text synthesis alone. For agents that need to sound consistent across scripted prompts, verification flows, or support calls, that matters.
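As an illustration of that control, here is a sketch that wraps agent text in SSML with an optional leading pause and a speaking rate. The element names come from the W3C SSML standard. The default voice name is only an example and should be checked against the voices available to your account.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice_name: str = "en-US-JennyNeural", rate: str = "medium",
               pause_ms: int = 0, lang: str = "en-US") -> str:
    """Return an SSML document with escaped text, a prosody rate, and an optional pause."""
    pause = f'<break time="{int(pause_ms)}ms"/>' if pause_ms > 0 else ""
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice_name)}>'
        f'{pause}<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>'
        f'</voice></speak>'
    )
```

Escaping matters more than it looks. LLM output regularly contains `&` and `<`, and unescaped text produces invalid XML that a synthesis request will reject.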
Azure's broader speech platform also makes it easier to keep a voice workflow inside one vendor if speech is already part of the Microsoft stack. It is a straightforward option when the main need is managed synthesis with enterprise controls rather than a specialized voice-agent runtime.
Pros
- SSML support gives fine control over delivery.
- Custom voice options are available for teams that need brand-specific speech.
- Fits enterprise environments that already use Microsoft services.
- Broad speech platform around the TTS layer.
Cons
- More focused on speech infrastructure than on agent orchestration.
- Some teams will need to build streaming and turn-taking logic themselves.
- The docs emphasize capability, but the conversational workflow is still up to the developer.
How To Pick The Right Option For Your Agent
The simplest way to choose is to start with the interaction style.
If the agent is really a voice-first product and needs one continuous conversational loop, OpenAI Realtime is worth a close look because it reduces the number of separate speech steps. If the agent still follows a classic text-generation path and you want control over the voice layer, ElevenLabs, Inworld, Cartesia, Google Cloud TTS, AWS Polly, Azure AI Speech, and VoiceRun are closer comparisons.
Among the specialist TTS vendors, the split is fairly clean: Cartesia leads on raw first-audio speed and deployment flexibility, Inworld leads on per-character cost at volume, and ElevenLabs leads on voice quality and ecosystem maturity. Here is how the three compare on the dimensions that matter for a live agent.
| Vendor | Realtime model | Vendor-reported latency | Streaming | Languages | Pricing |
|---|---|---|---|---|---|
| ElevenLabs | Flash v2.5 | ~75 ms model latency (excludes network) | HTTP streaming + WebSocket stream-input | 32 (70+ on v3, which is not realtime) | ~$0.05 per 1K characters API rate (Flash/Turbo) |
| Inworld | Realtime TTS 1.5 Mini / Max (TTS-2 in research preview) | ~120 ms median (Mini) to under 250 ms P90 (Max), model layer only | REST, HTTP chunked, and WebSocket | 15 GA (200+ on TTS-2, mostly experimental) | $15–35 per 1M characters on demand; ~$5–12.50 with commitment |
| Cartesia | Sonic 3.5 | Sub-90 ms claimed (~188 ms P50 for the earlier Sonic-3 in one third-party production test) | WebSocket with LLM input streaming, plus REST and SSE | 42 | Credit-based plans, roughly $37–50 per 1M characters |
Then narrow it by latency needs:
- Use a streaming-first system if the model speaks while it thinks.
- Use a provider with speech marks or timing data if the UI needs captions or avatar sync.
- Use custom voice or SSML if the output has to sound consistent across a lot of scripted turns.
- Use a platform with telephony support if the agent will live on phone calls, not just in a browser.
For teams I've seen get stuck, the usual mistake is over-weighting voice timbre and under-weighting response timing. The better test is simple: can the agent begin speaking quickly, stay responsive when the user interrupts, and keep the turn moving without awkward buffering? If the answer is no, the voice quality is usually beside the point.
Frequently Asked Questions
What matters more for voice agents, quality or latency?
Latency usually matters first. A voice that sounds polished but starts late still feels clumsy in conversation. Once the response timing is acceptable, voice quality and control become easier to judge.
Is streaming TTS always better than non-streaming TTS?
For voice agents, usually yes. Streaming lets audio start before the full response is complete, which shortens the pause between the end of the user's turn and the agent's first audio. Non-streaming can still work for short, scripted replies.
When does speech-to-speech make sense instead of classic TTS?
It makes sense when the whole interaction is voice-first and the system needs to move quickly through speech, tool use, and telephony. If the agent already depends on text output from an LLM, classic TTS is still easier to control.
Do I need SSML for a voice agent?
Not always, but it helps when the agent needs precise pacing, emphasis, or pronunciation control. It is more useful in scripted flows and enterprise workflows than in casual one-off responses.
What is the biggest mistake teams make when choosing TTS?
Picking a voice by demo quality alone. The better test is how the system behaves in a full turn: first audio, interruption handling, and whether the agent keeps pace when the text arrives in chunks.
References
1. VoiceRun latency metrics: https://voicerun.com/docs/latency-metrics/index.html
2. Google Cloud streaming TTS: https://cloud.google.com/text-to-speech/docs/create-audio-text-streaming
3. VoiceRun agent-building overview: https://docs.voicerun.com/agent-building/overview/index.html
4. VoiceRun pricing: https://voicerun.com/pricing/
5. ElevenLabs models documentation: https://elevenlabs.io/docs/overview/models
6. ElevenLabs, Eleven v3 general availability: https://elevenlabs.io/blog/eleven-v3-is-now-generally-available
7. ElevenLabs WebSocket stream-input API: https://elevenlabs.io/docs/api-reference/text-to-speech/v-1-text-to-speech-voice-id-stream-input
8. ElevenLabs API pricing: https://elevenlabs.io/pricing/api
9. ElevenLabs Agents pricing: https://elevenlabs.io/pricing/agents
10. Inworld TTS models documentation: https://docs.inworld.ai/tts/tts-models
11. Inworld, Realtime TTS-2 announcement: https://inworld.ai/blog/realtime-tts-2
12. Inworld latency best practices: https://docs.inworld.ai/tts/best-practices/latency
13. Inworld pricing: https://inworld.ai/pricing
14. BusinessWire, Inworld price cut (June 10, 2026): https://www.businesswire.com/news/home/20260610968386/en/Inworld-Cuts-Prices-to-Take-Down-the-Biggest-Wall-in-Consumer-AI-Cost
15. Inworld voice cloning documentation: https://docs.inworld.ai/tts/voice-cloning
16. Inworld language support: https://docs.inworld.ai/tts/capabilities/multilingual
17. Cartesia Sonic product page: https://www.cartesia.ai/sonic/
18. Cartesia docs, Sonic 3.5: https://docs.cartesia.ai/build-with-cartesia/tts-models/latest
19. Cartesia TTS WebSocket API reference: https://docs.cartesia.ai/api-reference/tts/tts
20. Cartesia pricing: https://cartesia.ai/pricing
21. Cartesia deployments (cloud, on-prem, on-device): https://cartesia.ai/deployments
22. Gradium, Coval production benchmark (competitor source): https://gradium.ai/content/best-ai-voice-generators-2026
