Text To Speech API For Voice Agents: 8 Best Options In 2026

Eight TTS APIs compared for production voice agents — streaming behavior, voice quality, multilingual coverage, telephony fit, and how each one plugs into an agent stack.

If you're building a voice agent, the TTS layer is where a lot of good demos fall apart. The voice sounds flat, latency drags, barge-in gets messy, or the API looks fine until you try to wire it into a real conversation flow. I've spent years covering speech and agent infrastructure, and I tested and researched these options with a focus on what actually matters in production: streaming behavior, voice quality, multilingual coverage, telephony fit, and how well each one plugs into an agent stack. In this article, I'm comparing eight of the strongest choices for teams evaluating spoken output for voice agents, including standalone TTS vendors, cloud platforms, and a code-first runtime that lets you swap providers without rebuilding the rest of your pipeline. You'll see where each option is strongest, where it breaks down, and which one makes the most sense for your use case.

In practice, the comparison came down to a few repeated checks: how fast speech starts, whether the system can keep streaming while the next turn is still forming, what the voice sounds like after a few minutes instead of a five-second sample, and whether the API fits a call flow without a lot of glue code. That is also where the differences between a pure TTS vendor and a voice-agent runtime start to matter.
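The first of those checks, how fast speech starts, is easy to measure the same way for every vendor. The sketch below is a minimal, provider-agnostic timer: it assumes only that a vendor SDK can be wrapped in a function returning an iterable of audio byte chunks. The names `measure_stream` and `StreamTiming` are illustrative and not part of any vendor's API.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamTiming(NamedTuple):
    first_chunk_s: float  # seconds from request start to first non-empty chunk
    total_s: float        # seconds from request start to end of stream
    total_bytes: int      # audio bytes received


def measure_stream(
    start_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Time a streaming TTS call.

    `start_stream` is a hypothetical wrapper: it should issue the request
    and return an iterable of audio chunks from whichever SDK is in use.
    """
    started = clock()
    first = None
    total_bytes = 0
    for chunk in start_stream():
        if not chunk:
            continue  # ignore keep-alive or empty frames
        if first is None:
            first = clock() - started
        total_bytes += len(chunk)
    total = clock() - started
    # If no audio arrived at all, report the full duration as the wait.
    return StreamTiming(first if first is not None else total, total, total_bytes)
```

Run it many times per provider and compare medians and tail values, not a single call. The number includes network time, so it will read higher than model-side figures published by vendors.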

How I Evaluated These Voice Agent TTS APIs

For this comparison, I looked at each API the way a team building live voice agents would. That meant checking whether the product could handle partial output cleanly, whether interruptibility was built in, and whether the voice could be routed through a phone stack without odd format conversions. I also paid attention to how much control each vendor gives over speed, voice selection, and style, since those are the knobs teams usually touch after the first demo.

I also separated standalone TTS from broader agent platforms. A few vendors sell only speech output. Others bundle speech-to-text, orchestration, and delivery in one stack. That difference changes the evaluation. If the team already has a server, a turn manager, and a telephony layer, a standalone TTS API can be enough. If the team wants to swap voices, log latency, and test interruption behavior inside one runtime, the comparison changes.

The last pass was about deployment fit. I checked output formats, streaming methods, multilingual support, and whether the provider publishes enough detail to estimate how it will behave under load. Voice quality matters, but in voice agents the slowest part of the stack usually decides the user experience first.

Text To Speech API For Voice Agents Comparison Table

| Platform | Streaming | Voice controls | Languages | Telephony fit | Pricing / access | Best fit |
| --- | --- | --- | --- | --- | --- | --- |
| VoiceRun | Event-based, provider-dependent | Interruptible control, provider voice selection, speed where supported | Depends on selected provider | Strong: SIP Trunking and Direct WebSocket connectors included, covering PSTN, PBX/CCaaS, and Twilio Media Streams | Public pay-as-you-go: $0.030/min full platform ($0.015/min Audio Runtime alone) + provider costs as pass-through or BYO keys | Teams that want one runtime with multiple TTS providers |
| ElevenLabs | Yes, including low-latency options | Voice cloning, voice design, seed control | 29 to 70+ depending on model | Good, with telephony-ready output formats | Self-serve pricing starts at $6/month; PAYG pricing introduced in 2026 | Teams prioritizing voice quality and expressive delivery |
| OpenAI | Yes, chunked streaming | Voice selection, style instructions, speed, tone | Broad, but voices are optimized for English | Good for app and agent stacks already using OpenAI | API usage pricing | Teams already using OpenAI for the rest of the stack |
| Amazon Polly | Yes, with generative bidirectional streaming | SSML, multiple engines, neural and generative voices | Broad AWS voice coverage | Strong, especially in AWS and Connect flows | Pay per synthesized text | AWS teams and enterprise call flows |
| Google Cloud Text-to-Speech | Yes, including bidirectional streaming | Style and voice family options, Gemini-TTS controls | 75+ languages | Good for conversational agents and streaming apps | Per-character pricing | Teams that need broad language coverage |
| Azure AI Speech | Yes | SSML, custom voice, neural HD voices | Broad enterprise coverage | Strong for compliance-heavy environments | Usage-based, enterprise controls | Teams that need custom voice and Azure governance |
| Deepgram | Yes, low-latency streaming | Voice selection inside bundled stack | Broad, model-dependent | Strong for live conversational systems | Usage-based | Teams that want bundled voice infrastructure |
| Inworld | Yes, WebSocket and HTTP chunked streaming | Instant voice cloning, natural-language steering on TTS-2 | 15 production languages; 200+ experimental on TTS-2 | Good, with Pipecat, LiveKit, and Twilio integration paths | From $15/1M characters on demand; lower on subscription and enterprise tiers | Cost-sensitive, high-volume realtime voice agents |

VoiceRun

VoiceRun is not a standalone TTS vendor. It is the layer that lets a team use different voice providers inside one voice-agent runtime, which is a useful distinction if the work is more than a prompt-and-playback demo. In the docs, TTS is exposed as an event in the conversation flow, so synthesis is part of the agent pipeline rather than a separate endpoint bolted on afterward. That is the kind of structure that helps when the call flow needs to change quickly.

The practical value is provider flexibility. VoiceRun orchestrates across 13 TTS providers (OpenAI, Azure, Google Chirp, Cartesia, ElevenLabs, Fish Audio, Gradium, Inworld, MiniMax, Qwen3, xAI Grok, its own Prim Voices, and custom voices), and switching is a configuration change, not a code change. The platform also exposes interruptibility control on every provider, so an utterance can be marked non-interruptible when the agent needs to finish a line before yielding. For teams handling barge-in and turn-taking, that matters more than a glossy voice demo. VoiceRun also publishes a comparison table for per-provider streaming, voice instructions, speed control, caching, and interruptible control, which makes it easier to see where a given provider fits inside the same runtime.[1]

Pricing is public and modular rather than a bundled per-minute rate: the Audio Runtime is $0.015/min, the full platform is $0.030/min, and provider costs are passed through at published rates, or you bring your own provider keys and pay the vendors directly.
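To make the interruptibility idea concrete, here is a generic sketch of how a runtime might treat barge-in when some utterances are marked non-interruptible. It is not VoiceRun's API. `SpeechQueue`, `Utterance`, and the choice to drop queued interruptible lines on barge-in are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Utterance:
    text: str
    interruptible: bool = True


class SpeechQueue:
    """Hypothetical queue of agent speech with barge-in handling."""

    def __init__(self) -> None:
        self._pending: Deque[Utterance] = deque()
        self.current: Optional[Utterance] = None  # utterance being played

    def say(self, text: str, interruptible: bool = True) -> None:
        self._pending.append(Utterance(text, interruptible))

    def next(self) -> Optional[Utterance]:
        """Advance to the next queued utterance, or None if the queue is empty."""
        self.current = self._pending.popleft() if self._pending else None
        return self.current

    def barge_in(self) -> List[Utterance]:
        """The caller started talking; return the utterances that were dropped.

        Interruptible speech (playing or queued) is discarded.
        Non-interruptible speech is kept so the agent can finish the line.
        """
        dropped: List[Utterance] = []
        if self.current is not None and self.current.interruptible:
            dropped.append(self.current)
            self.current = None
        kept: Deque[Utterance] = deque()
        for utterance in self._pending:
            if utterance.interruptible:
                dropped.append(utterance)
            else:
                kept.append(utterance)
        self._pending = kept
        return dropped
```

In this sketch a required disclosure queued with `interruptible=False` survives a barge-in, while ordinary replies are cut. A real runtime also has to stop audio that is already buffered downstream, which this example does not model.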

Pros

  • Lets teams swap voice providers without rewriting the rest of the agent stack.
  • Treats TTS as an event in the conversation flow.
  • Includes interruptibility controls that are useful in live calls.
  • Fits a code-first, CLI-based workflow.

Cons

  • It is not a pure TTS vendor, so teams only shopping for speech output may find it broader than needed.
  • The per-minute platform fee sits on top of provider TTS rates, which only pays off if you are using the runtime, not just the voices.
  • Voice quality depends on the provider selected inside the runtime.

ElevenLabs

ElevenLabs is the clearest specialist choice in this set if the main concern is how the voice actually sounds in a live agent. The company's docs describe the API as a way to turn text into lifelike audio with control over pacing and emotional awareness. The current model set includes Eleven v3, Eleven Multilingual v2, Eleven Flash v2.5, and Turbo v2.5. Flash v2.5 is documented at about 75 ms latency, and the docs say it supports 32 languages. Multilingual v2 covers 29 languages, while Eleven v3 is listed at 70+ languages.[2]

ElevenLabs also has a more explicit voice-agent workflow than many TTS vendors. Its Speech Engine is designed around a browser-to-server loop where speech is transcribed, passed through your server logic, and synthesized back out. The voice catalog is large, the output formats include telephony-friendly options like μ-law and A-law, and the product supports cloning and voice design. One detail that comes up in real testing is that the output is nondeterministic by default, so the seed parameter matters when teams want a more repeatable voice across runs.

Pros

  • Strong voice quality and expressive delivery.
  • Low-latency options are documented.
  • Large voice library with cloning and voice design.
  • Telephony-friendly output formats.

Cons

  • Pricing changed recently, so teams need to recheck current tiers.
  • Output can vary unless a seed is used.
  • It is still a specialist TTS layer, not a full agent runtime.

OpenAI

OpenAI's speech stack is a natural fit for teams already building around its broader API ecosystem. The Audio API exposes a speech endpoint built on gpt-4o-mini-tts, with streaming support and output formats including mp3, opus, aac, flac, wav, and pcm. The docs also note a 4096-character maximum input for the speech endpoint, which is the sort of detail that matters when agents produce long answers and you need to split them before synthesis.[3]
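Splitting before synthesis does not need much code. The sketch below packs whole sentences into chunks that stay under the 4096-character limit and hard-splits only when a single sentence is longer than the limit. The function name and the sentence-boundary regex are illustrative choices, not anything from OpenAI's SDK.

```python
import re
from typing import List

MAX_CHARS = 4096  # speech endpoint input limit cited above


def split_for_tts(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into chunks of at most `limit` characters.

    Prefers sentence boundaries (., !, ? followed by whitespace) so each
    chunk can be synthesized with natural prosody at its edges.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit has to be cut mid-sentence.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

For a voice agent it is usually worth using a much smaller limit than the maximum, since shorter first chunks let playback begin sooner.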

The voice controls are fairly direct. OpenAI documents instruction-based control over accent, emotional range, intonation, speed, tone, and whispering. That makes the API useful for teams that want to steer delivery without building a separate style layer. The caveat is also in the docs: the built-in voices are optimized for English. That does not mean other languages are blocked, but it does mean teams should test output quality by language instead of assuming a single voice setting will behave the same everywhere.

Pros

  • Easy fit for teams already using OpenAI models.
  • Streaming is built into the speech workflow.
  • Clear control over delivery style through instructions.
  • Multiple output formats, including raw pcm that can be transcoded for telephony.

Cons

  • Built-in voices are optimized for English.
  • The voice layer is still tied to the broader OpenAI stack.
  • Not the broadest choice if the main need is voice inventory.

Amazon Polly

Amazon Polly is the long-running enterprise option in this group, and the recent changes make it more relevant to voice agents than it used to be. Polly supports plain text and SSML, and AWS now offers four voice engines: Standard, Neural, Long-form, and Generative. On March 20, 2026, AWS announced 10 new generative voices, two additional regions, and a bidirectional streaming API for the Generative engine. That is a concrete step toward lower-friction live synthesis in conversation flows.[4]

The practical appeal of Polly is stability and fit inside AWS-centric stacks. The output formats are straightforward, telephony support is well documented, and AWS also notes that Polly voices can be used in Amazon Connect flows. For teams already handling contact-center traffic in AWS, that can reduce the number of moving parts. Polly is less focused on expressive cloning than some specialist vendors, but it is still a realistic production choice for teams that care about SSML, regional coverage, and predictable billing.
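For teams leaning on SSML, the main practical risk is feeding unescaped agent text into markup. The sketch below builds a small SSML document from plain sentences using only the Python standard library. `build_ssml` is a hypothetical helper, and it uses the standard `<break>` and `<prosody>` elements; which tags a given Polly engine accepts varies, so check the engine's documented tag support before relying on them.

```python
from typing import Iterable, Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    sentences: Iterable[str],
    pause_ms: int = 300,
    rate: Optional[str] = None,
) -> str:
    """Wrap plain-text sentences in SSML with a fixed pause between them.

    Text is XML-escaped so characters like & and < in LLM output cannot
    break the markup. `rate` is an optional prosody rate such as "90%".
    """
    pause = f'<break time="{int(pause_ms)}ms"/>'
    body = pause.join(escape(sentence) for sentence in sentences)
    if rate is not None:
        body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    return f"<speak>{body}</speak>"
```

Calling `build_ssml(["Tom & Jerry", "Goodbye"])` produces `<speak>Tom &amp; Jerry<break time="300ms"/>Goodbye</speak>`, which can then be sent as SSML-typed input to the synthesis call.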

Pros

  • Strong fit for AWS and contact-center workflows.
  • SSML support is mature.
  • New generative voices and bidirectional streaming were added in 2026.
  • Clear output format support for speech delivery.

Cons

  • Less focused on cloning and highly expressive character work.
  • Voice selection can feel more utilitarian than specialist TTS tools.
  • Teams outside AWS may not get the same advantage from the stack.

Google Cloud Text-to-Speech

Google's TTS line is broader than many buyers first assume. The product page lists 380+ voices across 75+ languages and variants, and the newer voice families are aimed at conversational use cases. Google's voice docs show Chirp 3: HD voices marked GA for conversational agents, with streaming support, and the API reference includes bidirectional StreamingSynthesize for real-time synthesis. For teams that care about language coverage and incremental playback, that is a solid combination.[5]

The newer Gemini-TTS direction also matters. Google documents prompt-steerable style, accent, pace, tone, emotional expression, and single- or multi-speaker speech. The pricing page lists Chirp 3: HD voices at US$0.00003 per character, or US$30 per 1 million characters. That makes Google a practical option for teams that need scale and a wide language menu, though it is worth separating the older voice families from the newer conversational ones when comparing output quality.

Pros

  • Very broad language and voice coverage.
  • Streaming and bidirectional synthesis are documented.
  • Clear per-character pricing.
  • Newer voice families are aimed at conversational agents.

Cons

  • The product family is easy to misread as one uniform system.
  • Older and newer voices do not behave the same way.
  • Teams need to test the exact voice family, not just the brand name.

Azure AI Speech

Azure AI Speech is usually the choice for teams that care about enterprise controls, custom voice, and governance. Microsoft documents SSML, batch synthesis, custom voice, and neural HD voices in the same service family. The custom voice pipeline uses a text analyzer, a neural acoustic model, and a neural vocoder. Microsoft also says its Azure neural HD voices are LLM-based and optimized for dynamic conversations, which puts them in the same general category as other agent-oriented voices in this list.[6]

There are guardrails here as well. Microsoft limits access to custom voice based on eligibility and usage criteria, which can slow down projects that need a quick pilot. That is often the trade-off with Azure: more controls, more review, and a more formal path to production. For regulated teams, that can be acceptable. For a smaller product team trying to ship fast, it may be friction.

Pros

  • Strong custom voice and SSML support.
  • Good fit for enterprise governance and compliance needs.
  • Neural HD voices are aimed at dynamic conversations.
  • Well suited to Azure-centered infrastructure.

Cons

  • Custom voice access is restricted.
  • Setup can be heavier than lighter-weight TTS APIs.
  • Less convenient if the rest of the stack is outside Microsoft.

Deepgram

Deepgram has pushed Aura TTS as part of a broader voice-agent stack rather than a narrow speech-only product. The company describes Aura as built for responsive conversational agents and documents sub-200 ms streaming TTS. It also exposes WebSocket streaming, which is the kind of interface teams usually want when they are trying to keep playback tight and avoid unnecessary buffering.[7]

The other reason Deepgram belongs in this comparison is that it bundles speech-to-text, LLM orchestration, and TTS in one product family. That can reduce integration work if the goal is a full conversational stack rather than a single audio endpoint. The trade-off is the usual one with bundled platforms: the system is simpler to wire up, but the team is also committing to the platform's shape.

Pros

  • Low-latency positioning is clear.
  • WebSocket streaming is available.
  • Bundled voice-agent stack can reduce integration work.
  • Useful for teams building around one vendor for STT and TTS.

Cons

  • The platform is broader than a pure TTS API.
  • Teams may not need the full bundled stack.
  • Voice selection is narrower than larger catalog providers.

Inworld

Inworld is the price-performance entry in this group, and its Realtime TTS line is aimed squarely at realtime voice agents. The current lineup pairs Realtime TTS-2, a research preview announced in May 2026 with natural-language steering and 200+ languages and locales, with the production-grade Realtime TTS 1.5 Max and Mini models released in January 2026. The vendor-reported latency figures are aggressive: roughly 120 ms median time-to-first-audio for Mini and under 250 ms at P90 for Max, measured at the model layer and explicitly excluding network time.[8]

Streaming comes in three modes: a REST endpoint, HTTP chunked streaming, and a WebSocket API that Inworld documents as its lowest-latency option. JWT auth supports streaming directly to client devices without a server proxy.[9] There are also first-party Pipecat and LiveKit integrations and an official Twilio telephony guide, which shortens the wiring work in an agent stack.

The pricing is the headline, but it needs careful reading. After a more than 50% across-the-board price cut announced on June 10, 2026,[10] on-demand list prices run $15 per million characters for Mini, $25 for TTS-2, and $35 for Max, dropping through subscription tiers to roughly $7 to $17.50 per million and reaching the much-quoted $5 per million only at enterprise, at-scale rates.[11] Instant voice cloning from 5 to 15 seconds of audio is free on every tier, and the free plan includes up to 70 minutes of synthesis, though professional cloning is sales-gated and audio downloads require a paid plan. The other caveats: the production 1.5 models cover 15 languages — much of the 200+ figure is experimental TTS-2 locales with reduced feature support[12] — and compliance features such as zero data retention, HIPAA, and BAAs are paid add-ons starting at the $1,500 per month tier.
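Per-character prices are easier to compare once they are converted into a monthly bill. The sketch below does that arithmetic for the three on-demand list prices quoted above. The figure of 900 characters per minute of agent speech is an assumption for illustration only; measure your own transcripts before budgeting.

```python
from typing import Dict

# On-demand list prices in USD per 1M characters, as quoted in the text above.
ON_DEMAND_USD_PER_MILLION: Dict[str, float] = {
    "Mini": 15.0,
    "TTS-2": 25.0,
    "Max": 35.0,
}

# Assumption for illustration: characters of text per minute of agent speech.
ASSUMED_CHARS_PER_MINUTE = 900


def monthly_cost_usd(
    speaking_minutes: float,
    usd_per_million: float,
    chars_per_minute: int = ASSUMED_CHARS_PER_MINUTE,
) -> float:
    """Estimated monthly TTS spend for a given amount of synthesized speech."""
    characters = speaking_minutes * chars_per_minute
    return round(characters / 1_000_000 * usd_per_million, 2)


def compare(speaking_minutes: float) -> Dict[str, float]:
    """Monthly cost per model at on-demand list prices."""
    return {
        model: monthly_cost_usd(speaking_minutes, rate)
        for model, rate in ON_DEMAND_USD_PER_MILLION.items()
    }
```

Under that assumption, 100,000 minutes of agent speech is 90 million characters, or $1,350 on Mini, $2,250 on TTS-2, and $3,150 on Max at on-demand rates. Note that the input is minutes of synthesized speech, not total call minutes.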

Pros

  • Aggressive per-character pricing that falls further with committed spend.
  • Roughly 120 ms median model-side time-to-first-audio on the Mini model, with WebSocket streaming.
  • Free instant voice cloning from 5 to 15 seconds of audio on all tiers.
  • First-party Pipecat, LiveKit, and Twilio integration paths for agent stacks.

Cons

  • Realtime TTS-2 is still a research preview, not generally available.
  • The production-grade 1.5 models cover only 15 languages.
  • The cheapest rates require committed monthly spend or an enterprise contract.
  • Compliance features like zero data retention and HIPAA are gated to higher tiers.

How To Choose The Right TTS API For Your Voice Agent

The shortlist usually narrows fast once the real constraints are visible. If the team needs a flexible agent runtime and expects to change TTS providers over time, VoiceRun is the most direct fit because the voice layer sits inside the workflow, not outside it. If the main requirement is how the voice sounds, ElevenLabs is the first place to look. If the team already uses OpenAI for the rest of the product, keeping speech in the same ecosystem is usually the simplest path.

For AWS, Google, and Azure teams, the choice tends to hinge on infrastructure rather than voice alone. Polly fits AWS-heavy call flows and now has bidirectional generative streaming. Google is stronger when the target list includes many languages or newer conversational voice families. Azure is the safer pick when the project needs custom voice, SSML, and tighter governance. Deepgram makes sense when the team wants a bundled voice-agent stack with low-latency output and does not want to assemble each layer separately. Inworld is the price-performance pick when TTS spend dominates the unit economics, as long as the team is comfortable with a 15-language production set and committed-spend pricing for the best rates.

A simple way to decide is to ask three questions:

  • Do I want a pure speech API or a full agent runtime?
  • Does my use case depend more on voice quality, latency, or compliance?
  • Am I optimizing for one provider or for the ability to swap providers later?

For most teams, the right answer is not the longest feature list. It is the system that keeps latency predictable and does not turn every voice change into a rebuild.

Frequently Asked Questions

What is the difference between a TTS API and a voice agent platform?

A TTS API only turns text into speech. A voice agent platform usually includes speech-to-text, orchestration, turn handling, and deployment tools in the same system.

Which TTS API is best for low-latency voice agents?

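ElevenLabs, Deepgram, Inworld, and OpenAI all support streaming aimed at low latency. Of these, the vendor-documented figures cited above are about 75 ms for ElevenLabs Flash v2.5, sub-200 ms for Deepgram Aura, and roughly 120 ms median model-side time-to-first-audio for Inworld's Mini model. VoiceRun is useful when latency matters but the team also wants provider flexibility inside one runtime.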

Which TTS API has the best multilingual support?

Google Cloud Text-to-Speech has one of the broadest public voice inventories, with 380+ voices across 75+ languages and variants. ElevenLabs also has broad multilingual coverage, depending on the model.

Which option is best for enterprise compliance?

Azure AI Speech is usually the first place to look when custom voice access, governance, and enterprise controls matter more than quick self-serve setup.

Can these TTS APIs be used for phone calls?

Yes, but the fit varies. Look for streaming support, telephony-friendly formats such as μ-law or A-law, and clear interruptibility behavior before choosing one for live calls.
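When a provider only returns linear PCM, the conversion to μ-law is a small piece of glue code. The sketch below is a plain-Python G.711 μ-law encoder for 16-bit little-endian PCM. It assumes the audio has already been resampled to the rate the phone leg expects (commonly 8 kHz), which this example does not do, and the function names are illustrative.

```python
import struct

_BIAS = 0x84   # bias added before finding the segment, per G.711 μ-law
_CLIP = 32635  # largest magnitude that fits after the bias is added


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as one μ-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), _CLIP) + _BIAS
    # Find the segment (exponent): position of the highest set bit, 7 down to 0.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # μ-law bytes are stored bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16le_to_ulaw(pcm: bytes) -> bytes:
    """Convert 16-bit little-endian mono PCM to μ-law, one byte per sample."""
    if len(pcm) % 2:
        raise ValueError("PCM buffer must contain whole 16-bit samples")
    samples = struct.unpack(f"<{len(pcm) // 2}h", pcm)
    return bytes(linear_to_ulaw(s) for s in samples)
```

A pure-Python loop like this is fine for testing and low call volumes. For production traffic, a provider that emits μ-law directly or a native transcoder avoids the per-sample overhead.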

References

  1. docs.voicerun.com
  2. elevenlabs.io
  3. platform.openai.com
  4. aws.amazon.com
  5. cloud.google.com
  6. learn.microsoft.com
  7. deepgram.com
  8. docs.inworld.ai — TTS models
  9. docs.inworld.ai — latency best practices
  10. businesswire.com
  11. inworld.ai/pricing
  12. docs.inworld.ai — language support