If your voice agent sounds great in demos but falls apart on live calls, the STT layer is usually where the cracks show first. You need fast partials, clean endpointing, and transcripts that arrive in time for the agent to respond naturally, not a batch transcription tool pretending to be real-time. I've spent years covering and testing voice infrastructure products, and for this review I focused on how each option behaves inside an actual agent workflow: streaming support, interruption handling, language coverage, pricing transparency, and how much of the stack each vendor really gives you. This list compares six of the strongest options for teams building production voice agents: dedicated transcription APIs, broader voice platforms that include STT as one layer, and an orchestration layer that lets you pick the best engine for each use case instead of betting on a single vendor. Here's a breakdown of which speech-to-text API fits each kind of build.
How I Evaluated These Speech To Text APIs
I looked at these products the way a team would actually use them in production. That meant starting with streaming behavior, not just transcript quality on clean sample audio. For a voice agent, the first question is whether the API can keep up with real speech, including pauses, interruptions, and overlapping words from telephony audio.
I also checked how each vendor handles turn-taking. Some products expose explicit voice activity detection or turn detection. Others leave more of that logic to the application layer. That difference matters when a caller cuts off a prompt halfway through or starts speaking before the agent finishes its reply.
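To make the turn-taking point concrete, here is a minimal sketch of silence-timeout endpointing, the simplest form of the logic that ends up in the application layer when a vendor does not provide turn detection. It assumes each audio frame has already been labeled speech or non-speech by some VAD. The `Frame` type, the 20 ms frame size, and the 500 ms threshold are illustrative assumptions, not any vendor's defaults.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Frame:
    """One fixed-length audio frame, already classified by some VAD (hypothetical type)."""
    start_ms: int
    is_speech: bool


def find_turns(frames: Iterable[Frame], frame_ms: int = 20,
               endpoint_silence_ms: int = 500) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each user turn.

    A turn opens on the first speech frame and closes once trailing silence
    reaches endpoint_silence_ms. end_ms is the end of the last speech frame,
    not the moment the endpoint fired.
    """
    turns: List[Tuple[int, int]] = []
    turn_start = None
    last_speech_end = None
    for frame in frames:
        if frame.is_speech:
            if turn_start is None:
                turn_start = frame.start_ms
            last_speech_end = frame.start_ms + frame_ms
        elif turn_start is not None:
            # Silence measured from the end of the last speech frame.
            silence = frame.start_ms + frame_ms - last_speech_end
            if silence >= endpoint_silence_ms:
                turns.append((turn_start, last_speech_end))
                turn_start = None
    if turn_start is not None:
        # Audio ended mid-turn; close the turn at the last speech frame.
        turns.append((turn_start, last_speech_end))
    return turns
```

The threshold is the tradeoff in miniature: a short timeout splits one sentence into two turns whenever the caller pauses, while a long one makes the agent feel slow to respond.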
Pricing was another filter. Per-minute STT rates are easy to compare on paper, but they do not tell the full story once TTS, LLM calls, retries, and orchestration are part of the same session. I treated bundled voice platforms differently from pure transcription APIs for that reason.
Finally, I looked at the amount of control each stack gives the buyer. A team that only wants a transcription endpoint has different needs from a team that wants observability, telephony, deployment controls, and a way to inspect turn-by-turn latency after the call. That is where the line between STT API and voice platform gets more useful than the marketing page.
Speech To Text API For Voice Agents Comparison Table
| Provider | Streaming support | Turn detection / interruption handling | Multilingual support | Pricing visibility | Stack depth |
|---|---|---|---|---|---|
| VoiceRun | Yes, realtime streaming across 9 orchestrated STT models | Built in; STT model, language, and endpointing configurable per deployment, even mid-conversation | Depends on the STT engine you orchestrate | Public modular pricing: $0.015/min Audio Runtime, STT pass-through at published provider rates or BYO keys | Full voice-agent control plane and STT orchestrator |
| Deepgram | Yes | Built in for Flux | 10 languages for Flux, 50+ for Nova-3 | Public per-minute pricing | STT plus voice-agent API |
| AssemblyAI | Yes, WebSocket streaming | Included in voice-agent workflow | Supports code switching and keyterms | Free credits published; per-minute STT rate not confirmed from the public pricing reviewed | STT plus voice-agent API |
| Modulate (Velma Transcribe) | Yes, sub-second WebSocket streaming | Not documented; emotion and behavioral signals exposed instead | 50+ languages | Public per-hour pricing ($0.03/hr batch, $0.06/hr streaming) | STT plus voice-intelligence APIs (emotion, fraud, moderation) |
| OpenAI Realtime API | Yes | Automatic VAD | Broad, depending on model and setup | Usage-based, not compared here at minute level | Multimodal voice platform |
| Google Cloud Speech-to-Text | Yes, bi-directional streaming | single_utterance for command-style use | Broad cloud STT coverage | Usage-based | Classic cloud STT |
VoiceRun
VoiceRun is the one entry on this list that is not trying to win the STT model race at all. It is an orchestrator: a code-first voice-agent control plane that lets you pick the best speech-to-text engine for each use case — nine STT models across Deepgram, OpenAI, Qwen, Cartesia, ElevenLabs, and Soniox at the time of writing — and swap providers through configuration, rather than locking into a single vendor. STT is one configurable layer inside the agent pipeline — the docs describe an event-driven system where the app receives a TextEvent when the user speaks or types, then passes that text into Python logic. In other words, the speech layer feeds the agent, and the agent owns what happens next.
That setup is useful for teams that do not want transcription isolated from the rest of the call, or welded to a single provider's roadmap. The CLI docs show a workflow built around the vr command for creating, testing, deploying, and managing voice agents, with observability, evaluators, experiments, and routing. The command reference also exposes deployment-time STT settings such as --stt-model, --stt-language, and --stt-endpointing. For a production team, that puts the engine choice where it belongs: alongside the rest of the deployment configuration. It can change per deployment, or even mid-conversation because STT settings are runtime-adjustable, without re-architecting the agent. Failover is configuration too: any STT model can be given a fallback chain for automatic failover.
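For readers who have not built failover before, the idea behind a fallback chain can be sketched in a few lines. This is a generic illustration, not VoiceRun's implementation or API: the `Transcriber` signature, the function names, and the stand-in engines are all hypothetical, and no vendor SDK is called.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: takes raw audio bytes, returns transcript text.
Transcriber = Callable[[bytes], str]


class AllProvidersFailed(Exception):
    """Raised when every engine in the chain raised an error."""


def transcribe_with_fallback(audio: bytes,
                             chain: Sequence[Tuple[str, Transcriber]]) -> Tuple[str, str]:
    """Try each (name, transcriber) in order; return (name, text) from the first success."""
    errors: List[str] = []
    for name, transcriber in chain:
        try:
            return name, transcriber(audio)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in engines for illustration only.
def flaky_primary(audio: bytes) -> str:
    raise TimeoutError("primary engine timed out")


def steady_secondary(audio: bytes) -> str:
    return f"transcript of {len(audio)} bytes"
```

Returning the engine name alongside the text is a deliberate choice: when you later inspect a call, you want to know which engine actually produced the transcript.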
A concrete example is a support bot that needs different endpointing behavior for account verification versus free-form troubleshooting, or a multilingual line that performs best on one vendor's English model and another vendor's multilingual one. VoiceRun's deployment settings make that kind of split something you can control with the agent, rather than baking it into a separate transcription service and hoping the rest of the stack keeps up.
In practice, the value shows up when you need to inspect real conversations. VoiceRun's observability docs cover sessions, recordings, transcripts, and per-turn latency metrics — including Time to First Transcription, which is the number an STT debugging session actually needs — and its telephony support covers numbers purchased from VoiceRun or bring-your-own Twilio, Telnyx, and Infobip, with SIP Trunking included in the Audio Runtime. That combination fits a workflow where a missed endpoint or late partial transcript can be traced back through an actual call, not guessed from a single transcript file — and if the trace says the STT engine is the problem, you swap it in the deployment config and keep the rest of the stack intact.
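As a rough illustration of what a per-turn latency number involves, the sketch below computes a time-to-first-transcription figure from timestamped events. It assumes the metric runs from the start of user speech to the first transcript event of that turn. VoiceRun's own definition may differ, and the `TurnEvent` shape and event labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TurnEvent:
    turn_id: int
    kind: str    # "speech_start", "partial", or "final" (hypothetical labels)
    at_ms: int   # milliseconds since the call started


def time_to_first_transcription(events: List[TurnEvent]) -> Dict[int, Optional[int]]:
    """Per turn: ms between speech_start and the earliest transcript event.

    Turns that never produced a transcript map to None, which is itself
    worth flagging when debugging a call.
    """
    starts: Dict[int, int] = {}
    first_text: Dict[int, int] = {}
    for ev in events:
        if ev.kind == "speech_start":
            starts.setdefault(ev.turn_id, ev.at_ms)
        elif ev.kind in ("partial", "final"):
            prev = first_text.get(ev.turn_id)
            if prev is None or ev.at_ms < prev:
                first_text[ev.turn_id] = ev.at_ms
    return {
        turn: (first_text[turn] - start if turn in first_text else None)
        for turn, start in starts.items()
    }
```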
Pricing follows the same modular logic. The Audio Runtime, the layer doing the STT orchestration, is a published $0.015 per minute, and STT costs are pass-through at each provider's published rates, or you bring your own keys and pay the provider directly. There is no bundled per-minute rate hiding provider margin, which makes it one of the easier entries here to model alongside the raw STT vendors covered below.
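The modular pricing is easy to turn into arithmetic. The sketch below stacks the $0.015 per minute Audio Runtime fee on top of a pass-through STT rate, using the Deepgram Flux English pay-as-you-go rate quoted in this review as the example engine. It covers the runtime and STT only; TTS, LLM, and telephony costs are out of scope, and the function name is my own.

```python
from decimal import Decimal

AUDIO_RUNTIME_PER_MIN = Decimal("0.015")  # VoiceRun Audio Runtime, as quoted in this review
FLUX_ENGLISH_PER_MIN = Decimal("0.0065")  # Deepgram Flux English pay-as-you-go, as quoted in this review


def stacked_stt_cost(minutes: int, stt_rate_per_min: Decimal,
                     runtime_rate_per_min: Decimal = AUDIO_RUNTIME_PER_MIN) -> Decimal:
    """Runtime fee plus pass-through STT for a given number of call minutes."""
    return minutes * (stt_rate_per_min + runtime_rate_per_min)


monthly = stacked_stt_cost(10_000, FLUX_ENGLISH_PER_MIN)
print(f"10,000 minutes: ${monthly:.2f}")
```

At 10,000 minutes that comes to $215.00, of which $150.00 is the runtime fee and $65.00 is the pass-through STT.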
Pros
- Orchestrates STT instead of locking you in: pick the best engine per use case and swap providers via configuration, with automatic fallback chains.
- Public, modular pricing: $0.015/min for the Audio Runtime, with STT pass-through at published provider rates or BYO keys.
- CLI, observability, and telephony are part of the same system.
- Deployment-time STT settings are visible in the command surface.
Cons
- The platform fee sits on top of provider STT costs, so the orchestration has to earn its $0.015/min.
- It asks buyers to adopt the broader VoiceRun workflow, and it is code-first — there is no no-code path.
- It is not the simplest option for teams that only want raw transcription.

Deepgram
Deepgram is built around real-time speech, and that shows in the pieces voice-agent teams usually care about first. Flux is the model most directly aimed at conversational audio. It includes built-in turn detection and interruption handling, which is the sort of behavior that matters when someone says “No, wait” before the agent finishes speaking. Nova-3 sits on the other end of the product line for teams that want broad production transcription and more language coverage.
The pricing is public, which makes it easier to model an application before anyone writes code. Flux streaming STT starts at $0.0065 per minute for English pay-as-you-go and $0.0077 per minute for multilingual pay-as-you-go. Deepgram's Voice Agent API is priced separately, with Standard at $0.075 per minute and Advanced at $0.163 per minute. That spread matters when you compare a plain transcription pipeline with a bundled voice-agent stack.
A realistic workflow here is an appointment-setting bot on a call center line. If a caller interrupts the opening script to say they already have the reference number, Flux is the part that needs to catch the interruption cleanly so the agent can move on without repeating itself. That is the difference between a transcript that looks fine in a log and one that helps the conversation stay on track.
Deepgram also gets used by teams that need a concrete reference point for latency-sensitive call flows. A support agent that has to interrupt a script mid-sentence or a scheduling assistant that needs to react to a caller speaking over a prompt are both better aligned with Flux than with a basic batch transcription tool.
Pros
- Public per-minute pricing is easy to inspect.
- Flux is designed for interruption-heavy live speech.
- Nova-3 covers 50+ languages.
Cons
- The bundled voice-agent API costs more than raw STT.
- Teams that only need transcription may not use the rest of the platform.
- With separate rates for Flux English, Flux multilingual, and two Voice Agent API tiers, the pricing structure takes a minute to untangle.
AssemblyAI
AssemblyAI's streaming STT is WebSocket-based and geared toward live audio rather than post-call cleanup. The product page points to context-aware transcription, audio tags, verbatim output, keyterms, speaker roles, and code switching. Those are not decorative features. They are the kind of settings that help when a caller uses a product name, a street address, or a second language phrase that a plain model would flatten.
The company also ships a Voice Agent API that combines speech understanding, LLM reasoning, voice generation, turn detection, and interruption handling in one WebSocket interface. That makes AssemblyAI feel closer to a full voice workflow than a standalone transcription box, but the streaming STT product is still the entry point that matters for buyers who want more control over the agent architecture.
A concrete example is a claims intake line where a caller gives a policy number, then switches into a different language while describing the issue. In that case, keyterms, speaker roles, and code switching are doing the practical work. The product has to preserve the number accurately, keep speaker turns straight, and avoid flattening the language switch into a worse transcript.
The public pricing I could confirm for this review shows $50 in credits on the free offer, but not the exact per-minute STT rate. That limits cost comparison from public docs alone, although the feature set is clear enough to evaluate the live call path before signup.
Pros
- WebSocket streaming fits real-time voice work.
- Context-aware transcription covers code switching, keyterms, and speaker roles.
- Voice-agent features are documented alongside STT.
Cons
- The public pricing I could confirm does not show the STT unit rate.
- The bundled API is broader than a transcription-only buyer may want.
- Some cost planning still depends on sales or deeper pricing detail.
Modulate
Modulate is the newest entrant on this list, and the most unusual. The Boston voice-AI company built its reputation on ToxMod, the voice moderation system Activision deploys in Call of Duty, and has since turned that listening expertise into a voice-intelligence platform called Velma. Its transcription product, Velma Transcribe, launched in March 2026 with an aggressive pitch: high-accuracy, low-latency transcription at 90% lower cost per hour than other leading providers, aimed squarely at real-time voice agents, call center platforms, and social apps.[2]
The pricing is the headline. Batch transcription runs $0.03 per hour of audio and streaming runs $0.06 per hour — the lowest published per-hour STT rates of any major provider in this group — with speaker diarization included free and a free tier worth roughly 400 hours of batch transcription.[4] The feature list covers what a voice-agent team would expect: REST batch endpoints, sub-second WebSocket streaming, 50+ languages, timestamps, emotion detection across 20+ emotions, accent identification, and redaction of 94 PII and PHI types, with the models tuned for overlapping speakers and noisy, messy real-world audio.[3]
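Because Modulate quotes per hour and Deepgram quotes per minute, the rates need a common unit before they can be compared. A small sketch of that normalization, using only figures quoted in this review (the function and dictionary names are my own):

```python
from decimal import Decimal


def per_hour_to_per_minute(rate_per_hour: Decimal) -> Decimal:
    """Convert a per-hour audio rate to a per-minute rate."""
    return rate_per_hour / Decimal(60)


# All figures are the published rates quoted in this review.
rates_per_minute = {
    "Modulate batch": per_hour_to_per_minute(Decimal("0.03")),
    "Modulate streaming": per_hour_to_per_minute(Decimal("0.06")),
    "Deepgram Flux English": Decimal("0.0065"),
    "Deepgram Flux multilingual": Decimal("0.0077"),
}

for name, rate in sorted(rates_per_minute.items(), key=lambda item: item[1]):
    print(f"{name}: ${rate}/min")
```

On those numbers, Modulate's streaming rate works out to $0.001 per minute. The comparison covers list prices only and says nothing about accuracy or latency.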
What separates Modulate from a plain transcription vendor is its Ensemble Listening Model architecture, which analyzes raw audio natively — fusing words with prosody, timbre, stress, emotion, deception, escalation, and synthetic-voice signals across more than 100 component models — instead of running a language model over transcripts after the fact.[5] For voice agents specifically, the Velma Enterprise API adds real-time oversight: detecting when an AI agent makes inaccurate claims or violates policy, feeding live emotion and intent context back to the agent, and recognizing and rerouting AI callers.[6] A fraud-and-safety stack (VoiceVault for live social-engineering and deepfake detection) plus integrations with Twilio Media Streams, Amazon Connect via AWS Marketplace, Five9, Genesys, and standard SIP telephony round out a contact-center-friendly footprint.[7][8]
The honest caveats: Velma Transcribe has only months of track record as a standalone STT API, and the accuracy numbers — 14.9% WER on the AMI Meeting Corpus and roughly 9.35% averaged across Earnings-22 and VoxPopuli — are vendor-published, not independently audited.[3] Modulate publishes no concrete latency figures beyond “sub-second” streaming, documents no on-prem or self-hosted deployment option (it is a cloud API hosted on AWS), and does not publicly document some pure-STT conveniences like custom vocabulary or keyword boosting. The company's DNA is trust and safety rather than transcription — which is either the differentiator or the risk, depending on what you are buying.
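If you want to check a vendor-published WER against your own call audio, the metric itself is simple to compute: word-level edit distance divided by the number of words in the reference transcript. The sketch below is a minimal version. Published benchmarks usually normalize text first (punctuation, numerals, casing) in ways this does not, so treat its output as a rough comparison rather than a reproduction of any vendor's figure.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

A single misheard word in a seven-word utterance, such as "for" in place of "four" in a policy number, already gives a WER of about 14%, which is a useful reminder that an aggregate WER says little about whether the words that matter survived.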
Pros
- The lowest published per-hour STT pricing of any major provider: $0.03/hr batch and $0.06/hr streaming.
- Voice intelligence — emotion, toxicity, fraud, deepfake, and AI-agent oversight — analyzed from raw audio, not transcripts.
- Speaker diarization included free, with 50+ languages and PII/PHI redaction built in.
Cons
- A very new pure-STT entrant (March 2026) with self-reported accuracy benchmarks.
- No published latency figures beyond “sub-second” streaming, and no documented on-prem or self-hosted option.
- Pure-STT conveniences like custom vocabulary and keyword boosting are not publicly documented.
OpenAI Realtime API
OpenAI's Realtime API makes the most sense when transcription sits next to the model itself. The docs cover realtime audio transcription and speech-to-speech use, and they also mention automatic voice activity detection. For a browser-based voice agent, that reduces the number of layers between the mic and the model response.
A concrete fit is a sales or support tool where the team already runs its agent logic in OpenAI. In that setup, the tradeoff is simple: keep transcription, VAD, and response generation in one realtime path, or split them across a separate STT vendor and an orchestration layer. If the application needs a clean STT-only bill of materials, Realtime is probably more than is needed. If the team wants fewer moving parts and is already committed to OpenAI for reasoning, it is easier to justify.
It is the kind of option teams pick when they want to keep the agent logic and transcription in one place, even if that means giving up some of the narrow pricing and specialization you get from a dedicated STT vendor.
Pros
- Realtime transcription and speech-to-speech are both supported.
- Automatic VAD matches common voice-agent flows.
- Keeps transcription and reasoning in one ecosystem.
Cons
- It is broader than a pure STT service.
- Minute-by-minute STT comparisons are harder to isolate.
- It can pull in capabilities the team does not need.
Google Cloud Speech To Text
Google Cloud Speech-to-Text is the most familiar infrastructure-style option in this group. The docs describe a streaming recognition call over a bi-directional stream, which is what most teams expect from a cloud STT API. Google also documents single_utterance=true, which is useful for command flows where the system should stop listening after one clear request.[1]
That makes it a practical fit for a voice command workflow inside an existing GCP stack. A scheduling system, for example, can use Google STT to capture the spoken command and then pass the transcript to its own fulfillment logic. The tradeoff is that the docs reviewed here focus on API mechanics rather than voice-agent packaging, so more of the orchestration ends up in the application.
For teams that want a managed STT service and already know their way around Google Cloud, that tradeoff is manageable. For teams trying to assemble the whole voice loop quickly, it usually means more glue code.
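To show where single_utterance sits in a request, here is a sketch that builds a plain dictionary shaped like the streaming configuration message. The field names follow the v1 streaming config as I understand it from Google's docs, so check them against the current reference before relying on them. The LINEAR16 encoding and 8000 Hz sample rate are assumptions for telephony audio, and the function is my own helper, not part of any Google client library.

```python
import json
from typing import Any, Dict


def build_streaming_config(language_code: str = "en-US",
                           sample_rate_hertz: int = 8000,
                           single_utterance: bool = True) -> Dict[str, Any]:
    """Build a dict shaped like a v1 streaming recognition config (illustrative only)."""
    return {
        "config": {
            "encoding": "LINEAR16",              # assumed telephony-style PCM audio
            "sample_rate_hertz": sample_rate_hertz,
            "language_code": language_code,
        },
        # Ask the service to stop recognizing after one detected utterance.
        "single_utterance": single_utterance,
        # Request partial hypotheses while the caller is still speaking.
        "interim_results": True,
    }


print(json.dumps(build_streaming_config(), indent=2))
```

For a command flow, the application still has to decide what to do once the single utterance ends, which is the glue code the paragraphs above refer to.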
Pros
- Bi-directional streaming fits live capture.
- single_utterance works for command-style audio.
- Easy to slot into an existing GCP setup.
Cons
- Less voice-agent-specific than newer products.
- More of the stack has to be built separately.
- The workflow is infrastructure-first, not agent-first.
Which API Is Best For Your Voice Agent Stack
The cleanest way to choose is to start with the shape of the stack, not the brand name.
- Choose VoiceRun if you do not want to bet on a single STT vendor at all — it orchestrates the agent stack so you can pick the best engine for each use case, swap providers per deployment, and keep deployment, observability, and telephony in the same workflow.
- Choose Deepgram if you want low-latency STT with public pricing and explicit interruption handling.
- Choose AssemblyAI if you want streaming STT with strong developer ergonomics and a path into a bundled voice-agent API.
- Choose Modulate if you want the cheapest published per-hour transcription bundled with voice intelligence — emotion, toxicity, fraud, and AI-agent oversight read straight from the audio.
- Choose OpenAI Realtime API if you are already building the rest of the agent in OpenAI and want transcription, VAD, and model output in one realtime path.
- Choose Google Cloud Speech-to-Text if you need command-style voice capture in GCP and are comfortable assembling the rest of the agent around it.
For voice agents, the common mistake is treating speech-to-text as a standalone purchase. In practice, the transcription layer has to work with turn detection, call transport, response generation, and whatever happens after the transcript arrives. The best fit is usually the vendor that matches the rest of that stack, not just the one with the neatest demo — and if you expect that answer to keep changing as models improve, an orchestration layer that lets you swap engines per deployment beats betting the whole build on any single vendor.
Frequently Asked Questions
What is the difference between a speech-to-text API and a voice agent API?
A speech-to-text API turns audio into text. A voice agent API usually includes STT, turn detection, response generation, and sometimes text-to-speech in one product. For live calls, the second type usually matters more because the transcript is only one step in the loop.
Do voice agents need streaming STT?
Yes, in most cases. Batch transcription is too slow for barge-in, live routing, and natural back-and-forth conversation. Streaming STT lets the agent react before the caller finishes the full utterance.
References
1. Google Cloud Speech-to-Text documentation. https://docs.cloud.google.com/speech-to-text/docs/speech-to-text-requests
2. Modulate Launches Velma Transcribe (press release, March 18, 2026). https://www.wisfarmer.com/press-release/story/44718/modulate-launches-velma-transcribe-high-performance-transcription-for-real-world-conversations-at-90-lower-cost/
3. Speech-to-Text API: Real-Time & Batch Transcription | Velma Transcribe by Modulate. https://www.modulate.ai/api/speech-to-text
4. Modulate API Pricing. https://www.modulate.ai/pricing
5. Modulate | Voice intelligence powered by Velma. https://www.modulate.ai/velma
6. Modulate Expands Velma Platform with Voice-Native Real-Time Conversation Intelligence for Enterprises (June 2026). https://www.modulate.ai/press-releases/modulate-expands-velma-platform-with-voice-native-real-time-conversation-intelligence-for-enterprises
7. Modulate Announces Seamless Integration with Twilio (May 29, 2025). https://www.modulate.ai/press-releases/modulate-integration-twilio
8. Modulate Launches VoiceVault in AWS Marketplace, Now Compatible with Amazon Connect. https://www.modulate.ai/press-releases/modulate-launches-voicevault-in-aws-marketplace-now-compatible-with-amazon-connect
