We've all seen the demos. GPT-4o definitely wasn't meant to be Scarlett Johansson. Sesame's Maya & Miles respond in 300 milliseconds. Everyone's convinced that speech-to-speech is the future of voice agents.
I used to think so too. I watched those demos and felt what you probably felt: the “this changes everything” rush that's so common in the AI era. But after building voice agents in production for a while, I've changed my mind. This isn't just cope. If you look carefully at what actually impressed you in those demos, you'll see why.
What you actually like is the turn-taking
The thing about S2S demos that feels magical is the turn-taking. The model knows when you're done talking. You pause, and it waits. You complete a thought, and it responds quickly. It feels far closer to how we actually communicate. But that isn't an exclusive property of speech-to-speech. It's a property of semantic VAD and fast, well-tuned turn-taking.
Naive voice activity detection is simple: it listens for silence and starts a countdown. Three hundred milliseconds of quiet, and the system starts responding. Stretch that threshold to 1.5 seconds, and the system feels sluggish. That's the bind: tune for natural pauses and you're slow to respond when the user is actually done; tune for snappiness and your agent interrupts the user mid-thought.
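The naive silence-countdown logic is roughly this. A minimal sketch: the 300 ms threshold and 20 ms frame size are illustrative, not any particular vendor's defaults.

```python
# Naive VAD endpointing: declare the turn over after a fixed stretch of silence.
SILENCE_THRESHOLD_MS = 300   # snappy, but clips natural pauses
FRAME_MS = 20                # audio arrives in 20 ms frames

def detect_end_of_turn(frames_are_speech):
    """Return the frame index where the turn is declared over, or None.

    `frames_are_speech` is one boolean per audio frame: True where the
    frame contains speech.
    """
    silence_frames = 0
    for i, is_speech in enumerate(frames_are_speech):
        if is_speech:
            silence_frames = 0
        else:
            silence_frames += 1
            if silence_frames * FRAME_MS >= SILENCE_THRESHOLD_MS:
                return i  # countdown expired: respond now
    return None

# A mid-sentence pause of 320 ms (16 silent frames) falsely ends the turn:
pause = [True] * 10 + [False] * 16 + [True] * 10
print(detect_end_of_turn(pause))  # fires inside the pause, not at the end
```

Shrinking `SILENCE_THRESHOLD_MS` makes the false trigger worse; growing it makes every legitimate end-of-turn slower. No threshold fixes both.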
Semantic VAD is smarter. It uses language understanding, not just silence detection, to predict whether the speaker has actually finished their turn. It's always part of S2S, but only sometimes part of cascade architectures. Deepgram, Gradium, Soniox, 11Labs, and OpenAI all ship semantic endpointing on their streaming STT. Pipecat open-sources Smart Turn. It's a feature you can bolt onto any cascade pipeline, yet many voice agents still lag in adopting it.
There is no reason your cascade architecture can't be as smart about turn-taking as S2S.
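Here's the shape of semantic endpointing in a cascade pipeline. Real systems use a trained turn-completion model (Smart Turn, or a vendor's endpointer); the word-list heuristic below is purely a stand-in to show where that model slots in.

```python
# Semantic endpointing sketch: gate the response on both silence and whether
# the partial transcript reads as a complete thought. The heuristic below is
# a toy stand-in for a trained turn-completion classifier.

INCOMPLETE_ENDINGS = ("and", "but", "so", "because", "um", "uh", "the", "to")

def looks_complete(partial_transcript: str) -> bool:
    """Crude stand-in for a turn-completion model."""
    words = partial_transcript.strip().lower().rstrip(".?!").split()
    return bool(words) and words[-1] not in INCOMPLETE_ENDINGS

def should_respond(partial_transcript: str, silence_ms: int) -> bool:
    # Short silence suffices when the utterance reads as finished;
    # wait much longer when it reads as mid-thought.
    threshold = 300 if looks_complete(partial_transcript) else 1500
    return silence_ms >= threshold

print(should_respond("I'd like to book a flight to", 400))  # mid-thought: wait
print(should_respond("I'd like to book a flight.", 400))    # complete: respond
```

The point is architectural: the decision consumes the streaming transcript your cascade already produces, so nothing about it requires a monolithic audio model.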
Someone might say “well, what about emotional understanding?” S2S models can read tone and sentiment directly from the audio. That's real. But you can get sentiment analysis from other, cheaper models in real time and feed that context into your cascade pipeline. You don't need a monolithic multimodal model to be emotionally aware.
So then, the only edge from S2S is the latency? Well, there are answers to that, too.
Well-designed cascade systems are fast
The strongest argument for S2S has always been speed. Audio capabilities and intelligence all in a single model. Audio in, audio out. And it's true that a naive cascade pipeline is slow. STT to LLM to TTS, each with its own network round-trip, can easily take 2–4 seconds. No user wants that. Even co-locating models is only a marginal improvement at best. But the serious builders I see on VoiceRun aren't running a naive cascade.
There's a principle in engineering that applies here: when possible, compute at build time, not runtime. Pre-compute your most common responses: greetings, confirmations, clarifying questions, acknowledgments. Cache the audio. String-match on the transcription. When a caller says “yes,” you don't need to round-trip through an LLM to know what comes next. Despite what LLM-only voice agent platforms will tell you, you can do this while maintaining a compelling and natural conversation.
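A minimal sketch of that cache-then-fallback path. The transcript keys, audio file paths, and fallback function are hypothetical.

```python
# Precomputed-response sketch: string-match the transcript against cached
# audio before ever touching the LLM. All names and paths are illustrative.

CACHED_RESPONSES = {
    "yes": "audio/confirm_next_step.wav",
    "yeah": "audio/confirm_next_step.wav",
    "no": "audio/offer_alternative.wav",
    "what are your hours": "audio/hours.wav",
}

def normalize(transcript: str) -> str:
    return transcript.lower().strip().rstrip(".?!,")

def run_llm_and_tts(transcript: str) -> str:
    # Placeholder for the full STT -> LLM -> TTS fallback path.
    return "audio/generated.wav"

def respond(transcript: str):
    """Return (audio_path, used_llm) for a caller utterance."""
    key = normalize(transcript)
    if key in CACHED_RESPONSES:
        return CACHED_RESPONSES[key], False   # cache hit: zero inference
    return run_llm_and_tts(transcript), True  # cache miss: full pipeline

print(respond("Yes."))  # cached audio, no LLM round-trip
```

The cache hit costs one dictionary lookup, computed at build time rather than runtime, which is the whole principle.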
In some use cases deployed on VoiceRun, precomputed responses mean we skip the LLM entirely up to 85% of the time. No inference. No audio generation. The response is near-instant and cheaper. So instant, you may even want to add a pause to make it sound more natural.
Comparing S2S and cascade usually means weighing tradeoffs, but in these use cases there's no tradeoff to manage: cascade wins on reliability, latency, and cost simultaneously. And for the other 15%, you can still fall back to an LLM with an optimized prompt, stream TTS as tokens generate, and achieve sub-second response times.
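Streaming TTS as tokens generate usually means flushing complete sentences to the synthesizer as soon as the LLM finishes them, instead of waiting for the full reply. A sketch of that chunking step; the token source and the downstream TTS call are stand-ins for whatever streaming LLM/TTS pair you run.

```python
# Streaming sketch: emit complete sentences to TTS as LLM tokens arrive.
import re

def sentences_from_tokens(token_stream):
    """Yield each sentence as soon as the token stream completes it."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush on sentence-ending punctuation followed by whitespace.
        while True:
            match = re.search(r"[.?!]\s", buffer)
            if not match:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream

tokens = ["Your ", "order ", "shipped. ", "It ", "arrives ", "Friday."]
for sentence in sentences_from_tokens(tokens):
    print(sentence)  # each sentence would go to TTS immediately
```

Time-to-first-audio becomes the latency of the first sentence, not the whole reply, which is how a cascade hits sub-second responses on the LLM path.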
Text is a feature, not a bug
When I believed in S2S as the future, it was because I saw the text layer as overhead. But I've come to learn that it's actually critical for many use cases.
The text layer enables you to inspect, filter, and modify what your agent says before it speaks. With S2S, audio goes directly from model to user. There's no text representation in between, so there's nothing for your code to operate on: you can't run a regex, a policy check, or a redaction pass over a raw waveform. Your guardrails and code-based performance optimizations are all lost. You're stuck with audio and, for now, inconsistent tool-calling.
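This is what that interception point looks like in a cascade: plain code sitting between the LLM and TTS. The blocked-topic rule and rewrite are illustrative; real deployments use richer policy checks.

```python
# Guardrail sketch: inspect and rewrite the LLM's text reply before TTS
# speaks it. The rules here are illustrative examples, not a real policy.
import re

BLOCKED = re.compile(r"\b(ssn|social security number)\b", re.IGNORECASE)

def guard_reply(reply: str) -> str:
    """Filter the LLM's reply before handing it to TTS."""
    if BLOCKED.search(reply):
        return ("I can't discuss that over the phone, "
                "but I can connect you to an agent.")
    # Also normalize things TTS reads badly, e.g. bare URLs.
    return re.sub(r"https?://\S+", "our website", reply)

print(guard_reply("Please read me the SSN on file."))
print(guard_reply("See https://example.com/policy for details."))
```

None of this is exotic. It's just that it requires text to exist between the model and the speaker, which S2S removes.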
Optimizing over the long haul is difficult with S2S: it's hard to build a harness around an S2S model, so the only optimization vector is system prompt tuning. Naive voice agent platforms that run an LLM-only intelligence layer on a cascade architecture suffer the same limitation. In sophisticated cascade systems that use a hybrid LLM + code intelligence layer (like VoiceRun's), it's easy to analyze what went wrong (through transcripts, traces, and logging) and remedy it with real treatments in code.
Then there's tool use. Your agent needs to check a database, call an API, look up a policy. Text makes this trivial. Structured text is how every tool in your stack expects to communicate. S2S is improving here, but it's still way behind. And of course, the obvious enterprise consideration: healthcare, finance, and insurance need auditable logs of what your agent said and why. Text gives you that for free.
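With a text layer, a tool call is just structured output the LLM emits and plain code dispatches, and the JSON itself doubles as the audit record. The tool name and arguments below are hypothetical.

```python
# Tool-use sketch: parse a JSON tool call from LLM text output and dispatch
# it in code. The tool and its arguments are made up for illustration.
import json

def lookup_policy(policy_id: str) -> str:
    return f"Policy {policy_id}: active, renews in 30 days."

TOOLS = {"lookup_policy": lookup_policy}

def dispatch(llm_output: str) -> str:
    """Run the tool call described by the LLM's structured text output."""
    call = json.loads(llm_output)          # the same JSON you log for audit
    return TOOLS[call["tool"]](**call["args"])

print(dispatch('{"tool": "lookup_policy", "args": {"policy_id": "A-123"}}'))
```

Every hop here is inspectable text: what the agent decided, which tool it called, and with what arguments, all loggable for free.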
The pull of efficiency
There's also the question of cost. S2S models are expensive to run. You're pushing every utterance through a massive multimodal model. At scale, it compounds fast.
But the more interesting argument is the long-term one. All things held equal, markets prefer compute efficiency. That's basically the history of infrastructure: doing more with less. Cascade lets you right-size each component independently. Swap in a faster STT, a cheaper LLM, a more efficient TTS, without rebuilding your whole pipeline. S2S is monolithic. You get what you get.
As each component gets cheaper and faster, cascade's modular advantage compounds. You upgrade one piece and keep the rest. That's how winning architectures tend to work.
The bottom line
Speech-to-speech will have its place. Creative applications, emotional companions, and consumer experiences where feel matters more than reliability are all great fits for S2S.
But if you're building voice agents that need to work reliably, controllably, and cost-effectively, cascade is the architecture. And if your cascade feels slow or awkward, the problem isn't the architecture. The problem is the models and design you're putting into it.
Nick Leonard
CEO & Co-Founder, VoiceRun