Telephony Integration For Voice Agents: A 2026 Guide

If your voice agent sounds great in a demo but falls apart on a real phone call, the missing piece is usually telephony. I've spent years covering voice infrastructure and production AI systems, and the difference between a clever prototype and a usable phone agent almost always comes down to how well you connect carriers, SIP, call control, and observability.

This guide walks through telephony integration for voice agents from the ground up: how inbound and outbound calls actually flow, where SIP fits, what production teams need to think about before going live, and how different platform choices change the architecture. I'll also show where BYOT setups, trunking providers, and agent platforms each make sense, including how VoiceRun approaches telephony as part of a code-first production workflow. By the end, you'll know how to connect real phone traffic to a voice agent without guessing.

I test these setups the same way teams do before launch: attach a number, point it at an endpoint, place calls from a mobile phone and a softphone, then check what shows up in the session logs and call records. That usually exposes the weak points fast. A clean demo can hide a long setup path, missing transfer support, or billing that gets messy once calls run longer than a minute or two.

How I Evaluated Telephony Integration For Voice Agents

I looked at telephony integration the way an operator would, not the way a slide deck would describe it. The main questions were simple: how do you attach a number, what protocol carries the audio, where does call state live, and what happens when a call needs to transfer or fail over.

One criterion I kept coming back to was whether the platform lets you trace a live call from the carrier edge into the agent session without guessing. In practice, that means matching a phone number, a call SID or SIP participant ID, and a session record. Twilio, LiveKit, and VoiceRun document pieces of that chain in different ways, which makes them useful to compare side by side.
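Here is a minimal sketch of that trace, assuming carrier call records and agent session records can both be exported as dictionaries that share a call ID. The field names (`call_id`, `to_number`, `session_id`) are hypothetical and do not come from any vendor's schema.

```python
from typing import Iterable


def correlate_calls(carrier_records: Iterable[dict], session_records: Iterable[dict]) -> dict:
    """Join carrier-side call records to agent sessions on a shared call ID.

    Field names are illustrative assumptions, not a vendor schema.
    """
    sessions_by_call = {s["call_id"]: s for s in session_records}
    matched, orphaned = [], []
    for record in carrier_records:
        session = sessions_by_call.get(record["call_id"])
        if session is None:
            # The carrier saw the call, but no agent session carries its ID.
            orphaned.append(record["call_id"])
        else:
            matched.append((record["to_number"], record["call_id"], session["session_id"]))
    return {"matched": matched, "orphaned": orphaned}


carrier = [
    {"call_id": "call-001", "to_number": "+15550100"},
    {"call_id": "call-002", "to_number": "+15550100"},
]
sessions = [{"call_id": "call-001", "session_id": "sess-a"}]
result = correlate_calls(carrier, sessions)
```

Any entry that lands in `orphaned` is a call the carrier handled but the agent platform cannot account for, which is exactly the gap this criterion is meant to catch.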

I also checked one specific production path on each stack: can a test inbound call get from number to agent, then out through a transfer or DTMF branch, without leaving the docs. LiveKit documents REFER-based transfer and SIP participant behavior, Twilio documents SIP interface routing and BYOC, and VoiceRun documents a BYOT flow with Twilio and Telnyx plus a concrete call endpoint shape. If those pieces line up cleanly, the platform is usually easier to operate after launch.

The other thing I watched for was what becomes annoying after the first week in production: whether a vendor wants you to manage your own carrier relationship, whether it exposes SIP clearly, whether the setup is easy to repeat across environments, and whether the logs carry enough detail to debug a bad call without opening a support ticket.

One useful pattern kept showing up in the docs from OpenAI, Twilio, LiveKit, Plivo, and VoiceRun: telephony is a signaling and media problem before it is an AI problem. The agent may handle the conversation, but the phone layer decides how the call gets into the system in the first place and how cleanly it gets out again.^[1]

Telephony Integration For Voice Agents At A Glance

At a high level, telephony integration does four jobs.

It accepts inbound calls from the PSTN or a SIP trunk.
It hands audio and call state to your agent runtime.
It sends control actions back out, such as hang up, transfer, or DTMF.
It keeps enough metadata around for debugging, billing, and routing.

The simplest mental model is: number to carrier, carrier to SIP or webhook endpoint, endpoint to agent, agent back to carrier. In a browser-only demo, you can skip most of that. In production, you usually cannot.

The main architecture choices are:

Direct SIP to a voice-agent endpoint, which is the path OpenAI now documents for Realtime phone calling.
Carrier-first setups, where Twilio, Plivo, Vonage, or a similar provider sits between the public phone network and your app.
Runtime-first setups, where telephony is bridged into an agent framework such as LiveKit.
BYOT setups, where your existing carrier account stays in place and the agent platform plugs into it.

The tradeoff is usually control versus convenience. A direct SIP endpoint can be simpler to reason about. A carrier platform can give you more call-control tools. A runtime-first platform can keep the agent code simpler. A BYOT setup keeps your existing numbers and carrier account in place.

What Telephony Integration Actually Does In Production

In production, telephony is the layer that turns a phone call into a controllable media session. That includes call setup, session handoff, audio transport, and call teardown. The agent itself may live in a separate service, but the telephony layer decides how the media reaches it.

That difference matters because a phone call is not just a stream of audio. It has state. Someone answers, an IVR branch triggers, a transfer happens, a call drops, the callee mutes, a DTMF tone needs to be captured, or the carrier rejects the route. Those are all telephony concerns before they are model concerns.
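One way to make that state explicit is a small state machine. This is a sketch under my own assumptions: the state names and allowed transitions are simplified for illustration and are not taken from SIP or from any platform's event model.

```python
from enum import Enum


class CallState(Enum):
    RINGING = "ringing"
    ANSWERED = "answered"
    TRANSFERRING = "transferring"
    ENDED = "ended"


# Allowed transitions; anything else is treated as a bug or a carrier anomaly.
TRANSITIONS = {
    CallState.RINGING: {CallState.ANSWERED, CallState.ENDED},
    CallState.ANSWERED: {CallState.TRANSFERRING, CallState.ENDED},
    CallState.TRANSFERRING: {CallState.ANSWERED, CallState.ENDED},
    CallState.ENDED: set(),
}


class Call:
    def __init__(self) -> None:
        self.state = CallState.RINGING
        self.history = [CallState.RINGING]

    def advance(self, new_state: CallState) -> None:
        """Move to new_state, rejecting transitions the table does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {new_state.value}")
        self.state = new_state
        self.history.append(new_state)


call = Call()
call.advance(CallState.ANSWERED)
call.advance(CallState.TRANSFERRING)
call.advance(CallState.ENDED)
```

Keeping the transition table in one place means a dropped call or a failed transfer shows up as a rejected transition instead of a silent inconsistency in the agent's view of the call.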

The vendor docs reflect that split. OpenAI lists SIP alongside WebRTC and WebSocket as connection options for Realtime voice agents: SIP for telephony, WebRTC for browser and mobile clients, and WebSocket for server-to-server use.^[2] Twilio, LiveKit, and Plivo all describe similar boundaries between signaling, media, and call control.

For a production team, that means the first design decision is not which model to use. It is where the call enters, how the audio is bridged, and who owns the numbers.

Inbound Call Flow From Number To Agent

Inbound calls usually follow the same sequence.

A customer dials a number.
The carrier or trunk provider receives the call.
The provider forwards the call to your endpoint, SIP trunk, or webhook.
Your agent session starts with the call metadata attached.
Audio flows both ways until the call ends or transfers.
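The step where the session starts with call metadata attached can be sketched roughly as follows. I am assuming a form-encoded webhook body modeled on the `CallSid`, `From`, and `To` parameters that Twilio sends with voice webhooks; other carriers use different names, and the output keys are my own invention.

```python
from urllib.parse import parse_qs


def session_metadata_from_webhook(body: str, environment: str) -> dict:
    """Turn a form-encoded inbound-call webhook body into session metadata."""
    form = {key: values[0] for key, values in parse_qs(body).items()}
    missing = [name for name in ("CallSid", "From", "To") if name not in form]
    if missing:
        raise ValueError(f"webhook body is missing fields: {missing}")
    return {
        "call_id": form["CallSid"],      # carrier-side identifier for this call
        "caller": form["From"],
        "dialed_number": form["To"],
        "environment": environment,      # tag the session so it can be found later
    }


# "CA123" is a shortened placeholder, not a real call SID.
body = "CallSid=CA123&From=%2B15550101&To=%2B15550100"
metadata = session_metadata_from_webhook(body, environment="staging")
```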

VoiceRun's BYOT flow is a good concrete example. Its docs say you create a telephony configuration, attach it to an agent, point your number's inbound calls at VoiceRun's endpoint, then test by calling the number and reviewing the session in the Sessions tab. The inbound endpoint shape is https://api.voicerun.com/v1/agents/<AGENT_ID>/call?environment=<ENVIRONMENT_NAME>, with the environment name in lowercase and no spaces.^[3]
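A small helper can enforce that endpoint shape before the URL is pasted into a carrier console. This is a sketch: the function name and the validation rules are mine, based only on the lowercase, no-spaces requirement quoted above, and `agent_123` is a placeholder ID.

```python
import re
from urllib.parse import quote

BASE_URL = "https://api.voicerun.com/v1/agents"


def inbound_call_url(agent_id: str, environment: str) -> str:
    """Build the inbound call endpoint in the shape quoted above."""
    if not agent_id:
        raise ValueError("agent_id is required")
    # The docs ask for a lowercase environment name with no spaces.
    if not environment or environment != environment.lower() or re.search(r"\s", environment):
        raise ValueError("environment must be non-empty, lowercase, and contain no spaces")
    return f"{BASE_URL}/{quote(agent_id, safe='')}/call?environment={quote(environment, safe='')}"


url = inbound_call_url("agent_123", "production")
```

Failing fast on a name like `Production` or `my env` is cheaper than discovering the mismatch on the first test call.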

That setup is straightforward if you already own the number and the carrier account. The main value is that you do not have to rebuild your numbering strategy just to try a voice agent. VoiceRun supports Twilio, Telnyx, and Infobip on the carrier side, and it also accepts direct SIP connections from an internal PBX or a CCaaS platform, so most existing call stacks can route into an agent without changing carriers.

In practice, the first live call usually reveals one of three issues: the webhook is pointed at the wrong environment, the carrier is still routing to an old endpoint, or the audio path is fine but the session is not tagged well enough to find later.

Outbound Calls, Transfers, And DTMF Handling

Outbound calling is a different problem from inbound routing. The agent or backend needs to initiate the call, wait for answer, then bridge the callee into the conversation. That usually involves a SIP participant, a trunk API, or a provider-specific outbound call action.

Transfers are where a lot of systems get brittle. A human handoff may use SIP REFER, a carrier API, or a queue endpoint depending on the stack. LiveKit documents REFER-based call transfer, warm transfer, caller ID, DTMF, RTP, and SRTP in its telephony layer, which gives a clear sense of the call-control features that matter once the basic connection is working.^[4]

DTMF still matters more than people expect. It is the mechanism behind IVR menus, account lookups, PIN entry, and a lot of verification flows. If your telephony layer drops or delays DTMF, the agent may sound fine and still fail the actual task.
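To make the PIN-entry case concrete, here is a sketch of a digit collector that buffers tones until a terminator or a maximum length is reached. The class and its interface are hypothetical; how digits actually arrive (SIP INFO, RFC 4733 events, or a provider callback) depends on the stack and is outside this example.

```python
class DtmfCollector:
    """Buffer DTMF digits until a terminator or a maximum length is reached."""

    VALID = set("0123456789*#ABCD")  # the 16 standard DTMF symbols

    def __init__(self, max_digits: int, terminator: str = "#") -> None:
        self.max_digits = max_digits
        self.terminator = terminator
        self.digits = []
        self.complete = False

    def feed(self, digit: str) -> bool:
        """Add one digit; return True once the entry is complete."""
        if self.complete:
            return True  # ignore tones that arrive after completion
        if digit not in self.VALID:
            raise ValueError(f"not a DTMF digit: {digit!r}")
        if digit == self.terminator:
            self.complete = True
        else:
            self.digits.append(digit)
            if len(self.digits) >= self.max_digits:
                self.complete = True
        return self.complete

    @property
    def value(self) -> str:
        return "".join(self.digits)


pin = DtmfCollector(max_digits=6)
for tone in "4821#":
    pin.feed(tone)
```

A production version would also need an inter-digit timeout, which is where delayed DTMF from the telephony layer tends to cause failures.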

For outbound work, I look for three things first:

Does the platform let me place calls reliably from the same environment where the agent runs?
Can I see transfer events and hangups in logs without stitching together three systems?
Does the provider preserve enough caller and call metadata to reconstruct a failed session later?

If the answer to any of those is unclear, the stack will feel fragile as soon as real traffic arrives.

SIP, WebRTC, And Carrier Choices Explained

SIP is the protocol that most production telephony integrations end up touching. It handles call setup, modification, and teardown, while RTP or SRTP usually carries the media. The PSTN is the public switched telephone network; it is separate from SIP, and a SIP trunk is what typically bridges the two.

OpenAI's current Realtime SIP guide is a useful reference point here. It describes using a SIP trunking provider to convert phone traffic into IP traffic, then routing that trunk to an OpenAI SIP endpoint with a project-specific SIP URI such as sip:$PROJECT_ID@sip.api.openai.com;transport=tls.^[1] That is a clean example of how a phone call becomes a software session.
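The URI pattern is simple enough to build and sanity-check in a few lines. This is a sketch: the helper names are mine, `proj_example` is a placeholder project ID, and the parser handles only this simple `sip:user@host;param=value` form, not the full SIP URI grammar.

```python
def openai_sip_uri(project_id: str) -> str:
    """Fill the project ID into the SIP URI pattern quoted above."""
    if not project_id or any(ch.isspace() or ch in "@;" for ch in project_id):
        raise ValueError("project_id must be non-empty with no spaces, '@', or ';'")
    return f"sip:{project_id}@sip.api.openai.com;transport=tls"


def parse_sip_uri(uri: str) -> dict:
    """Split a simple sip: URI into user, host, and URI parameters."""
    scheme, _, rest = uri.partition(":")
    if scheme != "sip" or "@" not in rest:
        raise ValueError(f"not a sip: URI with a user part: {uri!r}")
    user, _, host_and_params = rest.partition("@")
    host, *raw_params = host_and_params.split(";")
    # partition("=") yields (key, "=", value); [::2] keeps key and value.
    params = dict(p.partition("=")[::2] for p in raw_params)
    return {"user": user, "host": host, "params": params}


uri = openai_sip_uri("proj_example")
parts = parse_sip_uri(uri)
```

Checking that `parts["params"]["transport"]` is `tls` before saving a trunk configuration catches one of the more common copy-paste mistakes.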

Carrier choice changes the rest of the system more than people expect.

Twilio is often used when teams want programmable voice, SIP interface tools, and BYOC support in the same account structure.
Plivo leans heavily into SIP trunking, with explicit trunk controls, codec support, DTMF handling, and region coverage.
Vonage positions its voice stack around connectors to third-party AI endpoints and SIP workflows.
LiveKit treats telephony as a bridged participant inside its runtime.

The right choice usually depends on whether the team wants the carrier layer close to the app layer or separated from it. As noted earlier, a carrier-first path exposes more call-control knobs, while a runtime-first path keeps more of that work out of the agent code.

How VoiceRun Handles Telephony In A Code-First Workflow

VoiceRun treats telephony as a managed part of the agent workflow, not a side integration. There are two ways in: purchase managed numbers from VoiceRun directly, or bring your own telephony. SIP Trunking is included with the Audio Runtime alongside the Direct WebSocket connector, covering PSTN ingress and egress, so neither path requires an extra telephony add-on. For teams with an existing carrier, the docs describe a BYOT model where you connect that carrier and let the agent make and receive calls using your own numbers, with Twilio and Telnyx as the documented configuration walkthroughs.^[3] VoiceRun also supports Infobip, and teams that run an internal PBX or a CCaaS platform can connect over direct SIP instead of routing through one of those carrier integrations.

The setup flow is practical. You create a telephony configuration in the console, attach it to an agent, route the number to VoiceRun's call endpoint, and then test the call in the Sessions tab. The CLI documentation also shows that environments, releases, entrypoints, variables, secrets, and phone numbers are part of the same code-first workflow, which makes telephony feel like a deployable resource rather than a one-off wiring task.^[6]

A few pricing details matter for anyone modeling real usage. VoiceRun does not sell a bundled per-minute rate. The platform is priced per layer: Audio Runtime is $0.015 per minute and includes both connectors, SIP Trunking and Direct WebSocket, and Agent Runtime is another $0.015 per minute, so the full platform is $0.030 per minute, with volume discounts down to a $0.015 per minute floor. Telephony and model costs are pass-through at published rates: you can bring your own carrier and provider keys and pay those vendors directly, or let VoiceRun manage them with an explicit surcharge that shows up on the bill. That separation matters once calls stop being short demos and start being actual customer interactions.^[5]
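For a rough monthly estimate, the per-layer arithmetic looks like this. The two platform rates are the list prices quoted above; the carrier and model rates in the example call are made-up placeholders, and the volume discount is deliberately left out.

```python
from decimal import Decimal

AUDIO_RUNTIME_PER_MIN = Decimal("0.015")
AGENT_RUNTIME_PER_MIN = Decimal("0.015")


def monthly_cost(minutes: int, carrier_per_min: Decimal, model_per_min: Decimal) -> dict:
    """Estimate monthly spend at list price, before any volume discount."""
    platform = (AUDIO_RUNTIME_PER_MIN + AGENT_RUNTIME_PER_MIN) * minutes
    passthrough = (carrier_per_min + model_per_min) * minutes
    return {"platform": platform, "passthrough": passthrough, "total": platform + passthrough}


# Carrier and model rates below are placeholders, not quoted prices.
estimate = monthly_cost(10_000, carrier_per_min=Decimal("0.010"), model_per_min=Decimal("0.050"))
```

At 10,000 minutes the platform layers come to 300 dollars, and with these placeholder pass-through rates the total is 900 dollars. The point of the split is that the pass-through portion changes with your carrier and model choices while the platform portion does not.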

The main appeal of that setup is operational. If the number, environment, and session are all part of the same workflow, the team has fewer places to lose track of a call. The limitation is also clear: when you bring your own carrier, the quality of the telephony layer still depends on how well that carrier is configured.

Production Checklist Before You Go Live

Before a live rollout, I would check the following.

Confirm the number points at the right environment and endpoint.
Place test calls from at least two devices or routes.
Verify transfers and DTMF before opening traffic.
Check that call logs and session logs line up by ID or timestamp.

If the stack uses a SIP trunk, check codec support, signaling transport, and authentication mode before launch. LiveKit and Plivo both document limits and transport options that are easy to overlook until the first failed call.
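That check can be written down as a small preflight function. Everything here is an assumption for illustration: the `TrunkConfig` fields, the auth mode names, and the platform capability sets, which you would fill in from your provider's documentation.

```python
from dataclasses import dataclass


@dataclass
class TrunkConfig:
    codecs: tuple
    transport: str   # for example "udp", "tcp", or "tls"
    auth_mode: str   # hypothetical names: "ip_allowlist" or "credentials"


def preflight(trunk: TrunkConfig, platform_codecs: set, platform_transports: set) -> list:
    """Return a list of problems; an empty list means no mismatch was found."""
    problems = []
    if not set(trunk.codecs) & platform_codecs:
        problems.append("no codec in common with the platform")
    if trunk.transport not in platform_transports:
        problems.append(f"transport {trunk.transport!r} is not accepted by the platform")
    if trunk.auth_mode not in {"ip_allowlist", "credentials"}:
        problems.append(f"unknown auth mode {trunk.auth_mode!r}")
    return problems


trunk = TrunkConfig(codecs=("PCMU", "opus"), transport="udp", auth_mode="credentials")
issues = preflight(trunk, platform_codecs={"PCMU", "PCMA"}, platform_transports={"tcp", "tls"})
```

In this example the codecs overlap on PCMU and the auth mode is recognized, so the only reported problem is the UDP transport.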

When comparing platforms, the deciding factor is usually not the model. It is how much telephony control the platform gives you without forcing you to rebuild your call stack. Some teams want a direct SIP endpoint, some want carrier tooling, and some want a runtime that treats phone calls as first-class participants. VoiceRun sits closer to the last of those than its BYOT docs alone suggest: SIP Trunking and Direct WebSocket are included with the Audio Runtime, managed numbers are available if you have no carrier relationship, and BYOT keeps the carrier in your hands while making the call path part of the agent workflow. An internal PBX or a CCaaS platform can also connect over direct SIP without changes to the rest of the call stack.

The most reliable pattern I have seen is still boring: keep the number ownership clear, keep the routing simple, and make sure every live call leaves a trace you can actually follow later.

References

[1] OpenAI Realtime SIP guide: https://platform.openai.com/docs/guides/realtime-sip
[2] OpenAI Realtime docs: https://platform.openai.com/docs/guides/realtime
[3] VoiceRun telephony docs (Bring Your Own Telephony): https://docs.voicerun.com/integrations/telephony
[4] LiveKit telephony docs: https://docs.livekit.io/agents/start/telephony/
[5] VoiceRun pricing: https://voicerun.com/pricing/
[6] VoiceRun CLI resource management docs: https://docs.voicerun.com/voicerun-cli/resource-management