Latency Metrics

The Agent Debugger surfaces five per‑turn latency metrics that cover the path from the moment a user starts speaking to the moment response audio plays back. View them in the "Turn Latencies" tooltip. End-to-End Turn Taking (★) is the most user‑perceptible of the five.

Agent Debugger: Turn-by-turn view with the "Turn Latencies" tooltip.

Time to First Transcription

Time from when the user starts speaking until the first interim or final transcription arrives from STT. This metric is independent of End‑to‑End Turn Taking and is not one of its components. The timer persists across interruptions within the same turn and resets at turn boundaries and on session stop.

How it's measured: Timer starts on speech start; records on the first InterimTranscriptionFrame or TranscriptionFrame received. The timer survives interruptions (barge‑ins) within the turn so the first transcription still counts; it resets at turn end and session stop to prevent cross‑turn contamination. Safeguards skip negative latency values; if no transcription arrives before the turn ends, this metric is omitted for that turn.
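The measurement rules above can be sketched as a small state machine. This is an illustrative sketch only, not the debugger's implementation: the class name `FirstTranscriptionTimer` and its method names are hypothetical, timestamps are passed in explicitly as seconds, and ignoring a second speech start within the same turn is an assumption.

```python
from typing import Optional


class FirstTranscriptionTimer:
    """Hypothetical tracker for Time to First Transcription."""

    def __init__(self) -> None:
        self._speech_start: Optional[float] = None
        self.latency: Optional[float] = None

    def on_speech_start(self, now: float) -> None:
        # Assumption: only the first speech start in a turn starts the timer.
        if self._speech_start is None:
            self._speech_start = now

    def on_interruption(self) -> None:
        # Barge-ins deliberately leave the timer untouched.
        pass

    def on_transcription(self, now: float) -> None:
        # Call for the first interim or final transcription frame.
        if self._speech_start is None or self.latency is not None:
            return
        elapsed = now - self._speech_start
        if elapsed < 0:
            return  # safeguard: skip negative latency values
        self.latency = elapsed

    def on_turn_end(self) -> Optional[float]:
        # Return the turn's value (None means "omitted") and reset,
        # so nothing leaks into the next turn.
        result = self.latency
        self._speech_start = None
        self.latency = None
        return result
```

In this sketch a turn that ends without any transcription returns `None`, which corresponds to the metric being omitted for that turn.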

Why it matters: Fast first transcription enables quicker backchanneling and earlier LLM decisions. Because the timer survives barge‑ins and resets between turns, the value stays comparable across conversation patterns such as interrupted and multi‑turn exchanges.

Time to First Speech Event

Latency from when the handler receives the user input (TextEvent) to when it produces its first speech event.

How it's measured: Captured on the first Text-to-Speech event produced by your handler each turn.

Why it matters: Good proxy for the LLM/handler's prompt‑processing and thinking time before speech starts. Component of End‑to‑End Turn Taking.

Time to First Audio

Time from TTS start to the first audio frame streamed to the listener.

How it's measured: Starts at TTS start; recorded on the first TTS audio frame for the turn.

Why it matters: Indicates TTS startup/streaming latency that affects perceived snappiness. Component of End‑to‑End Turn Taking.

End-to-End Turn Taking ★

The overall time from when the user finishes speaking to the first audio frame streamed to the listener.

How it's measured: Starts at speech stop; ends at the first TTS audio frame emitted to the listener.

Why it matters: Represents perceived responsiveness after a user finishes talking. This is the most user‑perceptible metric. Approximate relation: End‑to‑End Turn Taking ≈ Time to First Speech Event + Time to First Audio (+ small pipeline/transport overhead).
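The approximate relation above can be used to estimate how much of a turn is spent outside the two component metrics. The helper below is a hypothetical sketch (the function name is not part of the debugger), and the numbers in the usage line are made up for illustration, not measurements.

```python
def pipeline_overhead_ms(
    end_to_end_ms: float,
    first_speech_event_ms: float,
    first_audio_ms: float,
) -> float:
    """Residual time not explained by the two component metrics."""
    return end_to_end_ms - (first_speech_event_ms + first_audio_ms)


# Illustrative values only: 900 ms end to end, 600 ms to the first
# speech event, 250 ms to the first audio frame.
overhead = pipeline_overhead_ms(900, 600, 250)
```

A residual that is large relative to the two components would suggest looking at pipeline or transport stages rather than the handler or TTS.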

Function Runtime

Duration of the handler's work for the turn (end-to-end function time).

How it's measured: Recorded when the turn ends, shown as the duration chip in the debugger.

Why it matters: Helps identify slow logic, blocking I/O, or long-running tool calls.
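When Function Runtime is high, timing individual steps inside the handler can help narrow down which one is slow. The following is a minimal sketch using only the Python standard library; the `timed` helper and the `"lookup"` label are hypothetical and not part of the debugger.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, sink: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block, in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failures are still timed.
        sink[label] = time.perf_counter() - start


timings: Dict[str, float] = {}
with timed("lookup", timings):
    time.sleep(0.01)  # stand-in for a tool call or database query
```

After the block runs, `timings` maps each label to its duration, which can be logged next to the turn for comparison with the duration shown in the debugger.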

Tips to Improve Latency

  • Use streaming TTS and stream partial LLM responses (speak as you think).
  • Trim prompts and tool output; cache static opening lines in a TTS cache.
  • Choose STT models optimized for first‑token speed if backchanneling is important.
  • Avoid long blocking I/O in your handler; make external calls concurrent when possible.
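As an illustration of the last tip, independent external calls can be issued concurrently with `asyncio.gather` instead of awaited one after another. This is a sketch under assumptions: `fetch_profile`, `fetch_inventory`, and `gather_context` are hypothetical stand-ins for whatever your handler calls, and `asyncio.sleep` simulates network latency.

```python
import asyncio
from typing import Tuple


async def fetch_profile(user_id: str) -> dict:
    await asyncio.sleep(0.05)  # simulated network call
    return {"user_id": user_id}


async def fetch_inventory(sku: str) -> dict:
    await asyncio.sleep(0.05)  # simulated network call
    return {"sku": sku, "in_stock": True}


async def gather_context(user_id: str, sku: str) -> Tuple[dict, dict]:
    # Both calls run concurrently, so the wait is roughly the slower
    # of the two rather than their sum.
    profile, inventory = await asyncio.gather(
        fetch_profile(user_id),
        fetch_inventory(sku),
    )
    return profile, inventory
```

Calling `asyncio.run(gather_context("u1", "s1"))` returns both results in the order the coroutines were passed to `asyncio.gather`.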