Realtime Voice Agent

Echo

The goal — what I'm exploring

Echo is my deep-dive into the part of voice AI that isn't the model: the roughly 800 ms perceived-latency budget and barge-in turn-taking. The question I wanted to answer: what does it actually take to make a spoken AI conversation feel alive instead of like a 1970s answering machine? The finding that made it click is that the fix is structural (overlap the pipeline stages with a sentence chunker, and model turn-taking as an explicit state machine), not "buy a faster model."

How it uses AI

Echo runs on Google Gemini. The text-and-tools model (gemini-3.1-flash-lite by default) streams tokens over Server-Sent Events and calls functions (weather, time-zone, web search) mid-conversation. Speech-to-text and text-to-speech run in the browser via the Web Speech API, so Echo's own backend never handles audio; only text goes to the model. A second "Live" engine swaps in Gemini's native Live API for full speech-to-speech over a WebSocket. The model is the easy part; the orchestration around it (the sentence chunker that starts talking on the first sentence, the turn-taking state machine, the barge-in handling) is where the real engineering lives.
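
To make the sentence-chunker idea concrete, here is a minimal sketch of one way it could work. It is not Echo's actual code: the SentenceChunker class, its push/flush methods, and the chunkStream helper are hypothetical names, and the boundary rule (terminal punctuation followed by whitespace) is an assumed simplification that would split too early on abbreviations such as "Dr. Smith".

```typescript
// Hypothetical sketch: buffers streamed tokens and emits each sentence
// as soon as it is complete, so speech can begin before the reply ends.
class SentenceChunker {
  private buffer = "";

  // Feed one streamed token; returns any sentences it completed.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // Assumed rule: a sentence ends at ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(0, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      this.buffer = this.buffer.slice(end);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call once the stream ends to emit whatever text is left over.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}

// Runs a whole token stream through the chunker, in order.
function chunkStream(tokens: Iterable<string>): string[] {
  const chunker = new SentenceChunker();
  const out: string[] = [];
  for (const token of tokens) out.push(...chunker.push(token));
  out.push(...chunker.flush());
  return out;
}
```

In a browser, each sentence returned by push could be handed straight to speechSynthesis.speak(new SpeechSynthesisUtterance(sentence)), which queues utterances in order, so the first sentence plays while later tokens are still arriving.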

How it works

  • Closes the whole spoken loop — mic → transcribe → think → speak — and lets you cut Echo off mid-sentence. Barge-in interruption is what makes it feel like a conversation instead of dictation.
  • The hard part is the ~800ms latency budget, not the model: a sentence chunker starts text-to-speech on the first complete sentence while the rest of the reply is still generating.
  • Gemini answers stream token-by-token over Server-Sent Events and can call weather, time-zone, and web-search tools mid-reply.
  • Turn-taking is a single explicit state machine (idle → listening → thinking → speaking → barge-in), so the many out-of-order async voice events become no-ops instead of bugs.
  • Two voice engines behind one toggle: a hand-built Classic pipeline (browser STT/TTS) and Gemini’s native Live API — PCM speech-to-speech over WebSocket, connected with single-use ephemeral tokens.
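  • Zero-setup demo on a shared key with a usage meter, plus bring-your-own-key for unlimited use. In the Classic pipeline, STT/TTS run through the browser's Web Speech API, so Echo's server never handles audio (some browsers, such as Chrome, do send recognition audio to their own speech service).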
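
The turn-taking state machine from the list above can be sketched as a transition table. The five state names come from the description; the event names, the return edges (speaking back to idle, barge-in back to listening), and the nextTurnState function are assumptions made for illustration, not Echo's actual implementation.

```typescript
type TurnState = "idle" | "listening" | "thinking" | "speaking" | "barge-in";

// Hypothetical event names; the real app's events may differ.
type TurnEvent =
  | "MIC_START"        // user opened the mic
  | "FINAL_TRANSCRIPT" // speech-to-text produced a final result
  | "FIRST_SENTENCE"   // first chunk of the reply is ready to speak
  | "USER_SPOKE"       // user talked over the agent
  | "TTS_STOPPED"      // playback was cancelled after a barge-in
  | "TTS_DONE"         // playback finished naturally
  | "RESET";

// Only the transitions listed here are legal in each state.
const transitions: Record<TurnState, Partial<Record<TurnEvent, TurnState>>> = {
  idle: { MIC_START: "listening" },
  listening: { FINAL_TRANSCRIPT: "thinking", RESET: "idle" },
  thinking: { FIRST_SENTENCE: "speaking", RESET: "idle" },
  speaking: { USER_SPOKE: "barge-in", TTS_DONE: "idle", RESET: "idle" },
  "barge-in": { TTS_STOPPED: "listening", RESET: "idle" },
};

// Pure reducer: an event that is not legal in the current state is
// ignored, so late or out-of-order callbacks leave the state unchanged.
function nextTurnState(state: TurnState, event: TurnEvent): TurnState {
  return transitions[state][event] ?? state;
}
```

With this shape, a stale TTS_DONE that arrives after the user has already interrupted finds no entry under "barge-in" or "listening" and is dropped, which is the no-op behaviour the list describes.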

The stack

Realtime voice UI — a turn-state orb, streamed live captions, 6 personas, and a first-class typed fallback
Next.js 16 App Router on React 19, end-to-end type-safe
React