The goal — what I'm exploring
Echo is my deep-dive into the part of voice AI that isn't the model — the ~800 milliseconds of perceived latency and the barge-in turn-taking. The question I wanted to answer: what does it actually take to make a spoken AI conversation feel alive instead of like a 1970s answering machine? The finding that made it click is that the fix is structural — overlap the pipeline stages with a sentence chunker, model turn-taking as an explicit state machine — not "buy a faster model."
How it uses AI
Echo runs on Google Gemini. The text-and-tools model (gemini-3.1-flash-lite by default) streams tokens over Server-Sent Events and calls functions (weather, time-zone, web search) mid-conversation. Speech-to-text and text-to-speech go through the browser's Web Speech API, so in this mode the model request, plus any tool calls it triggers, is the only traffic Echo itself sends over the network. One caveat: some browsers, Chrome among them, implement speech recognition with a server-side engine, so the browser may still send audio to its vendor. A second "Live" engine swaps in Gemini's native Live API for full speech-to-speech over a WebSocket. The model is the easy part; the real engineering lives in the orchestration around it: the sentence chunker that starts talking on the first sentence, the turn-taking state machine, and the barge-in handling.
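
To make the sentence-chunker idea concrete, here is a minimal sketch of how streamed tokens could be cut into speakable sentences. It is not Echo's actual code: the class name `SentenceChunker`, its `push` and `flush` methods, and the boundary rule (terminal punctuation followed by whitespace) are all assumptions made for illustration.

```typescript
// Hypothetical sketch: SentenceChunker, push, and flush are illustrative
// names, not Echo's real API.
class SentenceChunker {
  private buffer = "";

  // Feed one streamed token; returns every sentence that token completed.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // Assumed boundary rule: ., !, or ? (plus optional closing quote or
    // bracket) followed by whitespace. Waiting for the whitespace avoids
    // cutting "3.14" in half while the next token is still in flight.
    const boundary = /[.!?]+["')\]]*\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(start, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      start = end;
    }
    // Keep the unfinished tail for the next token.
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call once the stream ends to emit whatever text is left over.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

In a browser, each returned sentence could be handed straight to `speechSynthesis.speak(new SpeechSynthesisUtterance(sentence))`, so speech starts while later tokens are still arriving. A rule this simple will split after abbreviations such as "Dr. Smith"; a real chunker would likely need an exception list.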
How it works
- Closes the whole spoken loop — mic → transcribe → think → speak — and lets you cut Echo off mid-sentence. Barge-in interruption is what makes it feel like a conversation instead of dictation.
- The hard part is the ~800ms latency budget, not the model: a sentence chunker starts text-to-speech on the first complete sentence while the rest of the reply is still generating.
- Gemini answers stream token-by-token over Server-Sent Events and can call weather, time-zone, and web-search tools mid-reply.
- Turn-taking is a single explicit state machine (idle → listening → thinking → speaking → barge-in), so the many out-of-order async voice events become no-ops instead of bugs.
- Two voice engines behind one toggle: a hand-built Classic pipeline (browser STT/TTS) and Gemini’s native Live API — PCM speech-to-speech over WebSocket, connected with single-use ephemeral tokens.
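- Zero-setup demo on a shared key with a usage meter, plus bring-your-own-key for unlimited use. In the Classic engine, STT/TTS go through the browser, so Echo's own code never uploads your audio; the Live engine streams audio to Gemini by design, and some browsers (Chrome, for one) run speech recognition on a vendor server.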
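
The turn-taking bullet above can be sketched as a transition table. The five states come from the list; the event names (`MIC_START`, `FINAL_TRANSCRIPT`, and so on) and the exact transitions are assumptions for illustration, not Echo's real event set. The point is that any (state, event) pair missing from the table leaves the state unchanged, which is how a late or out-of-order event becomes a no-op.

```typescript
// The five states named in the list above.
type State = "idle" | "listening" | "thinking" | "speaking" | "barge-in";

// Hypothetical event names, invented for this sketch.
type VoiceEvent =
  | "MIC_START"
  | "FINAL_TRANSCRIPT"
  | "FIRST_SENTENCE"
  | "SPEECH_END"
  | "USER_SPEECH"
  | "TTS_CANCELLED";

// Assumed transitions: only the pairs listed here change the state.
const transitions: Record<State, Partial<Record<VoiceEvent, State>>> = {
  idle: { MIC_START: "listening" },
  listening: { FINAL_TRANSCRIPT: "thinking" },
  thinking: { FIRST_SENTENCE: "speaking" },
  speaking: { SPEECH_END: "idle", USER_SPEECH: "barge-in" },
  "barge-in": { TTS_CANCELLED: "listening" },
};

// Unlisted (state, event) pairs fall through to the current state,
// so stray async events are ignored rather than handled as special cases.
function nextState(state: State, event: VoiceEvent): State {
  return transitions[state][event] ?? state;
}

// Convenience helper: replay a sequence of events from a starting state.
function run(events: VoiceEvent[], start: State = "idle"): State {
  let state = start;
  for (const event of events) {
    state = nextState(state, event);
  }
  return state;
}
```

For example, a `SPEECH_END` that arrives while the machine is already in `thinking` (say, from a cancelled utterance of the previous turn) has no entry in the table, so `nextState` returns `thinking` unchanged.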
