Office Hours — What's the best approach for real-time, local text-to-speech when building live voice interaction features?

Real-time TTS is the constraint that kills most voice projects. You’re caught between latency (users hate waiting for speech to start), quality (robotic voices tank engagement), and resource footprint (running inference on-device is brutal). The approach splits into two camps depending on where your users are.

On-Device vs. Streamed: The Actual Tradeoff

On-device inference buys you zero network latency and offline capability, but the models that fit in a mobile memory budget sound noticeably flatter than the best server-hosted voices. Streaming from a server gives you better quality voices and a lighter on-device footprint, but you’re betting on network consistency and paying per request. Most teams building live voice interaction end up hybrid: stream for quality voices in normal conditions, fall back to lightweight on-device for reliability when the network hiccups.

The math is blunt. A decent streaming TTS service costs somewhere between a few dollars and the low tens of dollars per 1M characters, depending on the provider and voice tier. On-device models like Glow-TTS or FastSpeech 2 carry no per-request fee (you pay only in device compute) but produce noticeably flatter, less natural speech that users perceive as less trustworthy in a voice interaction. For a production voice agent that needs to sound competent, streaming wins unless you’re explicitly targeting offline-first use cases.

Streaming TTS in Production

If you’re building a live voice interaction feature (conversational AI, live transcription follow-up), use a streaming TTS API with chunked output. The pattern is: accumulate LLM tokens into a short phrase, then start streaming audio for that phrase while the model generates the next one. Single tokens are too small to synthesize with natural prosody, so the phrase is the practical unit. Google’s Gemini 3.1 Flash TTS integrates directly into their inference pipeline for this reason, and it works. You’re not waiting for the entire response to generate before audio starts playing.
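The token-to-phrase step can be sketched as a small async generator. This is a minimal illustration, not any provider’s API: `phrase_chunks`, `BOUNDARIES`, `min_chars`, and the `fake_llm` stand-in are all hypothetical names, and the punctuation set is an assumption you would tune per language.

```python
import asyncio
from typing import AsyncIterator

# Punctuation treated as a safe place to cut a phrase (an assumption).
BOUNDARIES = ".!?,;:"


async def phrase_chunks(tokens: AsyncIterator[str], min_chars: int = 20) -> AsyncIterator[str]:
    """Group LLM tokens into short phrases a TTS engine can speak naturally."""
    buf = ""
    async for tok in tokens:
        buf += tok
        stripped = buf.rstrip()
        # Flush once the phrase is long enough and ends on a boundary character.
        if stripped and len(stripped) >= min_chars and stripped[-1] in BOUNDARIES:
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush the remainder when the LLM finishes


async def fake_llm() -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM client."""
    for tok in ["Sure", ",", " I", " can", " help", " with", " that", ".", " What", " city", "?"]:
        yield tok


async def collect() -> list:
    return [p async for p in phrase_chunks(fake_llm())]


phrases = asyncio.run(collect())
print(phrases)
```

A smaller `min_chars` gets audio started sooner at the cost of choppier prosody; a larger one does the opposite.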

The tricky part is handling interruption. Users in voice interfaces expect to cut off the AI mid-sentence. This means you need to buffer audio output separately from generation, which is fine until the network hiccups and you’ve queued 3 seconds of speech that’s now stale. Real systems add a jitter buffer and discard pending audio when the user interrupts, then immediately start generating new speech for the new context.

Here’s a concrete pattern:

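# Streaming TTS with interrupt handling
import asyncio


class StreamingVoiceOutput:
    def __init__(self, tts_client, buffer_size_ms=500):
        # tts_client.stream_bytes(text) is expected to be an async generator
        # that yields audio frames as they are synthesized.
        self.tts = tts_client
        self.audio_queue = asyncio.Queue()
        self.buffer_time = buffer_size_ms / 1000  # playback jitter buffer, in seconds
        self.current_task = None

    def start(self, text_chunks):
        """Run stream_speech as a background task so interrupt() can cancel it"""
        self.current_task = asyncio.create_task(self.stream_speech(text_chunks))
        return self.current_task

    async def stream_speech(self, text_chunks):
        """Stream TTS for incoming text chunks from LLM"""
        async for chunk in text_chunks:
            # Forward frames as they arrive, don't wait for full audio
            async for audio_frame in self.tts.stream_bytes(chunk):
                await self.audio_queue.put(audio_frame)

    async def interrupt(self):
        """Clear pending audio, stop current TTS task"""
        task, self.current_task = self.current_task, None
        if task is not None:
            task.cancel()
            # Wait for the cancellation to land so no frame arrives after the drain
            await asyncio.gather(task, return_exceptions=True)
        # Drain queue
        while not self.audio_queue.empty():
            try:
                self.audio_queue.get_nowait()
            except asyncio.QueueEmpty:
                break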

This is barebones but captures the core: you’re streaming audio chunks into a queue while the LLM is still generating, and when the user interrupts, you cancel the in-flight TTS work and discard pending audio. Telling the LLM itself to stop generating is a separate signal that this sketch leaves to the caller. The latency between user interrupt and audio cutoff matters here. Aim for under 200ms.
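The playback side of that queue is where the jitter buffer lives. Here is a minimal sketch of the simplest form, a startup prebuffer that holds back the first frames until enough audio is queued. It assumes fixed 20ms frames, a `None` sentinel marking end of stream, and a hypothetical `play_frame` callback standing in for the audio device; interruption is done by cancelling the task running this coroutine.

```python
import asyncio

FRAME_MS = 20  # assumed duration of one audio frame, in milliseconds


async def play_with_prebuffer(queue, play_frame, prebuffer_ms=100):
    """Hold playback until prebuffer_ms of audio is queued, then play through."""
    needed = max(1, prebuffer_ms // FRAME_MS)
    held = []
    played = 0
    while True:
        frame = await queue.get()
        if frame is None:  # sentinel: producer has finished
            break
        held.append(frame)
        if played == 0 and len(held) < needed:
            continue  # still filling the prebuffer
        for f in held:
            play_frame(f)
        played += len(held)
        held.clear()
    # Utterances shorter than the prebuffer still get played here.
    for f in held:
        play_frame(f)
    return played + len(held)


async def demo():
    queue = asyncio.Queue()
    out = []
    for i in range(7):
        queue.put_nowait(f"frame-{i}".encode())
    queue.put_nowait(None)
    total = await play_with_prebuffer(queue, out.append, prebuffer_ms=100)
    return total, out


count, played_frames = asyncio.run(demo())
print(count, len(played_frames))
```

A production jitter buffer would also rebuffer after an underrun; this sketch only covers the startup case.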

Network Fallback and Degradation

Real-time voice apps fail spectacularly if they can’t gracefully degrade. If your streaming TTS endpoint goes down, your voice interaction becomes unusable. Use a fallback strategy: primary streaming service (Google, OpenAI API, or whatever), secondary lightweight on-device model, tertiary text-only output if all else fails.
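A minimal sketch of that three-tier chain follows. The backends are hypothetical async generators (`cloud_tts` simulates an outage, `local_tts` stands in for an on-device model), and the function buffers one utterance per backend for simplicity; a real implementation would stream frames and only fall back if the failure happens before the first frame is played.

```python
import asyncio


async def speak_with_fallback(text, backends, on_text_only):
    """Try each TTS backend in order; fall back to text-only if all fail."""
    for name, synth in backends:
        frames = []
        try:
            async for frame in synth(text):
                frames.append(frame)
            return name, frames
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            print(f"{name} failed: {exc!r}")
    on_text_only(text)  # tertiary path: show the text instead of speaking it
    return "text", []


async def cloud_tts(text):
    """Hypothetical primary backend that simulates an outage."""
    raise ConnectionError("endpoint unreachable")
    yield b""  # unreachable; its presence makes this an async generator


async def local_tts(text):
    """Hypothetical on-device backend: one fake frame per word."""
    for word in text.split():
        yield word.encode()


shown = []
used, audio = asyncio.run(
    speak_with_fallback("hello there", [("cloud", cloud_tts), ("local", local_tts)], shown.append)
)
used_all_down, audio_all_down = asyncio.run(
    speak_with_fallback("hello there", [("cloud", cloud_tts)], shown.append)
)
print(used, used_all_down, shown)
```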

Network latency itself is the killer. If you’re in a market with flaky connectivity, even a fast streaming API becomes a bottleneck. Some teams pre-render and cache TTS audio for common responses (greetings, confirmations, error messages). It’s not elegant, but it removes the network round trip entirely for the most frequent paths.
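A pre-rendered phrase cache can be as simple as a dictionary keyed by normalized text. In this sketch `PhraseCache` and `fake_synthesize` are hypothetical; the real `synthesize` callable would be whatever blocking TTS call you run at build or startup time.

```python
import hashlib


class PhraseCache:
    """Holds pre-rendered audio for frequent responses."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # assumed callable: text -> audio bytes
        self._audio = {}

    @staticmethod
    def _key(text):
        # Normalize so trivial whitespace or case differences still hit the cache.
        return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()

    def warm(self, phrases):
        """Render every phrase once, ahead of time."""
        for phrase in phrases:
            self._audio[self._key(phrase)] = self._synthesize(phrase)

    def get(self, text):
        """Return cached audio, or None so the caller can stream instead."""
        return self._audio.get(self._key(text))


calls = []


def fake_synthesize(text):
    """Hypothetical stand-in for a TTS call; records how often it runs."""
    calls.append(text)
    return f"<audio:{text}>".encode()


cache = PhraseCache(fake_synthesize)
cache.warm(["Hello!", "Sorry, I didn't catch that."])
hit = cache.get("  hello! ")
miss = cache.get("What's the weather in Lisbon?")
print(hit is not None, miss is None, len(calls))
```

A cache miss returns `None` rather than synthesizing on the spot, which keeps the decision to stream or fall back in one place.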

Local Models and When They Make Sense

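If you absolutely need on-device, use Piper (open-source, trained on public data, reasonable quality for a footprint of roughly 200MB) or Kokoro (newer, better voices). Both run entirely on the device, so the only cost is local compute. Piper gives you maybe 3-4 seconds of speech per second of compute on mobile hardware, which is comfortably faster than real time. Kokoro sounds better, but it is a heavier model and may not keep up with real time on low-end devices, so benchmark it on your target hardware.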
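The usual way to express “seconds of speech per second of compute” is the real-time factor (RTF): compute time divided by audio duration, where anything below 1.0 keeps up with playback. A small measurement sketch, assuming the engine returns 16-bit mono PCM; `fake_synthesize` is a hypothetical stand-in, not Piper’s or Kokoro’s API.

```python
import time


def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means synthesis is faster than playback."""
    return compute_seconds / audio_seconds


def measure_rtf(synthesize, text: str, sample_rate: int = 22050) -> float:
    """Time one synthesis call that returns 16-bit mono PCM bytes."""
    start = time.perf_counter()
    pcm = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(pcm) / (2 * sample_rate)  # 2 bytes per sample
    return real_time_factor(elapsed, audio_seconds)


def fake_synthesize(text: str) -> bytes:
    """Hypothetical local engine: returns one second of silence."""
    return b"\x00\x00" * 22050


# 4 seconds of speech from 1 second of compute
print(real_time_factor(1.0, 4.0))
print(measure_rtf(fake_synthesize, "hello"))
```

Run the measurement on the slowest device you intend to support, with sentences of realistic length, since RTF often shifts with input size.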

The honest take: local TTS makes sense for offline-first apps (maps, accessibility features without internet), but if you have network access, streaming will generally sound better and put less load on the device. Making on-device the primary path just to save per-character API fees is rarely worth the engineering complexity unless you’re in a very high-volume scenario.

Configuration for Low-Latency Streaming

If you go streaming, configure aggressively for latency:

Stream in short chunks (250ms of audio target, not full sentences)
Use the lowest-latency voice option the provider offers (they all have a speed vs. quality slider)
Enable audio streaming at the API level (play frames as they arrive, not after a full buffer fills)
Set a hard timeout on TTS requests (if it takes >2 seconds to get the first audio frame, fall back immediately)
Use connection pooling and keep-alive to avoid TCP handshake overhead on every request
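The hard timeout on the first audio frame is the item in this list that most often gets skipped, so here is one way it might look. This is a sketch using only `asyncio`; `stream_with_first_frame_timeout`, `fast_tts`, and `slow_tts` are hypothetical names, and the 0.05s value is only there to keep the demo quick (the list above suggests 2 seconds).

```python
import asyncio


async def stream_with_first_frame_timeout(frames, first_frame_timeout=2.0):
    """Yield audio frames, raising asyncio.TimeoutError if the first is late."""
    it = frames.__aiter__()
    try:
        # Only the first frame is time-boxed; later frames stream normally.
        first = await asyncio.wait_for(it.__anext__(), timeout=first_frame_timeout)
    except StopAsyncIteration:
        return  # the backend produced no audio at all
    yield first
    async for frame in it:
        yield frame


async def fast_tts():
    """Hypothetical backend that answers immediately."""
    yield b"a"
    yield b"b"


async def slow_tts():
    """Hypothetical backend that stalls before its first frame."""
    await asyncio.sleep(0.2)
    yield b"late"


async def demo():
    fast = [f async for f in stream_with_first_frame_timeout(fast_tts(), 0.05)]
    try:
        [f async for f in stream_with_first_frame_timeout(slow_tts(), 0.05)]
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True  # this is where a caller would switch to the fallback
    return fast, timed_out


fast_frames, timed_out = asyncio.run(demo())
print(fast_frames, timed_out)
```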

Google Gemini 3.1 Flash TTS has native support for this pattern. OpenAI’s API also supports streaming, though integration is a bit more manual. Anthropic doesn’t have native TTS, so you’d layer in a service yourself.

The Compute Cost Reality

At scale, streaming TTS is cheaper than you’d expect. A typical voice interaction generates maybe 500-1000 characters of TTS output per user session. If you’re running millions of sessions, the TTS cost is real but usually dwarfed by LLM inference cost. The bandwidth for audio is also manageable: with a compressed codec, audio streaming typically comes to 10-20KB per second of speech.
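To make those numbers concrete, here is a back-of-the-envelope calculation. The inputs are illustrative assumptions, not quotes from any provider: 1M sessions a month, 750 characters per session (the midpoint of the range above), a placeholder price of $4 per 1M characters, roughly 50 seconds of speech per session, and 15KB per second of audio.

```python
def monthly_tts_cost(sessions: int, chars_per_session: int, usd_per_million_chars: float) -> float:
    """TTS spend in USD for a month of sessions."""
    return sessions * chars_per_session * usd_per_million_chars / 1_000_000


def audio_bandwidth_gb(sessions: int, speech_seconds_per_session: int, kb_per_second: int) -> float:
    """Audio egress in decimal gigabytes (1 GB = 1,000,000 KB)."""
    return sessions * speech_seconds_per_session * kb_per_second / 1_000_000


# All inputs below are assumed values for illustration only.
cost = monthly_tts_cost(sessions=1_000_000, chars_per_session=750, usd_per_million_chars=4.0)
bandwidth = audio_bandwidth_gb(sessions=1_000_000, speech_seconds_per_session=50, kb_per_second=15)
print(f"TTS cost: ${cost:,.0f} per month")
print(f"Audio bandwidth: {bandwidth:,.0f} GB per month")
```

Swap in your provider’s actual per-character price and your measured session lengths before drawing conclusions; the point is only that the formula is linear in all three inputs.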

Bottom line: Use streaming TTS from a reliable provider (Google Gemini 3.1 Flash TTS, OpenAI API, or similar) with aggressive chunking and a lightweight on-device fallback for network resilience. On-device-only is fine only if offline is a hard requirement; otherwise you’re trading obvious audio quality for marginal cost savings.

Question via Hacker News