AI Engineer
Mira — Voice AI Agent
Autonomous Voice AI Agent with ultra-low latency
<2s — total round-trip latency
1200ms — round-trip time saved
43% — TTS latency reduction
The Problem
Traditional voice AI agents suffer from high latency: HTTP request/response overhead and a sequential STT -> LLM -> TTS pipeline force unnatural, turn-based conversations. The poor experience is compounded by hallucinations on domain-specific tasks.
Constraints
1. Ultra-low latency requirement (<2s total round trip)
2. Seamless interruption handling (barge-in)
3. Complex context management with Retrieval-Augmented Generation (RAG)
4. Scalable infrastructure with robust orchestration
Approach
Eliminated HTTP hop overhead with direct STT and TTS API integrations over a custom binary WebSocket protocol. Added speculative RAG pre-fetching that runs similarity searches concurrently with STT decoding, using Qdrant and local ONNX embeddings to keep retrieval off the LLM's critical path.
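The speculative pre-fetching idea can be sketched as follows: retrieval is kicked off on the partial transcript while STT is still finalizing, so vector-search latency overlaps with decoding instead of adding to it. All names here are illustrative (the project's actual interfaces are not shown), and `retrieve` stands in for a Qdrant similarity search.

```typescript
type Doc = { id: string; text: string };

// Hypothetical sketch of speculative RAG pre-fetching. `finalizeStt` resolves
// with the final transcript; `retrieve` is an opaque vector search.
async function speculativePrefetch(
  partialTranscript: string,
  finalizeStt: () => Promise<string>,
  retrieve: (query: string) => Promise<Doc[]>,
): Promise<{ transcript: string; docs: Doc[] }> {
  // Start retrieval on the partial text while STT finishes in parallel.
  const speculative = retrieve(partialTranscript);
  const transcript = await finalizeStt();

  // If the final transcript matches the speculation, results are already in flight.
  if (transcript.trim() === partialTranscript.trim()) {
    return { transcript, docs: await speculative };
  }
  // Misprediction: fall back to a fresh retrieval on the final transcript.
  return { transcript, docs: await retrieve(transcript) };
}
```

On a correct speculation the retrieval cost is fully hidden behind STT finalization; a misprediction costs one extra search, which is the usual speculative-execution trade-off.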
Architecture
The real-time orchestration backend is built with TypeScript and Fastify in a Turborepo monorepo, optimized for throughput. Infrastructure is provisioned via Terraform on AWS (ECS, ALB). The AI core uses Gemini 2.5 Flash for fast inference, Qdrant for distributed vector search, and in-memory local ONNX models (all-MiniLM-L6-v2) for low-latency embedding generation without external network calls.
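A minimal sketch of why in-process embeddings avoid a network hop: once a local model (e.g. all-MiniLM-L6-v2 via ONNX Runtime) has produced vectors, ranking candidate chunks is plain arithmetic. The model invocation itself is elided here; the vectors and `rank` helper are illustrative stand-ins, not the project's API.

```typescript
// Cosine similarity between two dense embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank candidate chunks against a query vector, highest similarity first.
function rank(query: number[], chunks: { id: string; vec: number[] }[]) {
  return [...chunks]
    .map(c => ({ id: c.id, score: cosineSimilarity(query, c.vec) }))
    .sort((x, y) => y.score - x.score);
}
```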
Results
Achieved sub-2-second latency for complex voice interactions, saving 1200ms of round-trip time through speculative RAG and the removal of secondary LLM calls. Reduced TTS generation latency by 43% via a parallel synthesis queue and smart sentence buffering, delivering a highly natural, interruption-capable user experience.
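The sentence-buffering idea can be illustrated with a small sketch: LLM tokens accumulate in a buffer and are flushed to the TTS queue at sentence boundaries, so synthesis of sentence N overlaps with generation of sentence N+1. The class and boundary heuristic below are assumptions for illustration, not the project's actual code.

```typescript
// Accumulates streamed LLM tokens and emits complete sentences to a TTS sink.
class SentenceBuffer {
  private buf = "";
  constructor(private onSentence: (s: string) => void) {}

  push(token: string): void {
    this.buf += token;
    // Flush every complete sentence: text ending in . ! or ?
    // followed by whitespace or end of buffer.
    let m: RegExpMatchArray | null;
    while ((m = this.buf.match(/^(.*?[.!?])(\s+|$)/s)) !== null) {
      this.onSentence(m[1].trim());
      this.buf = this.buf.slice(m[0].length);
    }
  }

  // Emit any trailing partial sentence when the LLM stream ends.
  flush(): void {
    if (this.buf.trim()) this.onSentence(this.buf.trim());
    this.buf = "";
  }
}
```

Feeding each emitted sentence straight into a parallel TTS queue means audio for the first sentence can start playing while the rest of the reply is still being generated.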
Learnings
1. Binary WebSocket protocols with custom headers significantly outperform HTTP polling for real-time voice streaming and orchestration.
2. Speculative execution in RAG (e.g., retrieving context during TTS playback or initial STT decoding) can drastically reduce perceived latency.
3. Parallel TTS processing and chunked audio buffering are critical for mimicking natural human delivery cadence.
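As a concrete illustration of the first learning, a binary frame with a small custom header can be encoded and decoded without any text parsing. The layout below (1-byte message type + 4-byte sequence number before the audio payload) is an assumed example; the project's actual wire format is not shown here.

```typescript
// Assumed frame layout: [type: u8][seq: u32 big-endian][payload...]
const HEADER_BYTES = 5;

function encodeFrame(type: number, seq: number, payload: Uint8Array): Uint8Array {
  const frame = new Uint8Array(HEADER_BYTES + payload.length);
  const view = new DataView(frame.buffer);
  view.setUint8(0, type);   // message type (e.g. 0x01 = audio chunk)
  view.setUint32(1, seq);   // monotonically increasing sequence number
  frame.set(payload, HEADER_BYTES);
  return frame;
}

function decodeFrame(frame: Uint8Array): { type: number; seq: number; payload: Uint8Array } {
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  return {
    type: view.getUint8(0),
    seq: view.getUint32(1),
    payload: frame.subarray(HEADER_BYTES),
  };
}
```

Frames like this ride inside standard WebSocket binary messages, so each audio chunk carries its own routing metadata with a fixed 5-byte cost instead of per-request HTTP headers.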