AI Engineer

Mira — Voice AI Agent

Autonomous Voice AI Agent with ultra-low latency

<2s

Latency

1200ms

Speedup

43%

TTS Improvement

The Problem

Traditional voice AI agents suffer from high latency due to HTTP request/response overhead and a sequential STT → LLM → TTS pipeline, producing unnatural, turn-based conversations and a poor user experience, compounded by hallucinations in domain-specific tasks.

Constraints

  • Ultra-low latency requirement (<2s total round trip)
  • Seamless interruption handling (barge-in)
  • Complex context management with Retrieval-Augmented Generation
  • Scalable infrastructure with robust orchestration

Approach

Eliminated HTTP hop overhead by implementing direct STT and TTS API integrations over a custom binary WebSocket protocol. Implemented speculative RAG pre-fetching to perform similarity searches concurrently with STT processing, using Qdrant and local ONNX embeddings to effectively bypass LLM latency penalties for retrieval tasks.
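A custom binary WebSocket protocol of the kind described here typically prefixes each frame with a small fixed header. The sketch below shows one plausible encoding; the field layout (1-byte message type, 4-byte sequence number, 4-byte payload length) is an illustrative assumption, not Mira's actual wire format.

```typescript
// Illustrative binary frame: 1-byte message type, 4-byte sequence
// number, 4-byte payload length, then raw payload bytes.
// (Field layout is hypothetical, not the project's actual protocol.)
const HEADER_BYTES = 9;

enum MsgType { Audio = 0x01, Transcript = 0x02, Control = 0x03 }

function encodeFrame(type: MsgType, seq: number, payload: Uint8Array): Uint8Array {
  const buf = new Uint8Array(HEADER_BYTES + payload.length);
  const view = new DataView(buf.buffer);
  view.setUint8(0, type);
  view.setUint32(1, seq);            // big-endian by default
  view.setUint32(5, payload.length);
  buf.set(payload, HEADER_BYTES);
  return buf;
}

function decodeFrame(frame: Uint8Array): { type: MsgType; seq: number; payload: Uint8Array } {
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  const type = view.getUint8(0) as MsgType;
  const seq = view.getUint32(1);
  const len = view.getUint32(5);
  return { type, seq, payload: frame.subarray(HEADER_BYTES, HEADER_BYTES + len) };
}
```

Compared with HTTP, frames like this carry no per-message header negotiation, which is where the round-trip savings come from.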

Architecture

The real-time orchestration backend is built with TypeScript and Fastify within a Turborepo monorepo, chosen for high-throughput request handling. Infrastructure is provisioned via Terraform on AWS (ECS, ALB). The AI core relies on Gemini 2.5 Flash for rapid inference, enhanced by Qdrant for distributed vector search and in-memory local ONNX models (all-MiniLM-L6-v2) for low-latency embedding generation without external network calls.
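Once all-MiniLM-L6-v2 embeddings (384-dimensional vectors) are produced in-process, a small in-memory lookup reduces to cosine similarity. This is a minimal sketch of that step, assuming precomputed vectors; the ONNX inference call itself is omitted.

```typescript
// Cosine similarity over in-memory embedding vectors (e.g. the
// 384-dimensional output of all-MiniLM-L6-v2). Vectors are assumed
// precomputed; the ONNX inference step is out of scope here.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored documents against a query vector and keep the best k.
function topK(query: Float32Array, docs: { id: string; vec: Float32Array }[], k: number) {
  return docs
    .map(d => ({ id: d.id, score: cosineSimilarity(query, d.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

In production the ranking would be delegated to Qdrant; keeping a hot in-memory path like this avoids a network call for small, frequently hit collections.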

System Architecture
Input → Processing → LLM + Tools → Output → Store

Results

Achieved sub-2-second latency for complex voice interactions, saving 1200ms round-trip time through speculative RAG and removing secondary LLM calls. Reduced TTS generation latency by 43% via a parallel queue and smart sentence buffering, delivering a highly natural, interruption-capable user experience.
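The "smart sentence buffering" behind the TTS improvement can be sketched as follows: streamed LLM text accumulates in a buffer, and complete sentences are flushed to the TTS queue as soon as they appear, so synthesis starts before the full reply is generated. The class and its boundary rules are illustrative assumptions, not the project's actual code.

```typescript
// Sentence buffer: accumulates streamed LLM text and flushes complete
// sentences so TTS can begin before the full reply is generated.
// (Simplified sketch; names and boundary rules are illustrative.)
class SentenceBuffer {
  private pending = "";

  // Feed a streamed chunk; returns any complete sentences ready for TTS.
  push(chunk: string): string[] {
    this.pending += chunk;
    const ready: string[] = [];
    const boundary = /[^.!?]*[.!?]+\s*/g;
    let consumed = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      ready.push(match[0].trim());
      consumed = boundary.lastIndex;
    }
    this.pending = this.pending.slice(consumed);
    return ready;
  }

  // Flush whatever remains when the LLM stream ends.
  flush(): string | null {
    const rest = this.pending.trim();
    this.pending = "";
    return rest.length > 0 ? rest : null;
  }
}
```

Each flushed sentence would then be handed to a parallel TTS worker, so audio for sentence one plays while sentence two is still being synthesized.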

Learnings

  • Binary WebSocket protocols with custom headers significantly outperform HTTP polling for real-time voice streaming and orchestration.

  • Speculative execution in RAG (e.g., retrieving context during TTS playback or initial STT decoding) can drastically reduce perceived latency.

  • Parallel TTS processing and chunked audio buffering are critical for mimicking natural human delivery cadence.
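The speculative-execution idea above can be sketched as running retrieval on a partial transcript while STT finalization is still in flight, then reusing the result if the final transcript did not diverge. The function names (`finalizeStt`, `retrieve`) are stand-ins for real STT and Qdrant clients, and the divergence check is deliberately simplified.

```typescript
// Speculative RAG pre-fetch: start the vector search on a partial
// transcript while STT finalization is still running, so retrieval
// overlaps STT instead of adding to the critical path.
// (finalizeStt/retrieve are hypothetical stand-ins, not real clients.)
async function answerWithSpeculativeRag(
  partialTranscript: string,
  finalizeStt: () => Promise<string>,
  retrieve: (query: string) => Promise<string[]>,
): Promise<{ transcript: string; context: string[] }> {
  // Both run concurrently; neither awaits the other.
  const [transcript, speculativeContext] = await Promise.all([
    finalizeStt(),
    retrieve(partialTranscript),
  ]);
  // If the final transcript diverged, re-retrieve; otherwise reuse
  // the speculative results (a naive prefix check stands in for a
  // real divergence heuristic).
  const context = transcript.startsWith(partialTranscript)
    ? speculativeContext
    : await retrieve(transcript);
  return { transcript, context };
}
```

On the happy path the retrieval cost disappears entirely from the user-perceived round trip; the occasional re-retrieval is the price of the speculation.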
