Voice AI

Overview

Building voice AI applications with Twilio, LLMs, and speech services


Notes from building call-gpt, a generative AI phone-calling system.

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Caller    │────▶│   Twilio    │────▶│  Your App   │
│  (Phone)    │◀────│   Media     │◀────│  (Node.js)  │
└─────────────┘     │  Streams    │     └──────┬──────┘
                    └─────────────┘            │
                                               ▼
                    ┌──────────────────────────┴──────────────────────────┐
                    │                                                      │
              ┌─────▼─────┐     ┌─────────────┐     ┌─────────────────┐
              │   STT     │     │    LLM      │     │      TTS        │
              │ Deepgram  │────▶│  OpenAI/    │────▶│  Deepgram/      │
              │           │     │  Claude     │     │  ElevenLabs     │
              └───────────┘     └─────────────┘     └─────────────────┘
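
The link between the caller and the app is a bidirectional Twilio Media Stream. A minimal sketch of the TwiML that connects an incoming call to the app's WebSocket endpoint, using the official twilio Node helper library (the wss:// URL is a placeholder for your server):

```typescript
import twilio from "twilio";

// Respond to Twilio's incoming-call webhook with TwiML that opens a
// bidirectional Media Stream to our WebSocket endpoint.
const response = new twilio.twiml.VoiceResponse();
const connect = response.connect();
connect.stream({ url: "wss://your-server.example.com/connection" }); // placeholder URL

console.log(response.toString());
// <Response><Connect><Stream url="wss://your-server.example.com/connection"/></Connect></Response>
```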

Data Flow

  1. Incoming call → Twilio receives call, opens WebSocket to your server
  2. Audio in → Twilio streams caller audio (mulaw 8kHz) via WebSocket
  3. Speech-to-Text → Deepgram transcribes audio in real-time
  4. LLM Processing → OpenAI/Claude generates response (streaming)
  5. Text-to-Speech → Deepgram/ElevenLabs converts text to audio
  6. Audio out → Stream audio back through Twilio WebSocket to caller
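
A sketch of the WebSocket side of steps 1, 2, and 6, using the ws package. Twilio sends JSON messages with an event field ("connected", "start", "media", "stop"); inbound caller audio arrives as base64-encoded mulaw in media.payload, and synthesized audio goes back as a "media" message on the same socket. handleAudioChunk and sendAudio are placeholders for the STT/LLM/TTS pipeline:

```typescript
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws: WebSocket) => {
  let streamSid: string | undefined;

  ws.on("message", (data) => {
    const msg = JSON.parse(data.toString());

    switch (msg.event) {
      case "start": {
        // Twilio tells us which stream this socket carries.
        streamSid = msg.start.streamSid;
        break;
      }
      case "media": {
        // Inbound caller audio: base64-encoded mulaw at 8kHz.
        const audioChunk = Buffer.from(msg.media.payload, "base64");
        handleAudioChunk(audioChunk); // placeholder: feed STT (e.g. Deepgram)
        break;
      }
      case "stop": {
        ws.close();
        break;
      }
    }
  });
});

// Placeholder: in the real app this forwards audio into the STT -> LLM -> TTS pipeline.
function handleAudioChunk(chunk: Buffer): void {}

// Sending synthesized audio back to the caller: base64 mulaw in a "media" message.
function sendAudio(ws: WebSocket, streamSid: string, base64Mulaw: string): void {
  ws.send(JSON.stringify({ event: "media", streamSid, media: { payload: base64Mulaw } }));
}
```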

Key Features

  Feature                  Implementation
  Low latency (~1s)        Streaming at every stage
  Interruption handling    Detect speech, cancel current response
  Conversation history     Maintain context with LLM
  Function calling         LLM can trigger external tools
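
For interruption handling, the basic move is: when STT reports caller speech while the bot is still talking, abort the in-flight LLM/TTS work and tell Twilio to drop any audio it has already buffered, via the Media Streams "clear" message. A hedged sketch; abortController and streamSid are assumed to come from the surrounding call state:

```typescript
import { WebSocket } from "ws";

// Called when STT detects the caller speaking over the bot's response.
function handleInterruption(ws: WebSocket, streamSid: string, abortController: AbortController): void {
  // Stop generating: cancels in-flight LLM/TTS requests tied to this controller.
  abortController.abort();

  // Tell Twilio to discard any audio already queued for playback on the call.
  ws.send(JSON.stringify({ event: "clear", streamSid }));
}
```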

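Conversation history and low latency tie together: each user transcript and assistant reply is appended to a messages array that is resent every turn, and the reply is consumed as a token stream so TTS can start before the full response exists. A sketch using the openai Node SDK; the model name and the speakSentence TTS hook are assumptions:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Conversation history: grows by one user and one assistant message per turn.
const messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[] = [
  { role: "system", content: "You are a helpful phone assistant. Keep answers short." },
];

async function respond(transcript: string): Promise<void> {
  messages.push({ role: "user", content: transcript });

  const stream = await openai.chat.completions.create({
    model: "gpt-4o", // assumption: any streaming chat model works here
    messages,
    stream: true,
  });

  let reply = "";
  let pending = "";
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    reply += delta;
    pending += delta;

    // Flush complete sentences to TTS early instead of waiting for the full reply.
    const match = pending.match(/^(.*?[.!?])\s/s);
    if (match) {
      speakSentence(match[1]); // placeholder TTS hook (Deepgram/ElevenLabs)
      pending = pending.slice(match[0].length);
    }
  }
  if (pending.trim()) speakSentence(pending.trim());

  messages.push({ role: "assistant", content: reply });
}

// Placeholder: sends text to the TTS service and streams the audio back to Twilio.
function speakSentence(text: string): void {}
```
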
Services Used