Voice AI

Overview

Building voice AI applications with Twilio, LLMs, and speech services


Notes from building call-gpt, a generative AI phone-calling system.

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Caller    │────▶│   Twilio    │────▶│  Your App   │
│  (Phone)    │◀────│   Media     │◀────│  (Node.js)  │
└─────────────┘     │  Streams    │     └──────┬──────┘
                    └─────────────┘            │
                                               ▼
                    ┌──────────────────────────┴──────────────────────────┐
                    │                                                      │
              ┌─────▼─────┐     ┌─────────────┐     ┌─────────────────┐
              │   STT     │     │    LLM      │     │      TTS        │
              │ Deepgram  │────▶│  OpenAI/    │────▶│  Deepgram/      │
              │           │     │  Claude     │     │  ElevenLabs     │
              └───────────┘     └─────────────┘     └─────────────────┘
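
The link between the caller and the app is a bidirectional Twilio Media Stream. A minimal sketch of the TwiML that connects an incoming call to the app's WebSocket endpoint, using the official twilio Node helper library (the wss:// URL is a placeholder for your server):

```typescript
import twilio from "twilio";

// Respond to Twilio's incoming-call webhook with TwiML that opens a
// bidirectional Media Stream to our WebSocket endpoint.
const response = new twilio.twiml.VoiceResponse();
const connect = response.connect();
connect.stream({ url: "wss://your-server.example.com/connection" }); // placeholder URL

console.log(response.toString());
// <Response><Connect><Stream url="wss://your-server.example.com/connection"/></Connect></Response>
```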

Data Flow

  1. Incoming call → Twilio receives call, opens WebSocket to your server
  2. Audio in → Twilio streams caller audio (mulaw 8kHz) via WebSocket
  3. Speech-to-Text → Deepgram transcribes audio in real-time
  4. LLM Processing → OpenAI/Claude generates response (streaming)
  5. Text-to-Speech → Deepgram/ElevenLabs converts text to audio
  6. Audio out → Stream audio back through Twilio WebSocket to caller
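
A sketch of the WebSocket side of steps 1, 2, and 6, using the ws package. Twilio sends JSON messages with an event field ("connected", "start", "media", "stop"); inbound caller audio arrives as base64-encoded mulaw in media.payload, and synthesized audio goes back as a "media" message on the same socket. handleAudioChunk and sendAudio are placeholders for the STT/LLM/TTS pipeline:

```typescript
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws: WebSocket) => {
  let streamSid: string | undefined;

  ws.on("message", (data) => {
    const msg = JSON.parse(data.toString());

    switch (msg.event) {
      case "start": {
        // Twilio tells us which stream this socket carries.
        streamSid = msg.start.streamSid;
        break;
      }
      case "media": {
        // Inbound caller audio: base64-encoded mulaw at 8kHz.
        const audioChunk = Buffer.from(msg.media.payload, "base64");
        handleAudioChunk(audioChunk); // placeholder: feed STT (e.g. Deepgram)
        break;
      }
      case "stop": {
        ws.close();
        break;
      }
    }
  });
});

// Placeholder: in the real app this forwards audio into the STT -> LLM -> TTS pipeline.
function handleAudioChunk(chunk: Buffer): void {}

// Sending synthesized audio back to the caller: base64 mulaw in a "media" message.
function sendAudio(ws: WebSocket, streamSid: string, base64Mulaw: string): void {
  ws.send(JSON.stringify({ event: "media", streamSid, media: { payload: base64Mulaw } }));
}
```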

Key Features

  Feature                  Implementation
  Low latency (~1s)        Streaming at every stage
  Interruption handling    Detect speech, cancel current response
  Conversation history     Maintain context with LLM
  Function calling         LLM can trigger external tools
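
For interruption handling, the basic move is: when STT reports caller speech while the bot is still talking, abort the in-flight LLM/TTS work and tell Twilio to drop any audio it has already buffered, via the Media Streams "clear" message. A hedged sketch; abortController and streamSid are assumed to come from the surrounding call state:

```typescript
import { WebSocket } from "ws";

// Called when STT detects the caller speaking over the bot's response.
function handleInterruption(ws: WebSocket, streamSid: string, abortController: AbortController): void {
  // Stop generating: cancels in-flight LLM/TTS requests tied to this controller.
  abortController.abort();

  // Tell Twilio to discard any audio already queued for playback on the call.
  ws.send(JSON.stringify({ event: "clear", streamSid }));
}
```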

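Conversation history and low latency tie together: each user transcript and assistant reply is appended to a messages array that is resent every turn, and the reply is consumed as a token stream so TTS can start before the full response exists. A sketch using the openai Node SDK; the model name and the speakSentence TTS hook are assumptions:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Conversation history: grows by one user and one assistant message per turn.
const messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[] = [
  { role: "system", content: "You are a helpful phone assistant. Keep answers short." },
];

async function respond(transcript: string): Promise<void> {
  messages.push({ role: "user", content: transcript });

  const stream = await openai.chat.completions.create({
    model: "gpt-4o", // assumption: any streaming chat model works here
    messages,
    stream: true,
  });

  let reply = "";
  let pending = "";
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content ?? "";
    reply += delta;
    pending += delta;

    // Flush complete sentences to TTS early instead of waiting for the full reply.
    const match = pending.match(/^(.*?[.!?])\s/s);
    if (match) {
      speakSentence(match[1]); // placeholder TTS hook (Deepgram/ElevenLabs)
      pending = pending.slice(match[0].length);
    }
  }
  if (pending.trim()) speakSentence(pending.trim());

  messages.push({ role: "assistant", content: reply });
}

// Placeholder: sends text to the TTS service and streams the audio back to Twilio.
function speakSentence(text: string): void {}
```
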
Services Used