Voice-enabled AI assistant accessible via phone call. This backend service orchestrates telephony, speech recognition, AI processing, and text-to-speech synthesis to enable natural conversations with AI through any phone.
HeyAI Backend is a Go-based microservice that serves as the orchestration layer between Twilio Voice API, external AI agents, and ElevenLabs text-to-speech. Users can call a phone number, speak their questions, and receive AI-generated responses in natural-sounding voice.
The system follows a multi-tier architecture with the following components:
- User Interaction: User dials the Twilio phone number and speaks a question
- Twilio Voice API: Receives the call, transcribes speech to text, and forwards to backend
- HeyAI Backend (Go): Processes the request and orchestrates:
- Text-to-speech conversion via ElevenLabs API
- AI response generation via external agent service
- Authorization and call management
- External AI Agents: Python-based AI service (Sesame AI or 11 Labs) hosted on Cloud Run
- Dashboard Backend: Manages agent connections and analytics
- BigQuery: Stores call logs and analytics data
- Admin Console: Frontend interface for managing agents and viewing analytics
```
User Call → Twilio Voice API → HeyAI Backend (Go) → External AI Agent
                                      ↓
                               ElevenLabs TTS
                                      ↓
                              Dashboard Backend
                                      ↓
                                  BigQuery
```
- Language: Go 1.25.4
- Runtime: Google Cloud Run (serverless containers)
- Containerization: Docker with multi-stage builds
- CI/CD: Google Cloud Build
- Telephony: Twilio Voice API
  - Speech recognition (speech-to-text)
  - Call management and routing
  - TwiML response handling
- AI Processing: External Python AI Service
  - Vertex AI hosted Gemini 2.5 Flash
  - Streaming response support
  - Custom agent endpoints
- Voice Synthesis: ElevenLabs API
  - Text-to-speech conversion
  - High-quality voice generation
  - MP3 audio streaming
- Cloud Run: Serverless container hosting
- Artifact Registry: Container image storage
- Secret Manager: Secure credential management
- Cloud Build: Automated CI/CD pipeline
- BigQuery: Analytics and call data storage (planned)
```
cloud.google.com/go/vertexai v0.15.0
github.com/joho/godotenv v1.5.1
```
Initial Twilio webhook endpoint that handles incoming calls.
Response: TwiML XML instructing Twilio to gather speech input
Example Response:
```xml
<Response>
  <Say voice="alice">Hi — welcome. Please ask your question after the beep.</Say>
  <Gather input="speech" action="/speech-result" method="POST" speechTimeout="auto"/>
</Response>
```

Processes transcribed speech from Twilio and generates AI responses.
Request Parameters:
- `SpeechResult`: Transcribed user speech from Twilio
- `From`: Caller's phone number
Response: TwiML XML with audio playback and continuation prompt
Flow:
- Receives transcribed speech from Twilio
- Forwards question to external AI agent service
- Generates audio from AI response via ElevenLabs
- Returns TwiML with audio URL and continuation prompt
Generates and streams text-to-speech audio.
Query Parameters:
- `text`: Text to convert to speech
Response: MP3 audio stream (audio/mpeg)
Implementation:
- Calls ElevenLabs API with configured voice ID
- Streams MP3 audio directly to caller
- Includes cache control headers
Required environment variables:
```
# ElevenLabs Configuration
ELEVENLABS_API_KEY=your_elevenlabs_api_key
ELEVEN_VOICE_ID=your_voice_id

# Server Configuration
PORT=8080

# Google Cloud Configuration (for Cloud Run deployment)
GCP_PROJECT_ID=your_project_id
GCP_REGION=us-central1

# External Services
KOOZIE_AGENT_URI=https://your-agent-service.run.app
```

Secrets are managed via Google Cloud Secret Manager in production:

- `ELEVENLABS_API_KEY`: ElevenLabs API authentication
- `ELEVEN_VOICE_ID`: Voice model identifier
- Install Go 1.25 or higher
- Clone the repository
- Copy `.env.example` to `.env` and configure variables
- Install dependencies: `go mod download`
- Run the server: `go run main.go`
- Expose local server with ngrok: `ngrok http 8080`
- Configure Twilio webhook URL to ngrok endpoint
The service is deployed to Google Cloud Run via Cloud Build:
- Build: Multi-stage Docker build creates optimized binary
- Push: Image pushed to Artifact Registry
- Deploy: Cloud Run service updated with new image
Deployment Command:
```
gcloud builds submit --config cloudbuild.yaml
```

Cloud Run Configuration:
- Platform: Managed
- Region: us-central1
- Port: 8080
- Authentication: Allow unauthenticated (for Twilio webhooks)
- Secrets: Injected from Secret Manager
- Natural language conversation via phone call
- Multi-turn conversation support with context
- Real-time speech-to-text via Twilio
- AI response generation via external agent service
- High-quality text-to-speech via ElevenLabs
- Graceful error handling and fallbacks
- Conversation termination on user request
- Structured logging for debugging
- Secure credential management
- Containerized deployment
- Auto-scaling serverless infrastructure
- Call recording and transcription storage
- BigQuery integration for analytics
- Multi-language support
- Custom voice selection per agent
- WebSocket streaming for reduced latency
- Admin dashboard integration
- Usage metrics and monitoring
The backend communicates with a Python-based AI service that handles:
- Gemini 2.5 Flash model inference
- Streaming response generation
- Context management
- Agent-specific logic
API Contract:
```
POST /chat
{
  "message": "user question"
}
```

Response: Server-Sent Events (SSE) stream:

```
data: {"text": "response chunk"}
```

Twilio webhooks are configured to point to:
- `/voice` - Initial call handling
- `/speech-result` - Speech processing
Planned integration for:
- Call analytics
- Agent management
- Usage tracking
- BigQuery data storage
```
HeyAI-backend/
├── main.go           # Main application entry point
├── go.mod            # Go module dependencies
├── go.sum            # Dependency checksums
├── Dockerfile        # Multi-stage container build
├── cloudbuild.yaml   # Cloud Build CI/CD configuration
├── .env              # Local environment variables
├── .gitignore        # Git ignore rules
└── README.md         # This file
```
- `voiceHandler`: Handles initial Twilio call webhook
- `speechResultHandler`: Processes speech and generates responses
- `audioHandler`: Streams TTS audio
- `askPythonAI`: Communicates with external AI service
- `generateElevenLabsAudio`: Generates speech from text
- Response latency: Sub-3 seconds from question to audio playback
- Concurrent request handling via Go's native concurrency
- Stateless design for horizontal scaling
- Optimized Docker image with distroless base
- Secrets stored in Google Cloud Secret Manager
- Non-root container execution
- HTTPS-only communication
- Environment variable validation
- Input sanitization for TwiML generation
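The TwiML input-sanitization point can be illustrated with the standard library's XML escaper. A sketch only, assuming caller transcripts or AI replies are interpolated into TwiML as text (the README does not show the actual sanitization code):

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// xmlEscape escapes text before it is interpolated into a TwiML
// response, so a transcript or AI reply containing <, >, or &
// cannot break (or inject markup into) the XML Twilio parses.
func xmlEscape(s string) string {
	var b bytes.Buffer
	xml.EscapeText(&b, []byte(s)) // writes to a bytes.Buffer, which never errors
	return b.String()
}

func main() {
	fmt.Printf("<Say>%s</Say>\n", xmlEscape("Tom & Jerry <tag>"))
	// <Say>Tom &amp; Jerry &lt;tag&gt;</Say>
}
```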
MIT License
