InstantAIguru Twilio ConversationRelay Architecture Documentation
Overview
The InstantAIguru Twilio ConversationRelay implementation is a voice AI system built from three layers:
- Twilio ConversationRelay - Handles telephony, WebSocket connections, STT/TTS
- CloudFlare Worker - Edge-based intelligent relay with preprocessing and interrupt handling
- AWS Lambda - Backend AI processing with RAG retrieval and response generation
This architecture enables real-time voice conversations with low latency, streaming responses, and sophisticated intent detection.
Architecture Layers
Layer 1: Twilio ConversationRelay (Telephony Layer)
Responsibilities:
- Receive incoming phone calls
- Handle audio streaming to/from caller
- Speech-to-Text (STT) conversion
- Text-to-Speech (TTS) playback
- DTMF digit collection
- WebSocket connection management to CloudFlare Worker
Key Features:
- Real-time bidirectional audio streaming
- Google Cloud STT with telephony-optimized model
- Configurable TTS providers (Google, Amazon Polly, Microsoft Azure)
- Voice and language customization per phone number
- Interrupt detection (when user speaks during AI response)
- DTMF detection for numeric input workflows
- Session handoff with custom parameters
TwiML Configuration: The TwiML response uses the <Connect> verb with the <ConversationRelay> noun to establish a WebSocket connection. It configures parameters for STT/TTS providers, voice selection, and language. Custom <Parameter> tags are used to pass session context (user email, thread ID, configuration file, API key) to the WebSocket server.
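As a rough illustration of the TwiML described above, the sketch below assembles a <Connect>/<ConversationRelay> document from a plain options object. The helper names (escapeXml, buildConversationRelayTwiml) are hypothetical and not taken from the codebase; the attribute and parameter names follow the examples shown elsewhere in this document.

```javascript
// Escape characters that are unsafe inside XML attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: builds the TwiML that connects a call to the WebSocket server.
function buildConversationRelayTwiml({ url, welcomeGreeting, language, ttsProvider, voice, parameters }) {
  const attrs = { url, welcomeGreeting, language, ttsProvider, voice };
  const attrText = Object.entries(attrs)
    .filter(([, v]) => v !== undefined)
    .map(([k, v]) => `${k}="${escapeXml(v)}"`)
    .join(" ");
  // Each custom parameter becomes one <Parameter> child element.
  const paramText = Object.entries(parameters || {})
    .map(([name, value]) => `      <Parameter name="${escapeXml(name)}" value="${escapeXml(value)}"/>`)
    .join("\n");
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <ConversationRelay ${attrText}>`,
    paramText,
    "    </ConversationRelay>",
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

Escaping every attribute value matters here because display names and other caller-derived values can contain characters such as & or quotes that would otherwise produce invalid XML.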
Layer 2: CloudFlare Worker (Edge Relay Layer)
File: aws-functionurl-mapper/src/index.js
Responsibilities:
- WebSocket endpoint for Twilio ConversationRelay
- Edge-based preprocessing for simple intents (greetings, farewells)
- SSE stream consumption from AWS Lambda
- Real-time response chunking for TTS optimization
- Interrupt handling and session state management
- Multi-language support (English, Spanish)
WebSocket Routes: The system exposes specific routes for production and development environments, as well as distinct endpoints for English and Spanish language support.
Key Functions:
handleSession()
Sets up WebSocket connection and routes messages:
handleSession initializes the WebSocket connection and sets up event listeners. It handles the setup message to extract TwiML parameters into the session context, the prompt message to process user speech, dtmf for digit collection, and interrupt to handle user interruptions. It also ensures proper cleanup on connection close.
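The routing described above might look roughly like the following. This is a minimal sketch, not the worker's actual code: routeMessage and the handlers object are hypothetical names, and the dtmf message is assumed to carry its key press in a digit field.

```javascript
// Hypothetical dispatcher: routes one raw ConversationRelay message to a handler
// and returns a label describing what was done.
function routeMessage(session, raw, handlers) {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    return "invalid"; // malformed JSON is ignored rather than crashing the session
  }
  switch (msg.type) {
    case "setup":
      // Copy the TwiML <Parameter> values and call metadata into the session context.
      session.context = {
        ...(msg.customParameters || {}),
        callSid: msg.callSid,
        from: msg.from,
        to: msg.to,
      };
      session.busy = false;
      session.break = false;
      session.inLambda = false;
      return "setup";
    case "prompt":
      handlers.onPrompt(session, msg.voicePrompt);
      return "prompt";
    case "dtmf":
      handlers.onDtmf(session, msg.digit); // field name is an assumption
      return "dtmf";
    case "interrupt":
      handlers.onInterrupt(session);
      return "interrupt";
    default:
      return "ignored";
  }
}
```

In a worker, a function like this would be called from the WebSocket message listener, with onPrompt wired to processInput() and onInterrupt wired to breakSession().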
processInput()
Core processing function that handles preprocessing and Lambda communication:
Preprocessing Flow: processInput is the core logic function. It first checks for edge-based preprocessing matches (using AI or static rules). If a match is found, it sends an immediate response. If not, it encodes the user prompt and context, connects to the AWS Lambda backend via SSE, and streams the response chunks back to the client. It handles the parsing of SSE events (chunks, answers, context updates) and manages the session state (busy, inLambda).
Preprocessing Functions:
The worker includes 70+ regex patterns for fast intent detection. They recognize questions and requests, as well as specific intents like 'greeting' and 'farewell'. Simple intents matched at the edge get an immediate response without backend latency.
System Message to Filler Mapping:
The worker converts Lambda progress messages (e.g., "Analyzing", "Searching", "Evaluating") into natural conversational fillers. This function maps system states to random variations of "hold music" phrases (e.g., "Let me check that...") to maintain user engagement during processing delays.
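One possible shape for this mapping is sketched below. Only "Let me check that..." comes from this document; the other phrases, the FILLERS table, and the progressToFiller name are placeholders chosen for illustration.

```javascript
// Illustrative filler phrases keyed by the leading word of a Lambda progress message.
const FILLERS = {
  analyzing: ["Let me check that...", "One moment..."],
  searching: ["Let me look that up...", "Searching for that now..."],
  evaluating: ["Just a second...", "Almost there..."],
};

// Returns a random filler for a known progress message, or null when the
// message has no spoken equivalent. `random` is injectable for testing.
function progressToFiller(progress, random = Math.random) {
  const text = String(progress).toLowerCase();
  const key = Object.keys(FILLERS).find((k) => text.startsWith(k));
  if (!key) return null;
  const options = FILLERS[key];
  return options[Math.floor(random() * options.length)];
}
```

Returning null for unmapped messages lets the caller skip progress events (such as a rephrased question) that should never be spoken to the caller.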
breakSession()
Cancels ongoing Lambda requests when user interrupts:
breakSession handles interruption logic. It sets a break flag, cancels any ongoing Lambda requests using an AbortController, cancels the SSE reader, and waits for the session busy state to clear, ensuring the system is ready for the next user input immediately.
Session State Management:
The WebSocket object maintains distinct flags to track the session status:
- busy: Indicates if the system is currently processing a prompt.
- break: Signals that a user interruption has occurred.
- inLambda: Tracks if a backend AI request is pending.
- reader: Stores the active stream reader to allow for immediate cancellation.
- context: Preserves conversation context and parameters across turns.
Layer 3: AWS Lambda (Backend Processing Layer)
File: functions/handler.js
Functions:
twilioVoice
Generates initial TwiML to establish ConversationRelay connection:
twilioVoice verifies the Twilio request signature, sanitizes the phone number, and loads the specific configuration file for that number. It initializes the chat context manager and determines the correct WebSocket URL (Development/Production, English/Spanish). Finally, it generates and returns the XML response containing the <ConversationRelay> instruction with all necessary context parameters.
processHandoffChoice
Handles user input during transfer scenarios:
processHandoffChoice processes the results of a Twilio <Gather> action initiated during a handoff or menu flow. It evaluates the user's input (DTMF or speech) against configured options and generates the subsequent TwiML to route the call (e.g., dialing a specific phone number, playing a message, or hanging up).
askStream
Handles streaming responses via SSE:
askStream is an AWS Lambda function that uses response streaming. It accepts the prompt and context from the query parameters, validates authentication, and establishes an SSE stream. It initializes the chat context and invokes handleQuery to process the request, streaming the results back to the client.
handleQuery
Core query processing with RAG retrieval:
handleQuery orchestrates the AI processing. It performs pre-qualification to analyze intent and potentially rephrase the question. Based on the analysis, it either generates a direct response or triggers a RAG (Retrieval-Augmented Generation) workflow (doRAG) to search the knowledge base and generate an informed answer. It reports progress via the stream.
doRAG
Retrieval-Augmented Generation with streaming:
doRAG executes the retrieval-augmented generation process. It searches the OpenSearch index for relevant documents, builds a context block from the retrieval results, and calls the AI model (e.g., Claude, GPT) to generate a response based on the retrieved information, streaming chunks of the answer as they are generated.
returnStreamResponse
Finalizes SSE stream:
returnStreamResponse finalizes the SSE stream by writing the final data object and closing the stream connection.
2. WebSocket Connection Establishment
Twilio ConversationRelay receives TwiML
↓
Initiates WebSocket connection to wss://stream.instantaiguru.com/ws
↓
CloudFlare Worker receives WebSocket upgrade request
↓
handleSession() starts listening
CloudFlare Worker Processing:
- Accepts WebSocket upgrade request
- Creates WebSocketPair (client ↔ server)
- Returns client WebSocket to Twilio (101 Switching Protocols)
- Starts handleSession() on server WebSocket
- Stores connection parameters (Function URL, Dev flag, Language, Env)
- Sets up event listeners for:
  - message → Route to handler based on type
  - close → Clean up session
  - error → Handle errors gracefully
3. Session Setup
Twilio ConversationRelay sends "setup" message
↓
handleSession() processes setup
↓
Stores context from TwiML Parameters
Setup Message (from Twilio):
{
"type": "setup",
"callSid": "CA123abc456def",
"from": "+12025551234",
"to": "+18005551234",
"customParameters": {
"user_email": "+12025551234",
"display_name": "John Doe",
"thread_id": "CA123abc456def",
"config_file": "config.v-18005551234.json",
"api_key": "api-key-xyz",
"guru_name": "AI Assistant",
"checkLiveTransfer": "true"
}
}
CloudFlare Worker Processing:
- Extracts customParameters
- Stores context on WebSocket object:
- Detailed session parameters (email, thread ID, config files).
- Call metadata (SID, timestamps).
- Initializes session state flags (busy, break, inLambda, etc.).
4. User Prompt Processing
User speaks
↓
Twilio ConversationRelay performs Speech-to-Text
↓
Sends "prompt" message with voicePrompt
↓
handleSession() routes to processInput()
Prompt Message (from Twilio):
{
"type": "prompt",
"voicePrompt": "What are your business hours?"
}
5. Edge Preprocessing (CloudFlare Worker)
processInput() receives voicePrompt
↓
Try AI preprocessing (if enabled)
↓
Try static preprocessing
↓
If match found, send immediate response
↓
Otherwise, forward to Lambda
AI Preprocessing (Optional): The system attempts to classify the user's intent using a fast AI model (e.g., Claude Haiku). This model categorizes inputs into intents such as 'greeting', 'farewell', 'question', or 'live_agent_request'. If a simple intent like a greeting is detected, the worker generates a response immediately without invoking the potentially slower main backend.
Static Preprocessing: For even lower latency, the system checks the input against a set of regex patterns and static rules:
- Greeting Intent: Matches words like "hi", "hello".
- Farewell Intent: Matches words like "goodbye", "see you".
- Question Detection: Uses 70+ regex patterns to identify if the input is a question (e.g., starts with "what", "how", "can you").
If a static match is found (e.g., a greeting), a random friendly response is selected and sent immediately. If the input is identified as a complex question, or no static match is found, it is forwarded to the main Lambda backend.
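A heavily simplified sketch of this static matching follows. The three patterns and the reply strings are illustrative stand-ins, not the worker's actual 70+ patterns, and staticPreprocess is a hypothetical name.

```javascript
// Compiled once at module scope so no regex is rebuilt per request.
const GREETING_RE = /^(hi|hello|hey)\b[\s!.,]*$/i;
const FAREWELL_RE = /^(goodbye|bye|see you)\b[\s!.,]*$/i;
const QUESTION_RE = /^(what|how|when|where|why|who|can you|could you)\b|\?\s*$/i;

const GREETING_REPLIES = [
  "Hello! How can I help you today?",
  "Hi there! What can I do for you?",
];

// Returns the detected intent and, for simple intents, an immediate response.
// A null response means the prompt should be forwarded to the Lambda backend.
function staticPreprocess(voicePrompt, random = Math.random) {
  const text = String(voicePrompt).trim();
  if (QUESTION_RE.test(text)) return { intent: "question", response: null };
  if (GREETING_RE.test(text)) {
    const reply = GREETING_REPLIES[Math.floor(random() * GREETING_REPLIES.length)];
    return { intent: "greeting", response: reply };
  }
  if (FAREWELL_RE.test(text)) return { intent: "farewell", response: "Goodbye!" };
  return { intent: "unknown", response: null };
}
```

Checking for questions first keeps an utterance like "Hello, what are your hours?" from being answered with a bare greeting.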
If Preprocessed Response Found: When a locally generated response is available, it is sent directly to Twilio. The system then checks if the response implies terminating the call (e.g., "Goodbye") and sends the appropriate end-session signal if necessary.
6. Lambda Request (SSE Fetch)
No preprocessing match
↓
processInput() constructs Lambda URL
↓
Fetches Lambda askStream endpoint
↓
Opens SSE ReadableStream
Lambda URL Construction: The worker constructs the URL for the Lambda askStream endpoint. It appends the user's question and the full session context (encoded as a JSON string) as query parameters.
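A minimal sketch of that URL construction is shown below. The query parameter names question and context are assumptions (the document only states that both values travel as query parameters), and buildAskStreamUrl is a hypothetical helper.

```javascript
// Appends the user's question and the JSON-encoded session context to the
// Lambda Function URL. URLSearchParams handles percent-encoding.
function buildAskStreamUrl(functionUrl, question, context) {
  const url = new URL(functionUrl);
  url.searchParams.set("question", question); // parameter name is an assumption
  url.searchParams.set("context", JSON.stringify(context)); // parameter name is an assumption
  return url.toString();
}
```

Using the URL API rather than string concatenation avoids double-encoding bugs when the prompt contains characters such as ?, & or +.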
Fetch with Abort Controller: The system uses the fetch API to initiate the connection to Lambda. Crucially, it attaches an AbortController. This allows the worker to immediately cancel the pending network request if:
- The user interrupts (speaks again).
- The WebSocket connection closes.
- A network error occurs.
If the fetch fails (non-200 status), a localized error message is sent to the user. Upon success, a ReadableStream reader is obtained to process the incoming Server-Sent Events.
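The cancellation wiring could be sketched as follows, assuming the controller and reader are stored on the session object as abortController and reader. startLambdaRequest is a hypothetical name, and fetchImpl is injectable only so the sketch can be exercised without a network.

```javascript
// Starts the SSE request and stores the AbortController on the session so an
// interrupt or socket close can cancel it. Resolves with the stream reader.
function startLambdaRequest(session, url, fetchImpl = globalThis.fetch) {
  const controller = new AbortController();
  session.abortController = controller;
  session.inLambda = true;
  return fetchImpl(url, {
    headers: { Accept: "text/event-stream" },
    signal: controller.signal, // aborting the controller rejects the fetch
  }).then((response) => {
    if (response.status !== 200 || !response.body) {
      session.inLambda = false;
      throw new Error(`Lambda request failed with status ${response.status}`);
    }
    session.reader = response.body.getReader();
    return session.reader;
  });
}
```

The caller is expected to catch the rejection, distinguish an abort from a real failure, and send the localized error message only in the latter case.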
7. Lambda Processing (askStream)
Lambda receives request
↓
askStream() extracts question and context
↓
Validates config_file with api_key
↓
Sets up SSE response stream
↓
Calls handleQuery() with stream
askStream() Processing: The askStream function first extracts the prompt and context parameters. It performs a security check by validating the provided API key against the configuration file. Once validated, it initializes the HTTP response stream with the proper headers for Server-Sent Events (Content-Type: text/event-stream). It then instantiates the chat context manager and hands off execution to the core handleQuery logic, which will write chunks directly to the open stream.
8. Query Handling (handleQuery + doRAG)
handleQuery() starts processing
↓
Sends SSE: { progress: "Analyzing..." }
↓
Performs preQualification (intent analysis)
↓
Sends SSE: { progress: "Rephrased Question: ..." }
↓
Calls doRAG() for retrieval
↓
Sends SSE: { progress: "Searching knowledge base..." }
↓
Retrieves from OpenSearch
↓
Sends SSE: { progress: "Selected 5 relevant documents" }
↓
Generates response with streaming
↓
Sends SSE: { chunk: "Our business" }
↓
Sends SSE: { chunk: " hours are" }
↓
Sends SSE: { chunk: " Monday-Friday" }
↓
Sends SSE: { chunk: " 9am-5pm" }
↓
Sends SSE: { answer: "Our business hours are Monday-Friday 9am-5pm" }
SSE Event Sequence: The protocol involves a sequence of JSON messages sent over the stream:
- Status: Initial connection confirmation (status: "connected").
- Progress: Updates on the AI's thought process (e.g., "Analyzing", "Searching knowledge base").
- Chunk: Pieces of the generated response text streamed token-by-token.
- Answer: The final complete text for logging and verification.
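To make the event sequence concrete, the sketch below writes it to any object exposing write() and end(), which is the shape of a Node.js writable stream. formatSseEvent and streamAnswer are hypothetical names; the real handler emits more progress events and obtains its chunks from the AI model.

```javascript
// Serializes one payload as a Server-Sent Event: a "data:" line plus a blank line.
function formatSseEvent(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Writes status, progress, chunk, and answer events in order, then closes the stream.
function streamAnswer(stream, chunks) {
  stream.write(formatSseEvent({ status: "connected" }));
  stream.write(formatSseEvent({ progress: "Analyzing..." }));
  let answer = "";
  for (const chunk of chunks) {
    answer += chunk; // accumulate so the final answer matches what was streamed
    stream.write(formatSseEvent({ chunk }));
  }
  stream.write(formatSseEvent({ answer }));
  stream.end();
  return answer;
}
```

The blank line after each data line is what lets the consumer split the stream on "\n\n" to recover individual events.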
9. SSE Stream Consumption (CloudFlare Worker)
processInput() reads from SSE stream
↓
Accumulates buffer
↓
Splits on '\n\n'
↓
Parses JSON events
↓
Categorizes: chunks, answers, context
↓
Sends to Twilio ConversationRelay
SSE Parsing Loop: The SSE parsing logic runs in a loop, reading chunks from the Lambda stream. It buffers and splits data into events, then processes them based on type: 'chunk' (intermediate text), 'answer' (final text), 'context' (updates), or 'progress' (system status). It handles sanitization, chunking adjustments for TTS, and checks for special signals like goodbye or live agent transfer.
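The buffering and splitting step can be sketched as a pure function that takes the accumulated buffer and returns the complete events plus the unparsed remainder. This is a simplified illustration with hypothetical names (parseSseBuffer, classifyEvent); the worker's real loop also performs sanitization and TTS chunking.

```javascript
// Splits the buffer on blank lines, parses each complete "data:" payload, and
// returns the trailing partial event so it can be prepended to the next read.
function parseSseBuffer(buffer) {
  const parts = buffer.split("\n\n");
  const rest = parts.pop(); // last piece is incomplete (or empty)
  const events = [];
  for (const part of parts) {
    for (const line of part.split("\n")) {
      if (!line.startsWith("data:")) continue;
      try {
        events.push(JSON.parse(line.slice(5).trim()));
      } catch (err) {
        console.warn("Skipping invalid SSE payload:", line); // log and continue
      }
    }
  }
  return { events, rest };
}

// Categorizes an event by the first known key it carries.
function classifyEvent(event) {
  if (event === null || typeof event !== "object") return "unknown";
  for (const type of ["chunk", "answer", "context", "progress"]) {
    if (type in event) return type;
  }
  return "unknown";
}
```

In the read loop, each decoded network chunk would be appended to rest before calling parseSseBuffer again, so an event split across two reads is still parsed exactly once.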
10. Text-to-Speech Playback
Twilio ConversationRelay receives text chunks
↓
Queues for TTS synthesis
↓
Streams audio to caller
↓
Monitors for user interrupt
ConversationRelay Processing:
- Receives text message from CloudFlare Worker
- Checks interruptible and preemptible flags:
  - interruptible: true → User can speak to interrupt
  - preemptible: true → This chunk can be skipped if new content arrives
  - preemptible: false → This chunk must be played completely
- Queues for TTS synthesis using configured provider (Google, Polly, Azure)
- Synthesizes audio chunk
- Streams audio to caller
- If last: false, waits for next chunk
- If last: true, finalizes and awaits next user input
TTS Configuration: The TTS settings (provider, voice, language) are defined in the initial TwiML configuration and respected throughout the session.
Audio Streaming:
- Sample rate: 8000 Hz (Twilio default for voice)
- Codec: μ-law (G.711)
- Latency: ~200-500ms from text to audio start
11. User Interrupt Handling
User speaks during AI response
↓
Twilio ConversationRelay detects interrupt
↓
Sends "interrupt" message
↓
CloudFlare Worker calls breakSession()
↓
Cancels ongoing Lambda SSE stream
↓
Waits for new prompt
Interrupt Message (from Twilio):
{
"type": "interrupt",
"timestamp": "2024-01-15T10:30:45.123Z"
}
CloudFlare Worker Processing: When the worker receives an interrupt message, it logs the event and immediately calls the breakSession() utility to halt all current activities.
breakSession() Implementation: The breakSession function handles the interruption in five steps:
- Sets Flags: Marks the session as 'breaking' to prevent new logic from starting.
- Cancels Backend: Aborts any pending fetch requests to the Lambda function.
- Cancels Stream: Cancels the active SSE reader to stop processing incoming chunks.
- Waits for Clear: Loops briefly to ensure the 'busy' state has been fully reset.
- Resets State: Returns the session to a clean IDLE state, ready for the new input.
Result:
- Lambda SSE stream is cancelled via AbortController
- CloudFlare Worker stops sending text to Twilio
- Session is ready for next user input
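The steps listed under breakSession() Implementation can be sketched as below. The session field names follow the state flags described in this document, while abortController, maxWaitMs, and pollMs are assumptions introduced for the example.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sketch of interrupt handling: flag the break, cancel backend work, wait for
// the processing loop to notice, then reset the session to idle.
async function breakSession(session, { maxWaitMs = 2000, pollMs = 25 } = {}) {
  session.break = true; // tells the processing loop to stop sending text
  if (session.abortController) session.abortController.abort(); // cancel the fetch
  if (session.reader) {
    try {
      await session.reader.cancel(); // stop reading SSE chunks
    } catch {
      // the stream may already be closed; nothing to do
    }
    session.reader = null;
  }
  // Wait briefly for the processing loop to clear the busy flag.
  const deadline = Date.now() + maxWaitMs;
  while (session.busy && Date.now() < deadline) await sleep(pollMs);
  session.busy = false;
  session.inLambda = false;
  session.break = false;
  return session;
}
```

The bounded wait is a design choice in this sketch: it prevents a stuck processing loop from blocking the next prompt indefinitely.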
12. Session Termination
Goodbye Detection:
Assistant response contains goodbye phrase
↓
isGoodbyeResponse() returns true
↓
CloudFlare sends "end" message with handoffData
↓
Twilio ConversationRelay ends call
Goodbye Patterns: The system uses regex pattern matching on the assistant's response to detect if the conversation should end ("goodbye"). If triggered, it sends a termination signal with the appropriate reason code.
End Message:
{
"type": "end",
"handoffData": "{\"reasonCode\":\"goodbye\",\"reason\":\"The assistant said goodbye\",\"startTime\":\"2024-01-15T10:25:30.000Z\"}"
}
Live Agent Transfer:
Assistant response indicates live agent needed
↓
isLiveAgentRequest() returns true
↓
CloudFlare sends "end" message with live_agent_request
↓
Twilio ConversationRelay triggers /v1/twilioVoiceAction
↓
twilioVoiceAction() Lambda handles handoff
Live Agent Patterns: Similarly, it checks if the AI suggests transferring to a live agent. If triggered, it sends a termination signal with the live_agent_request reason code.
Live Agent End Message:
{
"type": "end",
"handoffData": "{\"reasonCode\":\"live_agent_request\",\"reason\":\"The assistant determined a live agent is needed\",\"startTime\":\"2024-01-15T10:25:30.000Z\"}"
}
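Both end messages share one shape, so they could be produced by a single helper along these lines. The two regexes are illustrative stand-ins for isGoodbyeResponse() and isLiveAgentRequest(), whose actual patterns are not shown in this document, and buildEndMessage is a hypothetical name.

```javascript
// Illustrative patterns only; the worker's real detectors are more extensive.
const GOODBYE_RE = /\b(goodbye|bye for now|have a (great|nice) day)\b/i;
const LIVE_AGENT_RE = /\b(transfer you|connect you (to|with) (a|an) (live )?(agent|representative))\b/i;

// Returns the "end" message for a response that should terminate the session,
// or null when the conversation should continue.
function buildEndMessage(responseText, startTime) {
  let reasonCode;
  let reason;
  if (LIVE_AGENT_RE.test(responseText)) {
    reasonCode = "live_agent_request";
    reason = "The assistant determined a live agent is needed";
  } else if (GOODBYE_RE.test(responseText)) {
    reasonCode = "goodbye";
    reason = "The assistant said goodbye";
  } else {
    return null;
  }
  // handoffData is sent as a JSON string, not a nested object.
  return { type: "end", handoffData: JSON.stringify({ reasonCode, reason, startTime }) };
}
```

The transfer check runs first in this sketch so that a response such as "Let me transfer you, goodbye" is treated as a handoff rather than a hang-up.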
Visual Flow Diagram
┌─────────────────────────────────────────────────────────────────────────┐
│ INCOMING PHONE CALL │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ TWILIO → POST /v1/twilioVoice │
│ • CallSid, From, To, CallStatus │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ LAMBDA: twilioVoice() │
│ • Validate Twilio signature │
│ • Load config.v-{phone}.json │
│ • Create chat context manager │
│ • Return TwiML with ConversationRelay │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ TwiML RESPONSE │
│ <ConversationRelay │
│ url="wss://stream.instantaiguru.com/ws" │
│ welcomeGreeting="Hello!" │
│ language="en-US" │
│ ttsProvider="google" │
│ voice="en-US-Neural2-H"> │
│ <Parameter name="user_email" value="..."/> │
│ <Parameter name="thread_id" value="..."/> │
│ <Parameter name="config_file" value="..."/> │
│ <Parameter name="api_key" value="..."/> │
│ </ConversationRelay> │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ TWILIO CONVERSATIONRELAY │
│ WebSocket Connection → wss://stream.instantaiguru.com/ws │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CLOUDFLARE WORKER: handleSession() │
│ • Accept WebSocket │
│ • Listen for messages │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MESSAGE TYPE: "setup" │
│ • Extract customParameters from TwiML │
│ • Store context: {user_email, thread_id, config_file, api_key} │
│ • Initialize session state │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ USER SPEAKS → Twilio STT → voicePrompt │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MESSAGE TYPE: "prompt" │
│ { "type": "prompt", "voicePrompt": "What are your hours?" } │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CLOUDFLARE WORKER: processInput() │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ PREPROCESSING (Edge) │ │
│ │ • Try AI preprocessing (fast model) │ │
│ │ • Try static pattern matching │ │
│ │ • Intents: greeting, farewell, simple Q&A │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │ │ │
│ Matched? Not Matched │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────────┐ ┌───────────────────────────────┐ │
│ │ Send Immediate Response │ │ Forward to Lambda │ │
│ │ • ws.send({type:"text",...}) │ │ • Encode question & context │ │
│ │ • Check goodbye/live agent │ │ • Fetch SSE stream │ │
│ │ • Send "end" if needed │ │ • Open ReadableStream │ │
│ └────────────────────────────────┘ └───────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ LAMBDA: askStream() │
│ • Extract question and context from query params │
│ • Validate config_file with api_key │
│ • Set up SSE response stream │
│ • stream.write("data: {\"status\":\"connected\"}\n\n") │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────────┐
│ LAMBDA: handleQuery() │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ SSE: {"progress": "Analyzing..."} │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ preQualification() │ │
│ │ • Intent analysis │ │
│ │ • Question rephrasing │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ SSE: {"progress": "Rephrased Question: ..."} │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ doRAG() │ │
│ │ • Query OpenSearch with embeddings │ │
│ │ • Retrieve top-k relevant documents │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ SSE: {"progress": "Selected 5 relevant documents"} │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ generateStreamingResponse() │ │
│ │ • Build RAG context │ │
│ │ • Call AI model (OpenAI, Bedrock, Google) │ │
│ │ • Stream chunks as they arrive │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ SSE: {"chunk": "Our business"} │ │
│ │ SSE: {"chunk": " hours are"} │ │
│ │ SSE: {"chunk": " Monday-Friday"} │ │
│ │ SSE: {"chunk": " 9am-5pm"} │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ SSE: {"answer": "Our business hours are Monday-Friday 9am-5pm"} │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ ┌───────────────────────────────────────────────────────────────────┐ │
│ │ stream.end() │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CLOUDFLARE WORKER: processInput() SSE Parsing │
│ • Read from stream.body.getReader() │
│ • Accumulate buffer, split on "\n\n" │
│ • Parse JSON events: chunk, answer, context, progress │
│ • Send to Twilio ConversationRelay │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CHUNK EVENTS → Relay to Twilio │
│ ws.send({ │
│ type: "text", │
│ token: "Our business", │
│ last: false, │
│ interruptible: true, │
│ preemptible: false │
│ }) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ ANSWER EVENT → Final Response │
│ ws.send({ │
│ type: "text", │
│ token: "", // or full text if no chunks sent │
│ last: true │
│ }) │
│ │
│ Check for goodbye or live agent transfer │
│ If detected: │
│ ws.send({ │
│ type: "end", │
│ handoffData: JSON.stringify({ │
│ reasonCode: "goodbye" | "live_agent_request", │
│ reason: "...", │
│ startTime: "..." │
│ }) │
│ }) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ TWILIO CONVERSATIONRELAY │
│ • Queue text chunks for TTS │
│ • Synthesize audio (Google TTS) │
│ • Stream audio to caller │
│ • Monitor for user interrupt │
│ • If "end" message: terminate call or handoff │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ CALLER HEARS AI RESPONSE │
│ "Our business hours are Monday through Friday from 9 AM to 5 PM." │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────┴─────────────────────────────┐
│ │
▼ ▼
┌───────────────────┐ ┌─────────────────────┐
│ User Speaks Again │ │ User Says Goodbye │
│ (loop to prompt) │ │ or Requests Agent │
└───────────────────┘ └─────────────────────┘
│
▼
┌─────────────────────┐
│ Call Ends or │
│ Transfers to Agent │
└─────────────────────┘
Key Technical Details
1. Multi-Language Support
WebSocket Routing:
- /ws → English production
- /ws-dev → English development
- /ws-es → Spanish production
- /ws-es-dev → Spanish development
Language-Specific Processing: The system uses the ws.language session property to select the appropriate preprocessing functions, goodbye detectors, and intent classifiers. It also localizes error messages sent to the user (e.g., "Sorry, there was a connection error" vs. "Lo siento, hubo un error de conexión").
Spanish Functions: Specialized regex functions (isGoodbyeResponse_ES, isLiveAgentRequest_ES) are implemented to detect Spanish phrases for termination or transfer (e.g., "Adiós", "Déjame transferirte").
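A compact way to express the route table and the localized error message is sketched below. ROUTES, CONNECTION_ERROR, resolveRoute, and the "en"/"es" language codes are hypothetical choices made for this example.

```javascript
// Maps each WebSocket path to the language and environment it implies.
const ROUTES = {
  "/ws": { language: "en", dev: false },
  "/ws-dev": { language: "en", dev: true },
  "/ws-es": { language: "es", dev: false },
  "/ws-es-dev": { language: "es", dev: true },
};

// Localized connection error messages, keyed by language code.
const CONNECTION_ERROR = {
  en: "Sorry, there was a connection error",
  es: "Lo siento, hubo un error de conexión",
};

// Returns the route settings for a request URL, or null for unknown paths.
function resolveRoute(requestUrl) {
  const { pathname } = new URL(requestUrl);
  return ROUTES[pathname] || null;
}
```

The resolved language would then be stored on the session (ws.language) so later steps can pick the matching detectors and messages.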
2. DTMF and Numeric Input
Digit Accumulation: The system handles DTMF input by accumulating digits arriving via WebSocket.
- End Marker: If the user presses #, the accumulated sequence is immediately processed.
- Max Digits: If the count matches ws.context.digits.max, it auto-submits.
- Context: Minimum and maximum digit constraints can be set dynamically via the session context.
- Voice Digits: The system also extracts digits spoken by the user (e.g., "one two three") from the voice prompt if a numeric input is expected.
Context Configuration: Numeric input constraints (min/max digits) can be configured via session parameters.
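The accumulation rules and spoken-digit extraction might be implemented along these lines. This is a sketch under assumptions: digitBuffer, handleDigit, and extractSpokenDigits are hypothetical names, and the word list covers only English digit words.

```javascript
// Adds one DTMF key press to the buffer. Returns the completed digit string
// when "#" is pressed or the configured maximum is reached, otherwise null.
function handleDigit(session, digit) {
  const limits = (session.context && session.context.digits) || {};
  const max = limits.max ?? Infinity;
  session.digitBuffer = session.digitBuffer || "";
  if (digit !== "#") session.digitBuffer += digit;
  if (digit === "#" || session.digitBuffer.length >= max) {
    const submitted = session.digitBuffer;
    session.digitBuffer = ""; // reset for the next numeric input
    return submitted;
  }
  return null;
}

// A Map avoids accidental matches on inherited object properties.
const DIGIT_WORDS = new Map(Object.entries({
  zero: "0", oh: "0", one: "1", two: "2", three: "3", four: "4",
  five: "5", six: "6", seven: "7", eight: "8", nine: "9",
}));

// Extracts digits from a transcribed utterance such as "one two three".
function extractSpokenDigits(voicePrompt) {
  return String(voicePrompt)
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .map((word) => (/^\d+$/.test(word) ? word : DIGIT_WORDS.get(word) ?? ""))
    .join("");
}
```

Enforcing the minimum digit count is left to the caller in this sketch, since the document does not say how a too-short sequence is handled.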
3. TTS Optimization
Response Sanitization & Chunking: Response text is split into chunks (max 256 characters) to optimize for TTS playback latency. This ensures that the telephony provider can start speaking the first part of a long sentence while the rest is being processed or transmitted. Additionally, text is sanitized to remove URLs, email addresses, and special characters that might cause TTS errors.
Sending Chunked Response: The system iterates through the text chunks, sending them to Twilio with the interruptible: true flag. This allows the user to cut in at any point. Small delays are inserted between chunks to prevent network jitter from causing out-of-order playback.
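A simplified version of the sanitize, chunk, and send pipeline is sketched below. The regexes are deliberately basic approximations, the function names are hypothetical, and the small inter-chunk delay mentioned above is omitted for brevity.

```javascript
const MAX_CHUNK = 256; // maximum characters per TTS chunk, as described above

// Removes URLs, email addresses, and markup-like characters, then collapses whitespace.
function sanitizeForTts(text) {
  return String(text)
    .replace(/https?:\/\/\S+/gi, " ")
    .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, " ")
    .replace(/[*_#~<>|\\^{}\[\]]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Greedily packs whole words into chunks no longer than maxLen characters.
function chunkForTts(text, maxLen = MAX_CHUNK) {
  const chunks = [];
  let current = "";
  for (let word of text.split(/\s+/).filter(Boolean)) {
    while (word.length > maxLen) { // hard-split words longer than a chunk
      if (current) { chunks.push(current); current = ""; }
      chunks.push(word.slice(0, maxLen));
      word = word.slice(maxLen);
    }
    const candidate = current ? `${current} ${word}` : word;
    if (candidate.length > maxLen) {
      chunks.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Wraps each chunk in a ConversationRelay text message; only the final one has last: true.
function toTextMessages(chunks) {
  return chunks.map((token, i) => ({
    type: "text",
    token,
    last: i === chunks.length - 1,
    interruptible: true,
  }));
}
```

Splitting on word boundaries rather than at a fixed character offset keeps the TTS engine from mispronouncing a word that was cut in half.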
4. Session State Management
State Flags: The system tracks the session's lifecycle using properties on the WebSocket object:
- busy: Prevents concurrent input processing.
- break: Signals an active interruption.
- inLambda: Indicates wait-time for backend generation.
- chunksSent: Tracks progress of the current response.
- fullResponse: Accumulates the complete answer for logging.
State Transitions: The session moves from IDLE to BUSY when input arrives. During BUSY, it may transition to IN_LAMBDA while fetching. Once the answer is complete (or interrupted), it returns to IDLE.
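The transitions can be pictured with a small sketch in which the named state is derived from the flags rather than stored separately. The helper names (createSessionState, currentState, beginInput, finishInput) are hypothetical.

```javascript
// Initial flag values for a fresh session.
function createSessionState() {
  return { busy: false, break: false, inLambda: false, chunksSent: 0, fullResponse: "" };
}

// Derives the named state from the flags.
function currentState(session) {
  if (session.inLambda) return "IN_LAMBDA";
  if (session.busy) return "BUSY";
  return "IDLE";
}

// IDLE -> BUSY. Returns false if input is already being processed.
function beginInput(session) {
  if (session.busy) return false;
  session.busy = true;
  session.break = false;
  session.chunksSent = 0;
  session.fullResponse = "";
  return true;
}

// BUSY or IN_LAMBDA -> IDLE, after the answer completes or is interrupted.
function finishInput(session) {
  session.busy = false;
  session.inLambda = false;
}
```

Deriving the state from the flags keeps a single source of truth, so the flags and the reported state cannot drift apart.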
5. Error Handling
Lambda Request Errors: When the Lambda function returns a non-200 status or a missing body, the system sends a localized error message (English or Spanish) to the user via TTS and resets the session state flags (inLambda, busy).
SSE Parsing Errors: Incoming SSE data lines are parsed safely; invalid JSON is logged and skipped to prevent session crashes.
WebSocket Errors: The worker listens for WebSocket errors. If the error is related to a closed connection or network loss, it is ignored to prevent log spam. Errors on open sessions are logged for debugging.
Global Error Handlers: Global event listeners are attached to the CloudFlare Worker environment to catch unhandled exceptions and rejections. Instead of letting the instance crash unexpectedly, they log the error context.
6. Performance Optimizations
Pre-compiled Regex Patterns: To ensure minimal latency at the edge, heavy regex patterns (for question detection or intent matching) are compiled once at the module scope rather than per-request.
Batched SSE Processing: Network chunks are naturally batched. The system buffers incomplete events and processes complete JSON payloads from the SSE stream in batches to optimize network throughput.
Response Streaming: The Lambda function streams AI-generated content chunks immediately to the response stream as Server-Sent Events (SSE), ensuring low latency without buffering.
CloudFlare Edge Locations:
- Global edge network (275+ cities)
- Low latency to both Twilio and AWS Lambda
- WebSocket termination at edge
- Reduces round-trip time for preprocessing
Configuration Files
Phone Number Configuration
Each specific phone number has a JSON configuration file defining:
- Identity: API keys, assistant name, and display name.
- Voice Settings: TTS vendor (e.g., Google, Azure), voice ID, and greeting message.
- AI Settings: Preferred model (e.g., GPT-4o), RAG settings (enabled/disabled, OpenSearch indexes), and system instructions.
- Workflows: Logical triggers and steps for specific tasks like appointment scheduling.
Deployment
CloudFlare Worker Deployment
The CloudFlare Worker (ConversationRelay) is deployed globally. It is configured with environment variables to point to the correct AWS Lambda backend URLs (Function URLs) for both production and development environments.
Lambda Deployment
The AWS backend is deployed using the Serverless Framework. Key functions include:
- twilioVoice: Handles the initial webhook from Twilio to generate TwiML.
- askStream: Handles the SSE streaming requests for AI processing.
Both are configured with appropriate timeouts (e.g., 30s) and CORS settings.
Monitoring and Debugging
CloudFlare Worker Logs
Log monitoring provides real-time visibility into the edge connection. Key events tracked include:
- Call setup and parameter extraction (Caller ID, Language).
- AI Preprocessing decisions (Response vs. No Response).
- Backend Fetch initiation and status.
- WebSocket session events (Open, Close, Error).
Lambda Logs (CloudWatch)
Backend logs provide detailed execution traces for the AI logic:
- Configuration loading and API key validation.
- Query understanding and rephrasing.
- RAG document selection specifics.
- Streaming response generation milestones.
DynamoDB Logging
The system maintains granular logs in DynamoDB for auditing and analysis:
- logging-{stage}: Detailed request/response pairs.
- chats-{stage}: Full chat transcripts for analysis.
- voice-calls-{stage}: Metadata about voice sessions including duration and status.
Troubleshooting
Issue: WebSocket Connection Fails
Symptoms:
- ConversationRelay can't connect to CloudFlare Worker
- Immediate call termination
Causes:
- CloudFlare security settings blocking WebSocket
- Invalid WebSocket URL in TwiML
- Worker not deployed or crashed
Solutions:
- Verify Deployment: Check that the CloudFlare Worker is active and the wrangler deployment logs show success.
- Check Logs: Use wrangler tail to view real-time errors in the worker.
- Verify CloudFlare Settings:
  - Under Scrape Shield: Disable "Email Address Obfuscation"
  - Under Security: Set WAF to "Low" (if necessary)
  - Under Network: Enable WebSockets
Issue: No Response from AI
Symptoms:
- User speaks but gets no response
- Timeout after 30 seconds
Causes:
- Lambda timeout (30s max)
- OpenSearch query timeout
- AI model throttling
Solutions:
- Timeouts: Verify Lambda timeout settings are sufficient (e.g., >30s).
- Infrastructure: Verify OpenSearch connectivity and AI provider API status/quotas.
Issue: TTS Error 64111
Symptoms:
- Twilio error: "Unable to synthesize text"
- Call drops after AI response
Cause:
- Response contains URLs or special characters that TTS can't handle.
Solution: The system automatically sanitizes text before sending it to TTS, stripping URLs, emails, and replacing special characters.
Issue: User Interrupt Not Working
Symptoms:
- User speaks but AI continues talking
- No interrupt detection
Causes:
- interruptible flag not set on response chunks.
- Session break logic failing to cancel backend requests.
Solutions: Ensure the worker sets the interruptible: true flag on the text chunks it sends to Twilio, and that the WebSocket listener correctly triggers the session cancellation routine (breakSession) when an interrupt message arrives.
Summary
This Twilio ConversationRelay implementation provides a production-ready, scalable voice AI system with the following key characteristics:
Architecture:
- 3-layer design: Twilio → CloudFlare → Lambda
- Edge-based preprocessing for low latency
- Server-Sent Events for streaming responses
- Real-time interrupt handling
Performance:
- Global edge presence via CloudFlare (275+ cities)
- Sub-second response times for simple queries
- Streaming AI responses for natural conversation flow
- Optimized TTS chunking (max 256 characters) so playback can start sooner
Features:
- Multi-language support (English, Spanish, extensible)
- DTMF and voice digit collection
- Live agent transfer capability
- Goodbye detection and call termination
- Configurable per phone number
- Workflow engine for complex interactions
- RAG retrieval from OpenSearch knowledge base
Reliability:
- Comprehensive error handling at all layers
- Automatic retry and fallback logic
- Session state management with graceful cleanup
- Abort controllers for request cancellation
- Dead session timeout detection
Monitoring:
- CloudFlare Worker logs (real-time)
- Lambda CloudWatch logs
- DynamoDB logging for analytics
- Detailed debug logging throughout
This architecture provides a robust foundation for building sophisticated voice AI applications with enterprise-grade reliability and performance.