Voice Agent
Build real-time voice conversations with WebSocket-based bidirectional audio streaming, supporting multiple languages and intelligent interruption handling.
Overview
The Voice Agent API enables real-time voice conversations through WebSocket connections. It combines speech-to-text, language models, and text-to-speech into a seamless conversational experience with support for interruptions, translation, and multi-language conversations.
Real-Time
Bidirectional audio streaming with low latency for natural conversations.
Multi-Language
Auto-detect languages and translate conversations in real-time.
Interruption
Intelligent interruption handling for natural conversation flow.
WebSocket Connection
Connect to the Voice Agent API over WebSocket at wss://api.selamgpt.com, using the namespace /v1/audio/agent. The examples on this page use the Socket.IO client.
Connection Requirements
token: JWT Bearer token (required)
api_key: API key for authentication (required)
voice: Voice ID for the agent (required)

import { io } from 'socket.io-client';

// Connect to Voice Agent with namespace
const socket = io('wss://api.selamgpt.com/v1/audio/agent', {
  query: {
    token: 'your-jwt-token',
    api_key: 'your-api-key',
    voice: 'mekdes' // Amharic voice
  },
  transports: ['websocket']
});

// Handle connection
socket.on('connected', (data) => {
  console.log('Connected:', data);
  // { status: 'connected', sid: '...', voice: 'mekdes', message: '...' }

  // Start session
  socket.emit('start_session', {
    voice: 'mekdes',
    language: 'am',
    // Note: For Amharic, use mekdes/ameha/betty. GPT Audio voices only support English.
    instructions: 'You are a helpful voice assistant.'
  });
});

// Handle session started
socket.on('session_started', (data) => {
  console.log('Session started:', data);
  // Now you can send audio chunks
});

// Handle errors
socket.on('error', (error) => {
  console.error('Error:', error);
});

Audio Format Requirements
Voice Agent requires specific audio formats for optimal performance:
Format: PCM16 (16-bit PCM audio)
Sample Rate: 16 kHz (16,000 samples/second)
Channels: Mono (single-channel audio)
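The Web Audio API delivers samples as 32-bit floats in the range [-1, 1], so they need converting before they match the format above. The following is a minimal sketch of that conversion, not part of the official SDK. The helper names (floatToPcm16Bytes, bytesToBase64, pcm16ChunkBytes) are hypothetical, and little-endian byte order is an assumption, since the page does not state the expected byte order.

```javascript
// Convert Float32 samples in [-1, 1] to 16-bit PCM bytes.
// Assumption: the server expects little-endian samples.
function floatToPcm16Bytes(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range input
    const v = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
    view.setInt16(i * 2, v, true); // true = little-endian
  }
  return bytes;
}

// Base64-encode raw bytes so they can travel in a JSON event payload.
function bytesToBase64(bytes) {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return typeof btoa === 'function'
    ? btoa(binary)
    : Buffer.from(binary, 'binary').toString('base64');
}

// Size in bytes of a mono PCM16 chunk of the given duration.
function pcm16ChunkBytes(durationMs, sampleRate = 16000) {
  return Math.round((sampleRate * durationMs) / 1000) * 2;
}
```

At 16 kHz mono, a 100 ms chunk holds 1,600 samples, which is 3,200 bytes before base64 encoding.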
Session Management
Voice Agent sessions follow a structured lifecycle with clear events and states:
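One way to track this lifecycle on the client is a small transition table keyed by the server events used on this page (connected, session_started, session_ended). This is only a sketch: the state names (disconnected, connected, active) and the nextState helper are hypothetical, and it assumes the socket stays open after a session ends.

```javascript
// Hypothetical client-side states; event names come from this page.
const TRANSITIONS = {
  disconnected: { connected: 'connected' },
  connected: { session_started: 'active' },
  active: { session_ended: 'connected' }
};

// Return the state that follows `event`, or the current state
// when the event is not valid in that state.
function nextState(state, event) {
  const table = TRANSITIONS[state];
  if (table && Object.prototype.hasOwnProperty.call(table, event)) {
    return table[event];
  }
  return state;
}

// Usage with a Socket.IO client instance named `socket`:
// let state = 'disconnected';
// ['connected', 'session_started', 'session_ended'].forEach((event) => {
//   socket.on(event, () => { state = nextState(state, event); });
// });
```

Guarding audio capture on the active state avoids sending audio_chunk events before a session exists.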
1. Start Session
Initialize a new voice conversation session. For Ethiopian languages, use native voices (mekdes, ameha, betty for Amharic). GPT Audio voices (alloy, echo, etc.) only support English:
socket.emit('start_session', {
  voice: 'mekdes', // Amharic voice (required)
  language: 'am', // Language code (optional, auto-detected)
  instructions: 'You are a helpful assistant.', // System prompt (optional)
  vad_threshold: 600, // Voice activity threshold (default: 600)
  interrupt_threshold: 8 // Interruption sensitivity (default: 8)
  // Note: Use mekdes/ameha/betty for Amharic, ubax/muuse for Somali, etc.
});

2. Send Audio
Stream audio chunks to the agent:
// Capture audio from the microphone
navigator.mediaDevices.getUserMedia({ audio: true })
  .then(stream => {
    const mediaRecorder = new MediaRecorder(stream);

    // Note: MediaRecorder usually emits a compressed container format
    // (for example WebM/Opus), not raw PCM. Convert the audio to
    // 16 kHz mono PCM16 before sending if the server rejects it.
    mediaRecorder.ondataavailable = (event) => {
      const reader = new FileReader();
      reader.onload = () => {
        // Strip the "data:...;base64," prefix to keep only the base64 payload
        const audioData = reader.result.split(',')[1];

        socket.emit('audio_chunk', {
          audio: audioData,
          format: 'pcm16',
          sample_rate: 16000
        });
      };
      reader.readAsDataURL(event.data);
    };

    mediaRecorder.start(100); // Send chunks every 100ms
  })
  .catch(err => console.error('Microphone access failed:', err));

3. Receive Responses
Listen for transcriptions, text responses, and audio:
// Transcription of user speech
socket.on('transcription', (data) => {
  console.log('User said:', data.text);
  // { text: 'Hello, how are you?', is_final: true }
});

// AI text response
socket.on('response', (data) => {
  console.log('AI response:', data.text);
  // { text: 'I am doing well, thank you!' }
});

// Audio response chunks
socket.on('audio_chunk', (data) => {
  const audioData = atob(data.audio); // Decode base64 into a binary string
  // Play the chunk; playAudioChunk is your own playback function
  playAudioChunk(audioData);
});

// Audio streaming complete
socket.on('audio_end', (data) => {
  console.log('Audio complete:', data.total_chunks);
});

4. End Session
Gracefully close the session:
socket.emit('end_session');

socket.on('session_ended', (data) => {
  console.log('Session ended:', data);
  // { session_id: '...', status: 'ended' }
});

Interruption Handling
Voice Agent supports intelligent interruption handling for natural conversations. Users can interrupt the AI while it's speaking, just like in human conversations.
Control Actions
interrupt: Stop the current speech immediately
start_listening: Resume voice detection
stop_listening: Pause voice detection
pause: Pause the session
resume: Resume the session
// Interrupt current speech
socket.emit('control', { action: 'interrupt' });

// Pause listening
socket.emit('control', { action: 'stop_listening' });

// Resume listening
socket.emit('control', { action: 'start_listening' });

// Handle control acknowledgment
socket.on('control_ack', (data) => {
  console.log('Control executed:', data);
  // { action: 'interrupt', status: 'executed' }
});

Tip
The interrupt_threshold parameter controls how sensitive interruption detection is. Lower values (1-5) make it easier to interrupt, while higher values (10-15) require more confident speech.

Translation Support
Voice Agent supports real-time translation with automatic language detection. Speak in any supported language and get responses in your preferred language.
Auto-Detection Mode
Automatically detect the user's language and respond in the same language. Use native voices for each language.
socket.emit('start_session', {
  voice: 'mekdes', // Amharic voice
  // No language specified - auto-detect
  // Will detect and respond in Amharic
});

Target Language Mode
Specify a target language for responses. Use the appropriate native voice for each language.
socket.emit('start_session', {
  voice: 'mekdes', // Amharic voice
  language: 'am', // Respond in Amharic
  // Use mekdes/ameha/betty for Amharic
});

Supported Languages
en: English
am: Amharic
so: Somali
ti: Tigrinya
om: Oromo
aa: Afar

Access Requirements
Voice Agent is currently in limited access for select developers:
Current Access Status
Limited Access Program
Available to approved developers with unlimited usage
Authentication Required
Both JWT token and API key authentication
No Rate Limits
Unlimited concurrent sessions and requests
Complete Example
Here's a complete example of a Voice Agent implementation:
import { io } from 'socket.io-client';

class VoiceAgent {
  constructor(token, apiKey) {
    this.token = token;
    this.apiKey = apiKey;
    this.voice = null;
    this.socket = null;
    this.mediaRecorder = null;
    this.audioContext = null;
  }

  connect(voice = 'mekdes') {
    // Remember the voice so startSession() uses the same one
    this.voice = voice;
    this.socket = io('wss://api.selamgpt.com/v1/audio/agent', {
      query: {
        token: this.token,
        api_key: this.apiKey,
        voice: voice
      },
      transports: ['websocket']
    });

    this.setupEventHandlers();
  }

  setupEventHandlers() {
    this.socket.on('connected', (data) => {
      console.log('Connected:', data);
      this.startSession();
    });

    this.socket.on('session_started', (data) => {
      console.log('Session started:', data);
      this.startAudioCapture();
    });

    this.socket.on('transcription', (data) => {
      console.log('User:', data.text);
    });

    this.socket.on('response', (data) => {
      console.log('AI:', data.text);
    });

    this.socket.on('audio_chunk', (data) => {
      this.playAudioChunk(data.audio);
    });

    this.socket.on('error', (error) => {
      console.error('Error:', error);
    });
  }

  startSession() {
    this.socket.emit('start_session', {
      voice: this.voice,
      language: 'am',
      instructions: 'You are a helpful voice assistant.',
      vad_threshold: 600,
      interrupt_threshold: 8
      // Note: Use mekdes for Amharic, ubax for Somali, alloy for English
    });
  }

  async startAudioCapture() {
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        sampleRate: 16000,
        channelCount: 1,
        echoCancellation: true,
        noiseSuppression: true
      }
    });

    // Note: MediaRecorder usually emits a compressed container format
    // (for example WebM/Opus), not raw PCM. Convert to PCM16 first if
    // the server rejects the chunks.
    this.mediaRecorder = new MediaRecorder(stream);

    this.mediaRecorder.ondataavailable = (event) => {
      const reader = new FileReader();
      reader.onload = () => {
        // Keep only the base64 payload of the data URL
        const audioData = reader.result.split(',')[1];
        this.socket.emit('audio_chunk', {
          audio: audioData,
          format: 'pcm16',
          sample_rate: 16000
        });
      };
      reader.readAsDataURL(event.data);
    };

    this.mediaRecorder.start(100); // 100ms chunks
  }

  playAudioChunk(base64Audio) {
    // Decode base64 into raw bytes
    const audioData = atob(base64Audio);
    const audioArray = new Uint8Array(audioData.length);
    for (let i = 0; i < audioData.length; i++) {
      audioArray[i] = audioData.charCodeAt(i);
    }

    // Create an audio blob and play it
    const blob = new Blob([audioArray], { type: 'audio/mp3' });
    const url = URL.createObjectURL(blob);
    const audio = new Audio(url);
    audio.onended = () => URL.revokeObjectURL(url); // Release the blob URL
    audio.play();
  }

  interrupt() {
    this.socket.emit('control', { action: 'interrupt' });
  }

  endSession() {
    if (this.mediaRecorder) {
      this.mediaRecorder.stop();
    }
    this.socket.emit('end_session');
  }

  disconnect() {
    if (this.socket) {
      this.socket.disconnect();
    }
  }
}

// Usage
const agent = new VoiceAgent('your-jwt-token', 'your-api-key');
agent.connect('mekdes'); // Use Amharic voice

Best Practices
Audio Quality
- Use a high-quality microphone with noise cancellation
- Enable echo cancellation in audio capture settings
- Send audio chunks at consistent intervals (100-200ms)
- Maintain a 16 kHz sample rate for optimal recognition
Connection Management
- Implement reconnection logic for network interruptions
- Handle connection errors gracefully with user feedback
- Always end sessions properly to free resources
- Monitor connection status and notify users of issues
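A common shape for reconnection logic is exponential backoff with a cap. The sketch below is illustrative only: backoffDelayMs and scheduleReconnect are hypothetical helpers, and the 500 ms base and 30 s cap are assumed values, not documented recommendations.

```javascript
// Delay before reconnect attempt number `attempt` (0-based):
// doubles each time, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Schedule `connect` after the backoff delay. `setTimer` is injectable
// so the scheduling can be exercised without real timers.
function scheduleReconnect(connect, attempt, setTimer = setTimeout) {
  return setTimer(connect, backoffDelayMs(attempt));
}

// Usage with the VoiceAgent class from the complete example:
// let attempt = 0;
// socket.on('disconnect', () => {
//   scheduleReconnect(() => agent.connect('mekdes'), attempt++);
// });
```

The Socket.IO client also has built-in reconnection options (reconnection, reconnectionDelay, reconnectionDelayMax), which may be enough on their own. Whether a session survives a reconnect is not documented here, so it is safest to assume a new start_session is needed.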
Conversation Design
- Provide clear system instructions for consistent behavior
- Use an appropriate VAD threshold for your environment
- Adjust the interrupt threshold based on your use case
- Choose voices that match your target language and audience
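Voice selection can be kept in one place with a small lookup table. This sketch lists only the voices named on this page; the pickVoice helper is hypothetical. Voices for Tigrinya, Oromo, and Afar are not named here, so the lookup fails for them rather than guessing.

```javascript
// Voices named in this documentation, keyed by language code.
const VOICES_BY_LANGUAGE = {
  am: ['mekdes', 'ameha', 'betty'], // Amharic
  so: ['ubax', 'muuse'],            // Somali
  en: ['alloy', 'echo']             // English (GPT Audio voices)
};

// Return the first listed voice for a language code,
// or throw when no voice is known for it.
function pickVoice(language) {
  if (!Object.prototype.hasOwnProperty.call(VOICES_BY_LANGUAGE, language)) {
    throw new Error(`No known voice for language "${language}"`);
  }
  return VOICES_BY_LANGUAGE[language][0];
}

// Usage:
// socket.emit('start_session', { voice: pickVoice('am'), language: 'am' });
```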
Security
- Never expose JWT tokens or API keys in client-side code
- Use secure token exchange mechanisms
- Implement proper session timeout handling
- Validate all user inputs before sending to the agent
Performance
- Buffer audio playback to prevent choppy output
- Use Web Workers for audio processing to avoid blocking the UI
- Implement audio queue management for smooth playback
- Monitor latency and adjust chunk sizes if needed
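Audio queue management can be sketched as a FIFO that plays one chunk at a time. The PlaybackQueue class below is a hypothetical helper, not part of the API. It assumes you supply a playChunk function that returns a promise resolving when the chunk has finished playing.

```javascript
// Plays queued chunks strictly in arrival order, one at a time.
class PlaybackQueue {
  constructor(playChunk) {
    this.playChunk = playChunk; // async (chunk) => resolves when playback ends
    this.queue = [];
    this.draining = false;
  }

  // Add a chunk and start draining if nothing is playing.
  enqueue(chunk) {
    this.queue.push(chunk);
    if (!this.draining) {
      this.drain().catch((err) => console.error('Playback failed:', err));
    }
  }

  // Play chunks until the queue is empty.
  async drain() {
    this.draining = true;
    try {
      while (this.queue.length > 0) {
        await this.playChunk(this.queue.shift());
      }
    } finally {
      this.draining = false;
    }
  }

  // Drop chunks that have not started playing, for example on interrupt.
  clear() {
    this.queue.length = 0;
  }
}
```

Calling clear() when sending the interrupt control action keeps already-received audio from playing after the user starts speaking.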
Error Handling
Handle common errors gracefully to provide a smooth user experience:
authentication_error: Invalid or missing JWT token or API key. Verify credentials and retry.
authorization_error: Insufficient tier access. Voice Agent requires OWNER tier.
validation_error: Invalid parameters or audio format. Check request format and retry.
session_error: Session not found or failed to create. Start a new session.
audio_error: Audio processing failed. Check audio format and quality.
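The error types above can be mapped to client actions in a single handler. This is a sketch under one loud assumption: the error payload exposes the error name in a type field, which this page does not confirm. The describeError helper and the action labels are hypothetical.

```javascript
// Suggested client action per error type listed above.
// The action labels are illustrative, not part of the API.
const ERROR_ACTIONS = {
  authentication_error: { action: 'fix_credentials', hint: 'Verify the JWT token and API key.' },
  authorization_error: { action: 'request_access', hint: 'Voice Agent requires OWNER tier.' },
  validation_error: { action: 'fix_request', hint: 'Check parameters and audio format.' },
  session_error: { action: 'restart_session', hint: 'Start a new session.' },
  audio_error: { action: 'check_audio', hint: 'Check audio format and quality.' }
};

// Assumption: `error.type` carries one of the names above.
function describeError(error) {
  const type = error && error.type;
  if (Object.prototype.hasOwnProperty.call(ERROR_ACTIONS, type)) {
    return { type, ...ERROR_ACTIONS[type] };
  }
  return { type: 'unknown', action: 'report', hint: 'Unexpected error.' };
}

// Usage:
// socket.on('error', (error) => {
//   const info = describeError(error);
//   console.error(`${info.type}: ${info.hint}`);
// });
```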