Voice Agent
Build real-time voice conversations with WebSocket-based bidirectional audio streaming, supporting multiple languages and intelligent interruption handling.
Overview
The Voice Agent API enables real-time voice conversations through WebSocket connections. It combines speech-to-text, language models, and text-to-speech into a seamless conversational experience with support for interruptions, translation, and multi-language conversations.
Real-Time
Bidirectional audio streaming with low latency for natural conversations.
Multi-Language
Auto-detect languages and translate conversations in real-time.
Interruption
Intelligent interruption handling for natural conversation flow.
WebSocket Connection
Connect to the Voice Agent API over WebSocket at wss://api.selamgpt.com using the namespace /v1/audio/agent.
Connection Requirements
api_key - API key for authentication (required)
voice - Voice ID for the agent (required)

import { io } from 'socket.io-client';

// Connect to Voice Agent with namespace
const socket = io('wss://api.selamgpt.com/v1/audio/agent', {
  query: {
    api_key: 'your-api-key',
    voice: 'mekdes' // Amharic voice
  },
  transports: ['websocket']
});

// Handle connection
socket.on('connected', (data) => {
  console.log('Connected:', data);
  // { status: 'connected', sid: '...', voice: 'mekdes', message: '...' }

  // Start session
  socket.emit('start_session', {
    voice: 'mekdes',
    language: 'am',
    // Note: For Amharic, use mekdes/ameha/betty. GPT Audio voices only support English.
    instructions: 'You are a helpful voice assistant.'
  });
});

// Handle session started
socket.on('session_started', (data) => {
  console.log('Session started:', data);
  // Now you can send audio chunks
});

// Handle errors
socket.on('error', (error) => {
  console.error('Error:', error);
});

Audio Format Requirements
Voice Agent requires specific audio formats for optimal performance:
Format: PCM16 (16-bit PCM audio)
Sample Rate: 16 kHz (16,000 samples/second)
Channels: Mono (single-channel audio)
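Browser capture APIs typically deliver Float32 samples at 44.1 or 48 kHz, so audio usually needs converting before it is sent. A minimal sketch of that conversion, with illustrative helper names that are not part of the API:

```javascript
// Downsample a mono Float32Array to 16 kHz by simple decimation.
// (A production pipeline would low-pass filter first to avoid aliasing.)
function downsampleTo16k(samples, inputRate, targetRate = 16000) {
  if (inputRate === targetRate) return samples;
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(samples.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    out[i] = samples[Math.floor(i * ratio)];
  }
  return out;
}

// Convert Float32 samples in [-1, 1] to 16-bit signed PCM (little-endian).
function floatToPcm16(samples) {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

The resulting ArrayBuffer can then be base64-encoded and sent as an `audio_chunk`.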
Session Management
Voice Agent sessions follow a structured lifecycle with clear events and states:
1. Start Session
Initialize a new voice conversation session. For Ethiopian languages, use native voices (mekdes, ameha, betty for Amharic). GPT Audio voices (alloy, echo, etc.) only support English:
socket.emit('start_session', {
  voice: 'mekdes', // Amharic voice (required)
  language: 'am', // Language code (optional, auto-detected)
  instructions: 'You are a helpful assistant.', // System prompt (optional)
  vad_threshold: 600, // Voice activity threshold (default: 600)
  interrupt_threshold: 8 // Interruption sensitivity (default: 8)
  // Note: Use mekdes/ameha/betty for Amharic, ubax/muuse for Somali, etc.
});

2. Send Audio
Stream audio chunks to the agent:
// Capture audio from microphone
// Note: MediaRecorder typically produces compressed audio (e.g. webm/opus);
// convert to PCM16 before sending if the server requires raw PCM.
navigator.mediaDevices.getUserMedia({ audio: true })
  .then(stream => {
    const mediaRecorder = new MediaRecorder(stream);

    mediaRecorder.ondataavailable = (event) => {
      const reader = new FileReader();
      reader.onload = () => {
        const audioData = reader.result.split(',')[1]; // Base64

        socket.emit('audio_chunk', {
          audio: audioData,
          format: 'pcm16',
          sample_rate: 16000
        });
      };
      reader.readAsDataURL(event.data);
    };

    mediaRecorder.start(100); // Send chunks every 100ms
  });

3. Receive Responses
Listen for transcriptions, text responses, and audio:
// Transcription of user speech
socket.on('transcription', (data) => {
  console.log('User said:', data.text);
  // { text: 'Hello, how are you?', is_final: true }
});

// AI text response
socket.on('response', (data) => {
  console.log('AI response:', data.text);
  // { text: 'I am doing well, thank you!' }
});

// Audio response chunks
socket.on('audio_chunk', (data) => {
  const audioData = atob(data.audio); // Decode base64
  // Play audio chunk
  playAudioChunk(audioData);
});

// Audio streaming complete
socket.on('audio_end', (data) => {
  console.log('Audio complete:', data.total_chunks);
});

4. End Session
Gracefully close the session:
socket.emit('end_session');

socket.on('session_ended', (data) => {
  console.log('Session ended:', data);
  // { session_id: '...', status: 'ended' }
});

Interruption Handling
Voice Agent supports intelligent interruption handling for natural conversations. Users can interrupt the AI while it's speaking, just like in human conversations.
Control Actions
interrupt - Stop the current speech immediately
start_listening - Resume voice detection
stop_listening - Pause voice detection
pause - Pause the session
resume - Resume the session
// Interrupt current speech
socket.emit('control', { action: 'interrupt' });

// Pause listening
socket.emit('control', { action: 'stop_listening' });

// Resume listening
socket.emit('control', { action: 'start_listening' });

// Handle control acknowledgment
socket.on('control_ack', (data) => {
  console.log('Control executed:', data);
  // { action: 'interrupt', status: 'executed' }
});

Tip
The interrupt_threshold parameter controls how sensitive interruption detection is. Lower values (1-5) make it easier to interrupt, while higher values (10-15) require more confident speech.
Translation Support
Voice Agent supports real-time translation with automatic language detection. Speak in any supported language and get responses in your preferred language.
Auto-Detection Mode
Automatically detect the user's language and respond in the same language. Use native voices for each language.
socket.emit('start_session', {
  voice: 'mekdes' // Amharic voice
  // No language specified - auto-detect
  // Will detect and respond in Amharic
});

Target Language Mode
Specify a target language for responses. Use the appropriate native voice for each language.
socket.emit('start_session', {
  voice: 'mekdes', // Amharic voice
  language: 'am' // Respond in Amharic
  // Use mekdes/ameha/betty for Amharic
});

Supported Languages
en - English
am - Amharic
so - Somali
ti - Tigrinya
om - Oromo
Access Requirements
Voice Agent is currently in limited access for select developers:
Current Access Status
Limited Access Program
Available to approved developers with unlimited usage
Authentication Required
API key authentication
No Rate Limits
Unlimited concurrent sessions and requests
Complete Example
Here's a complete example of a Voice Agent implementation:
import { io } from 'socket.io-client';

class VoiceAgent {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.socket = null;
    this.mediaRecorder = null;
    this.isRecording = false;
  }

  connect(voice = 'mekdes') {
    this.socket = io('wss://api.selamgpt.com/v1/audio/agent', {
      query: {
        api_key: this.apiKey,
        voice: voice
      },
      transports: ['websocket']
    });

    this.setupEventHandlers();
  }

  setupEventHandlers() {
    this.socket.on('connected', (data) => {
      console.log('Connected:', data);
      this.startSession();
    });

    this.socket.on('session_started', (data) => {
      console.log('Session started:', data);
      console.log('Ready for Push-to-Talk');
    });

    this.socket.on('transcription', (data) => {
      console.log('User:', data.text);
    });

    this.socket.on('response', (data) => {
      console.log('AI:', data.text);
    });

    this.socket.on('audio_chunk', (data) => {
      this.playAudioChunk(data.audio);
    });

    this.socket.on('error', (error) => {
      console.error('Error:', error);
    });
  }

  startSession() {
    this.socket.emit('start_session', {
      voice: 'mekdes',
      language: 'am',
      instructions: 'You are a helpful voice assistant.',
      vad_threshold: 600,
      interrupt_threshold: 8
    });
  }

  // Push-to-Talk: call this when the user holds the button
  async startTalking() {
    if (this.isRecording) return;
    this.isRecording = true;

    // Interrupt any current AI speech
    this.socket.emit('control', { action: 'interrupt' });

    try {
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: {
          sampleRate: 16000,
          channelCount: 1,
          echoCancellation: true,
          noiseSuppression: true,
          autoGainControl: true
        }
      });

      this.mediaRecorder = new MediaRecorder(stream);

      this.mediaRecorder.ondataavailable = (event) => {
        if (event.data.size > 0 && this.socket) {
          const reader = new FileReader();
          reader.onload = () => {
            const audioData = reader.result.split(',')[1];
            this.socket.emit('audio_chunk', {
              audio: audioData,
              format: 'pcm16',
              sample_rate: 16000
            });
          };
          reader.readAsDataURL(event.data);
        }
      };

      // Send chunks every 100ms
      this.mediaRecorder.start(100);
      console.log('Listening...');

    } catch (err) {
      console.error('Error accessing microphone:', err);
      this.isRecording = false;
    }
  }

  // Push-to-Talk: call this when the user releases the button
  stopTalking() {
    if (!this.isRecording || !this.mediaRecorder) return;

    this.mediaRecorder.stop();
    this.mediaRecorder.stream.getTracks().forEach(track => track.stop());
    this.mediaRecorder = null;
    this.isRecording = false;

    // Optional: signal that speaking is done if your logic needs it;
    // the server typically handles VAD or end of stream.
    console.log('Stopped listening');
  }

  playAudioChunk(base64Audio) {
    const audioData = atob(base64Audio);
    const audioArray = new Uint8Array(audioData.length);
    for (let i = 0; i < audioData.length; i++) {
      audioArray[i] = audioData.charCodeAt(i);
    }

    const blob = new Blob([audioArray], { type: 'audio/mpeg' }); // standard MIME type for MP3
    const url = URL.createObjectURL(blob);
    const audio = new Audio(url);
    audio.play().catch(e => console.error('Playback failed:', e));
  }

  disconnect() {
    if (this.socket) {
      this.socket.disconnect();
    }
  }
}

// Usage example (React/HTML)
/*
const agent = new VoiceAgent('your-api-key');
agent.connect();

// Bind to a button
<button
  onMouseDown={() => agent.startTalking()}
  onMouseUp={() => agent.stopTalking()}
  onMouseLeave={() => agent.stopTalking()}
>
  Hold to Speak
</button>
*/

Best Practices
Audio Quality
- Use a high-quality microphone with noise cancellation
- Enable echo cancellation in audio capture settings
- Send audio chunks at consistent intervals (100-200ms)
- Maintain a 16 kHz sample rate for optimal recognition
Connection Management
- Implement reconnection logic for network interruptions
- Handle connection errors gracefully with user feedback
- Always end sessions properly to free resources
- Monitor connection status and notify users of issues
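The reconnection advice above can be sketched with a small exponential-backoff helper (socket.io-client also ships built-in `reconnection` options, which may be all you need):

```javascript
// Exponential backoff: 500 ms, 1 s, 2 s, 4 s, ... capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 10000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Example wiring (sketch):
// let attempt = 0;
// socket.on('disconnect', () => {
//   setTimeout(() => { attempt++; socket.connect(); }, backoffDelay(attempt));
// });
// socket.on('connect', () => { attempt = 0; });
```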
Conversation Design
- Provide clear system instructions for consistent behavior
- Use an appropriate VAD threshold for your environment
- Adjust the interrupt threshold based on your use case
- Choose voices that match your target language and audience
Security
- Never expose API keys in client-side code
- Implement proper session timeout handling
- Validate all user inputs before sending to the agent
Performance
- Buffer audio playback to prevent choppy output
- Use Web Workers for audio processing to avoid blocking the UI
- Implement audio queue management for smooth playback
- Monitor latency and adjust chunk sizes if needed
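The audio-queue advice above can be sketched as a small sequential playback queue; `playFn` is injected so the logic stays independent of the browser Audio API:

```javascript
// Plays queued chunks one at a time so overlapping `audio_chunk`
// events don't produce choppy, overlapping output.
class PlaybackQueue {
  constructor(playFn) {
    this.playFn = playFn; // async (chunk) => resolves when playback finishes
    this.queue = [];
    this.playing = false;
  }

  enqueue(chunk) {
    this.queue.push(chunk);
    if (!this.playing) this.drain();
  }

  async drain() {
    this.playing = true;
    while (this.queue.length > 0) {
      await this.playFn(this.queue.shift());
    }
    this.playing = false;
  }
}
```

In the browser, `playFn` could wrap `new Audio(url)` and resolve on its `ended` event.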
Error Handling
Handle common errors gracefully to provide a smooth user experience:
authentication_error - Invalid or missing API key. Verify credentials and retry.
authorization_error - Insufficient tier access. Voice Agent requires OWNER tier.
validation_error - Invalid parameters or audio format. Check request format and retry.
session_error - Session not found or failed to create. Start a new session.
audio_error - Audio processing failed. Check audio format and quality.
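A sketch of routing these error types to user-facing messages; the `type` field on the error payload is an assumption, so check the actual shape the API returns:

```javascript
// Map error types to messages suitable for display to the user.
function describeError(error) {
  switch (error.type) {
    case 'authentication_error':
      return 'Invalid or missing API key. Check your credentials.';
    case 'authorization_error':
      return 'Your tier does not include Voice Agent access.';
    case 'validation_error':
      return 'Invalid parameters or audio format.';
    case 'session_error':
      return 'Session lost. Please start a new session.';
    case 'audio_error':
      return 'Audio processing failed. Check format (PCM16, 16 kHz, mono).';
    default:
      return 'An unexpected error occurred.';
  }
}

// Example wiring:
// socket.on('error', (error) => showToUser(describeError(error)));
```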