Voice Agent

Build real-time voice conversations with WebSocket-based bidirectional audio streaming, supporting multiple languages and intelligent interruption handling.

Overview

The Voice Agent API enables real-time voice conversations through WebSocket connections. It combines speech-to-text, language models, and text-to-speech into a seamless conversational experience with support for interruptions, translation, and multi-language conversations.

Real-Time

Bidirectional audio streaming with low latency for natural conversations.

Multi-Language

Auto-detect languages and translate conversations in real-time.

Interruption

Intelligent interruption handling for natural conversation flow.

Warning

Voice Agent is currently in limited access and available to select developers only. This feature is designed for the SelamGPT platform and requires an API key. Contact us for access.

Information

GPT Audio Voices: Voices like alloy, echo, fable, onyx, nova, and shimmer are GPT Audio voices with built-in LLM capabilities. They only support English and don't require a separate model parameter. For other languages, use the appropriate native voices (mekdes for Amharic, ubax for Somali, etc.).

WebSocket Connection

Connect to the Voice Agent API over WebSocket at wss://api.selamgpt.com using the namespace /v1/audio/agent.

Connection Requirements

  • api_key - API key for authentication (required)
  • voice - Voice ID for the agent (required)
import { io } from 'socket.io-client';

// Connect to Voice Agent with namespace
const socket = io('wss://api.selamgpt.com/v1/audio/agent', {
  query: {
    api_key: 'your-api-key',
    voice: 'mekdes'  // Amharic voice
  },
  transports: ['websocket']
});

// Handle connection
socket.on('connected', (data) => {
  console.log('Connected:', data);
  // { status: 'connected', sid: '...', voice: 'mekdes', message: '...' }

  // Start session
  socket.emit('start_session', {
    voice: 'mekdes',
    language: 'am',
    // Note: For Amharic, use mekdes/ameha/betty. GPT Audio voices only support English.
    instructions: 'You are a helpful voice assistant.'
  });
});

// Handle session started
socket.on('session_started', (data) => {
  console.log('Session started:', data);
  // Now you can send audio chunks
});

// Handle errors
socket.on('error', (error) => {
  console.error('Error:', error);
});

Audio Format Requirements

Voice Agent requires specific audio formats for optimal performance:

  • Format: PCM16 (16-bit PCM audio)
  • Sample Rate: 16 kHz (16,000 samples/second)
  • Channels: Mono (single-channel audio)

Information

Audio chunks should be sent as base64-encoded PCM16 data. The Voice Agent will automatically handle voice activity detection (VAD) and speech recognition.
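
The encoding step described above can be sketched as two small helpers: clamp Float32 samples (the Web Audio API's -1..1 range) into 16-bit integers, then base64-encode the raw bytes. A minimal sketch; the function names are illustrative, not part of the API.

```javascript
// Convert Float32 samples (-1..1) to 16-bit PCM, clamping out-of-range values.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i])); // clamp
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

// Base64-encode the raw little-endian bytes of a PCM16 buffer.
function pcm16ToBase64(pcm) {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  // btoa exists in browsers and modern Node; fall back to Buffer otherwise
  return typeof btoa === 'function'
    ? btoa(binary)
    : Buffer.from(bytes).toString('base64');
}
```

The resulting string is what goes in the `audio` field of an `audio_chunk` event.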

Session Management

Voice Agent sessions follow a structured lifecycle with clear events and states:

1. Start Session

Initialize a new voice conversation session. For Ethiopian languages, use native voices (mekdes, ameha, betty for Amharic). GPT Audio voices (alloy, echo, etc.) only support English:

socket.emit('start_session', {
  voice: 'mekdes',             // Amharic voice (required)
  language: 'am',              // Language code (optional, auto-detected)
  instructions: 'You are a helpful assistant.',  // System prompt (optional)
  vad_threshold: 600,          // Voice activity threshold (default: 600)
  interrupt_threshold: 8       // Interruption sensitivity (default: 8)
  // Note: Use mekdes/ameha/betty for Amharic, ubax/muuse for Somali, etc.
});

2. Send Audio

Stream audio chunks to the agent:

// Capture audio from microphone
// Note: MediaRecorder emits container-encoded audio (typically WebM/Opus),
// not raw PCM16. For strict PCM16 capture, read Float32 samples via an
// AudioWorklet or ScriptProcessorNode and convert them before sending.
navigator.mediaDevices.getUserMedia({ audio: true })
  .then(stream => {
    const mediaRecorder = new MediaRecorder(stream);

    mediaRecorder.ondataavailable = (event) => {
      const reader = new FileReader();
      reader.onload = () => {
        const audioData = reader.result.split(',')[1]; // Base64 payload of the data URL

        socket.emit('audio_chunk', {
          audio: audioData,
          format: 'pcm16',
          sample_rate: 16000
        });
      };
      reader.readAsDataURL(event.data);
    };

    mediaRecorder.start(100); // Send chunks every 100ms
  });

3. Receive Responses

Listen for transcriptions, text responses, and audio:

// Transcription of user speech
socket.on('transcription', (data) => {
  console.log('User said:', data.text);
  // { text: 'Hello, how are you?', is_final: true }
});

// AI text response
socket.on('response', (data) => {
  console.log('AI response:', data.text);
  // { text: 'I am doing well, thank you!' }
});

// Audio response chunks
socket.on('audio_chunk', (data) => {
  const audioData = atob(data.audio); // Decode base64
  // Play audio chunk
  playAudioChunk(audioData);
});

// Audio streaming complete
socket.on('audio_end', (data) => {
  console.log('Audio complete:', data.total_chunks);
});

4. End Session

Gracefully close the session:

socket.emit('end_session');

socket.on('session_ended', (data) => {
  console.log('Session ended:', data);
  // { session_id: '...', status: 'ended' }
});

Interruption Handling

Voice Agent supports intelligent interruption handling for natural conversations. Users can interrupt the AI while it's speaking, just like in human conversations.

Control Actions

  • interrupt - Stop the current speech immediately
  • start_listening - Resume voice detection
  • stop_listening - Pause voice detection
  • pause - Pause the session
  • resume - Resume the session

// Interrupt current speech
socket.emit('control', { action: 'interrupt' });

// Pause listening
socket.emit('control', { action: 'stop_listening' });

// Resume listening
socket.emit('control', { action: 'start_listening' });

// Handle control acknowledgment
socket.on('control_ack', (data) => {
  console.log('Control executed:', data);
  // { action: 'interrupt', status: 'executed' }
});

Tip

The interrupt_threshold parameter controls how sensitive the interruption detection is. Lower values (1-5) make it easier to interrupt, while higher values (10-15) require more confident speech.

Translation Support

Voice Agent supports real-time translation with automatic language detection. Speak in any supported language and get responses in your preferred language.

Auto-Detection Mode

Automatically detect the user's language and respond in the same language. Use native voices for each language.

socket.emit('start_session', {
  voice: 'mekdes',  // Amharic voice
  // No language specified - auto-detect
  // Will detect and respond in Amharic
});

Target Language Mode

Specify a target language for responses. Use the appropriate native voice for each language.

socket.emit('start_session', {
  voice: 'mekdes',  // Amharic voice
  language: 'am',   // Respond in Amharic
  // Use mekdes/ameha/betty for Amharic
});

Supported Languages

  • en - English
  • am - Amharic
  • so - Somali
  • ti - Tigrinya
  • om - Oromo

Access Requirements

Voice Agent is currently in limited access for select developers:

Current Access Status

  • Limited Access Program - Available to approved developers with unlimited usage
  • Authentication Required - API key authentication
  • No Rate Limits - Unlimited concurrent sessions and requests

Information

Interested in Voice Agent access? Contact our team to discuss your use case and get approved for the limited access program.

Complete Example

Here's a complete example of a Voice Agent implementation:

import { io } from 'socket.io-client';

class VoiceAgent {
  constructor(apiKey) {
    this.apiKey = apiKey;
    this.socket = null;
    this.mediaRecorder = null;
    this.isRecording = false;
  }

  connect(voice = 'mekdes') {
    this.socket = io('wss://api.selamgpt.com/v1/audio/agent', {
      query: {
        api_key: this.apiKey,
        voice: voice
      },
      transports: ['websocket']
    });

    this.setupEventHandlers();
  }

  setupEventHandlers() {
    this.socket.on('connected', (data) => {
      console.log('Connected:', data);
      this.startSession();
    });

    this.socket.on('session_started', (data) => {
      console.log('Session started:', data);
      console.log('Ready for Push-to-Talk');
    });

    this.socket.on('transcription', (data) => {
      console.log('User:', data.text);
    });

    this.socket.on('response', (data) => {
      console.log('AI:', data.text);
    });

    this.socket.on('audio_chunk', (data) => {
      this.playAudioChunk(data.audio);
    });

    this.socket.on('error', (error) => {
      console.error('Error:', error);
    });
  }

  startSession() {
    this.socket.emit('start_session', {
      voice: 'mekdes',
      language: 'am',
      instructions: 'You are a helpful voice assistant.',
      vad_threshold: 600,
      interrupt_threshold: 8
    });
  }

  // Push-to-Talk: Call this when the user holds the button
  async startTalking() {
    if (this.isRecording) return;
    this.isRecording = true;

    // Interrupt any current AI speech
    this.socket.emit('control', { action: 'interrupt' });

    try {
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: {
          sampleRate: 16000,
          channelCount: 1,
          echoCancellation: true,
          noiseSuppression: true,
          autoGainControl: true
        }
      });

      this.mediaRecorder = new MediaRecorder(stream);

      this.mediaRecorder.ondataavailable = (event) => {
        if (event.data.size > 0 && this.socket) {
          const reader = new FileReader();
          reader.onload = () => {
            const audioData = reader.result.split(',')[1];
            this.socket.emit('audio_chunk', {
              audio: audioData,
              format: 'pcm16',
              sample_rate: 16000
            });
          };
          reader.readAsDataURL(event.data);
        }
      };

      // Send chunks every 100ms
      this.mediaRecorder.start(100);
      console.log('Listening...');

    } catch (err) {
      console.error('Error accessing microphone:', err);
      this.isRecording = false;
    }
  }

  // Push-to-Talk: Call this when the user releases the button
  stopTalking() {
    if (!this.isRecording || !this.mediaRecorder) return;

    this.mediaRecorder.stop();
    this.mediaRecorder.stream.getTracks().forEach(track => track.stop());
    this.mediaRecorder = null;
    this.isRecording = false;

    // Optional: Send a signal that speaking is done if needed by your logic,
    // though the server typically handles VAD or end of stream.
    console.log('Stopped listening');
  }

  playAudioChunk(base64Audio) {
    const audioData = atob(base64Audio);
    const audioArray = new Uint8Array(audioData.length);
    for (let i = 0; i < audioData.length; i++) {
      audioArray[i] = audioData.charCodeAt(i);
    }

    const blob = new Blob([audioArray], { type: 'audio/mp3' });
    const url = URL.createObjectURL(blob);
    const audio = new Audio(url);
    audio.play().catch(e => console.error('Playback failed:', e));
  }

  disconnect() {
    if (this.socket) {
      this.socket.disconnect();
    }
  }
}

// Usage Example (React/HTML)
/*
const agent = new VoiceAgent('your-api-key');
agent.connect();

// Bind to a button
<button
  onMouseDown={() => agent.startTalking()}
  onMouseUp={() => agent.stopTalking()}
  onMouseLeave={() => agent.stopTalking()}
>
  Hold to Speak
</button>
*/

Best Practices

Audio Quality

  • Use high-quality microphone with noise cancellation
  • Enable echo cancellation in audio capture settings
  • Send audio chunks at consistent intervals (100-200ms)
  • Maintain 16kHz sample rate for optimal recognition
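
Browsers often capture at 44.1 or 48 kHz regardless of the requested rate, so you may need to resample before sending. A naive decimation sketch (a production path should low-pass filter first to avoid aliasing; the function name is illustrative):

```javascript
// Naive downsampler: decimate Float32 samples from srcRate to dstRate
// by picking the nearest source sample. No anti-aliasing filter is
// applied, so treat this as a sketch, not production DSP.
function downsample(samples, srcRate, dstRate) {
  if (dstRate >= srcRate) return samples; // nothing to do
  const ratio = srcRate / dstRate;
  const out = new Float32Array(Math.floor(samples.length / ratio));
  for (let i = 0; i < out.length; i++) {
    out[i] = samples[Math.floor(i * ratio)];
  }
  return out;
}
```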

Connection Management

  • Implement reconnection logic for network interruptions
  • Handle connection errors gracefully with user feedback
  • Always end sessions properly to free resources
  • Monitor connection status and notify users of issues
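
For the reconnection point above, socket.io-client ships built-in retry via its `reconnection`, `reconnectionAttempts`, and `reconnectionDelay` options. If you roll your own, a capped exponential backoff is the usual schedule; this helper is a sketch, not part of the API:

```javascript
// Capped exponential backoff for manual reconnection scheduling:
// attempt 0 -> 1s, 1 -> 2s, 2 -> 4s, ... capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 10000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

Use the returned delay with `setTimeout` before calling your connect routine again, and remember that a new connection needs a fresh `start_session`.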

Conversation Design

  • Provide clear system instructions for consistent behavior
  • Use appropriate VAD threshold for your environment
  • Adjust interrupt threshold based on use case
  • Choose voices that match your target language and audience

Security

  • Never expose API keys in client-side code
  • Implement proper session timeout handling
  • Validate all user inputs before sending to the agent

Performance

  • Buffer audio playback to prevent choppy output
  • Use Web Workers for audio processing to avoid blocking UI
  • Implement audio queue management for smooth playback
  • Monitor latency and adjust chunk sizes if needed
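
The audio-queue suggestion above can be sketched as a promise chain that plays chunks strictly in arrival order; `PlaybackQueue` and `enqueue` are illustrative names, not part of the API:

```javascript
// Minimal playback queue: chunks are played strictly in the order they
// arrive, so overlapping socket events don't produce choppy output.
// enqueue() takes a function returning a Promise that resolves when
// that chunk has finished playing.
class PlaybackQueue {
  constructor() {
    this.tail = Promise.resolve(); // chain of pending playbacks
  }

  enqueue(playFn) {
    this.tail = this.tail.then(playFn).catch((err) => {
      console.error('Playback failed:', err); // keep the chain alive
    });
    return this.tail;
  }
}
```

Wire it up as `socket.on('audio_chunk', (data) => queue.enqueue(() => playChunk(data.audio)))`, where `playChunk` resolves on the audio element's `ended` event rather than when `play()` starts.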

Error Handling

Handle common errors gracefully to provide a smooth user experience:

  • authentication_error - Invalid or missing API key. Verify credentials and retry.
  • authorization_error - Insufficient tier access. Voice Agent requires OWNER tier.
  • validation_error - Invalid parameters or audio format. Check request format and retry.
  • session_error - Session not found or failed to create. Start a new session.
  • audio_error - Audio processing failed. Check audio format and quality.
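
One way to handle these uniformly is a dispatch table keyed by error type. The `type` field and the handler actions below are assumptions about the error payload shape; adjust them to what the server actually sends.

```javascript
// Route Voice Agent error events by type, falling back to a default
// handler for anything unrecognized. The `type` field is assumed.
function handleAgentError(error, actions) {
  const handler = actions[error.type] || actions.default;
  return handler(error);
}

// Example handler table (the returned action strings are placeholders)
const actions = {
  authentication_error: () => 'check-credentials',
  session_error: () => 'restart-session',
  audio_error: () => 'check-audio-format',
  default: (e) => `unhandled:${e.type}`,
};

// Usage sketch:
// socket.on('error', (err) => console.warn(handleAgentError(err, actions)));
```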
