Voice Agent

Build real-time voice conversations with WebSocket-based bidirectional audio streaming, supporting multiple languages and intelligent interruption handling.

Overview

The Voice Agent API enables real-time voice conversations through WebSocket connections. It combines speech-to-text, language models, and text-to-speech into a seamless conversational experience with support for interruptions, translation, and multi-language conversations.

  • Real-Time – Bidirectional audio streaming with low latency for natural conversations.
  • Multi-Language – Auto-detect languages and translate conversations in real-time.
  • Interruption – Intelligent interruption handling for natural conversation flow.

Warning

Voice Agent is currently in limited access and available to select developers only. This feature is designed for the SelamGPT platform and requires both a JWT token and an API key. Contact us for access.

Information

GPT Audio Voices: Voices like alloy, echo, fable, onyx, nova, and shimmer are GPT Audio voices with built-in LLM capabilities. They only support English and don't require a separate model parameter. For other languages, use the appropriate native voices (mekdes for Amharic, ubax for Somali, etc.).

WebSocket Connection

Connect to the Voice Agent API over WebSocket at wss://api.selamgpt.com, using the /v1/audio/agent namespace.

Connection Requirements

  • token – JWT Bearer token (required)
  • api_key – API key for authentication (required)
  • voice – Voice ID for the agent (required)
```javascript
import { io } from 'socket.io-client';

// Connect to Voice Agent with namespace
const socket = io('wss://api.selamgpt.com/v1/audio/agent', {
  query: {
    token: 'your-jwt-token',
    api_key: 'your-api-key',
    voice: 'mekdes'  // Amharic voice
  },
  transports: ['websocket']
});

// Handle connection
socket.on('connected', (data) => {
  console.log('Connected:', data);
  // { status: 'connected', sid: '...', voice: 'mekdes', message: '...' }

  // Start session
  socket.emit('start_session', {
    voice: 'mekdes',
    language: 'am',
    // Note: For Amharic, use mekdes/ameha/betty. GPT Audio voices only support English.
    instructions: 'You are a helpful voice assistant.'
  });
});

// Handle session started
socket.on('session_started', (data) => {
  console.log('Session started:', data);
  // Now you can send audio chunks
});

// Handle errors
socket.on('error', (error) => {
  console.error('Error:', error);
});
```

Audio Format Requirements

Voice Agent requires specific audio formats for optimal performance:

  • Format: PCM16 (16-bit PCM audio)
  • Sample Rate: 16 kHz (16,000 samples/second)
  • Channels: Mono (single channel)

Information

Audio chunks should be sent as base64-encoded PCM16 data. The Voice Agent will automatically handle voice activity detection (VAD) and speech recognition.
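As a sanity check when sizing buffers, the chunk size for this format is fixed: 16,000 samples/second, 2 bytes per sample, one channel. The helper below (illustrative, not part of any SDK) computes the raw byte count for a chunk of a given duration:

```javascript
// Raw byte count of a PCM16 mono chunk of the given duration.
// 16,000 samples/s * 2 bytes/sample * 1 channel, so 100 ms is 3,200 bytes
// before base64 encoding (base64 then inflates this by roughly a third).
function pcm16ChunkBytes(durationMs, sampleRate = 16000, channels = 1) {
  const bytesPerSample = 2; // 16-bit samples
  return Math.round(sampleRate * channels * bytesPerSample * (durationMs / 1000));
}
```

For example, `pcm16ChunkBytes(100)` returns 3200, a quick check that your capture pipeline really is producing 16 kHz mono PCM16.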

Session Management

Voice Agent sessions follow a structured lifecycle with clear events and states:

1. Start Session

Initialize a new voice conversation session. For Ethiopian languages, use native voices (mekdes, ameha, betty for Amharic). GPT Audio voices (alloy, echo, etc.) only support English:

```javascript
socket.emit('start_session', {
  voice: 'mekdes',             // Amharic voice (required)
  language: 'am',              // Language code (optional, auto-detected)
  instructions: 'You are a helpful assistant.',  // System prompt (optional)
  vad_threshold: 600,          // Voice activity threshold (default: 600)
  interrupt_threshold: 8       // Interruption sensitivity (default: 8)
  // Note: Use mekdes/ameha/betty for Amharic, ubax/muuse for Somali, etc.
});
```

2. Send Audio

Stream audio chunks to the agent:

```javascript
// Capture raw PCM16 audio from the microphone.
// Note: MediaRecorder outputs compressed audio (e.g. WebM/Opus), not PCM16,
// so use the Web Audio API to access the raw samples instead.
navigator.mediaDevices.getUserMedia({ audio: true })
  .then(stream => {
    const audioContext = new AudioContext({ sampleRate: 16000 });
    const source = audioContext.createMediaStreamSource(stream);
    // ScriptProcessorNode is deprecated but simple; prefer an AudioWorklet in production
    const processor = audioContext.createScriptProcessor(4096, 1, 1);

    processor.onaudioprocess = (event) => {
      const float32 = event.inputBuffer.getChannelData(0);
      // Convert Float32 samples in [-1, 1] to 16-bit PCM
      const pcm16 = new Int16Array(float32.length);
      for (let i = 0; i < float32.length; i++) {
        const s = Math.max(-1, Math.min(1, float32[i]));
        pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
      }
      // Base64-encode the PCM16 bytes
      const bytes = new Uint8Array(pcm16.buffer);
      let binary = '';
      for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);

      socket.emit('audio_chunk', {
        audio: btoa(binary),
        format: 'pcm16',
        sample_rate: 16000
      });
    };

    source.connect(processor);
    processor.connect(audioContext.destination);
  });
```

3. Receive Responses

Listen for transcriptions, text responses, and audio:

```javascript
// Transcription of user speech
socket.on('transcription', (data) => {
  console.log('User said:', data.text);
  // { text: 'Hello, how are you?', is_final: true }
});

// AI text response
socket.on('response', (data) => {
  console.log('AI response:', data.text);
  // { text: 'I am doing well, thank you!' }
});

// Audio response chunks
socket.on('audio_chunk', (data) => {
  const audioData = atob(data.audio); // Decode base64
  // Play audio chunk
  playAudioChunk(audioData);
});

// Audio streaming complete
socket.on('audio_end', (data) => {
  console.log('Audio complete:', data.total_chunks);
});
```

4. End Session

Gracefully close the session:

```javascript
socket.emit('end_session');

socket.on('session_ended', (data) => {
  console.log('Session ended:', data);
  // { session_id: '...', status: 'ended' }
});
```

Interruption Handling

Voice Agent supports intelligent interruption handling for natural conversations. Users can interrupt the AI while it's speaking, just like in human conversations.

Control Actions

  • interrupt – Stop the current speech immediately
  • start_listening – Resume voice detection
  • stop_listening – Pause voice detection
  • pause – Pause the session
  • resume – Resume the session

```javascript
// Interrupt current speech
socket.emit('control', { action: 'interrupt' });

// Pause listening
socket.emit('control', { action: 'stop_listening' });

// Resume listening
socket.emit('control', { action: 'start_listening' });

// Handle control acknowledgment
socket.on('control_ack', (data) => {
  console.log('Control executed:', data);
  // { action: 'interrupt', status: 'executed' }
});
```

Tip

The interrupt_threshold parameter controls how sensitive the interruption detection is. Lower values (1-5) make it easier to interrupt, while higher values (10-15) require more confident speech.

Translation Support

Voice Agent supports real-time translation with automatic language detection. Speak in any supported language and get responses in your preferred language.

Auto-Detection Mode

Automatically detect the user's language and respond in the same language. Use native voices for each language.

```javascript
socket.emit('start_session', {
  voice: 'mekdes'  // Amharic voice
  // No language specified: auto-detect and respond in the detected language
});
```

Target Language Mode

Specify a target language for responses. Use the appropriate native voice for each language.

```javascript
socket.emit('start_session', {
  voice: 'mekdes',  // Amharic voice
  language: 'am'    // Respond in Amharic
  // Use mekdes/ameha/betty for Amharic
});
```

Supported Languages

  • en – English
  • am – Amharic
  • so – Somali
  • ti – Tigrinya
  • om – Oromo
  • aa – Afar
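When wiring language selection to voice selection, a small lookup of defaults keeps things tidy. The sketch below maps only the voices this page names (the full voice catalog may include more, including voices for Tigrinya, Oromo, and Afar); `defaultVoiceFor` is an illustrative helper, not an API call:

```javascript
// Default voice per language code, using only voices named on this page.
// GPT Audio voices such as alloy are English-only.
const DEFAULT_VOICES = {
  en: 'alloy',   // GPT Audio voice (English only)
  am: 'mekdes',  // alternatives: ameha, betty
  so: 'ubax'     // alternative: muuse
};

function defaultVoiceFor(language) {
  // Returns null for languages whose voices this page does not list
  return DEFAULT_VOICES[language] ?? null;
}
```

Pass the result as the `voice` query parameter when connecting, and fall back to your voice catalog when the helper returns null.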

Access Requirements

Voice Agent is currently in limited access for select developers:

Current Access Status

  • Limited Access Program – available to approved developers with unlimited usage
  • Authentication Required – both JWT token and API key authentication
  • No Rate Limits – unlimited concurrent sessions and requests

Information

Interested in Voice Agent access? Contact our team to discuss your use case and get approved for the limited access program.

Complete Example

Here's a complete example of a Voice Agent implementation:

```javascript
import { io } from 'socket.io-client';

class VoiceAgent {
  constructor(token, apiKey) {
    this.token = token;
    this.apiKey = apiKey;
    this.socket = null;
    this.audioContext = null;
    this.processor = null;
  }

  connect(voice = 'mekdes') {
    this.socket = io('wss://api.selamgpt.com/v1/audio/agent', {
      query: {
        token: this.token,
        api_key: this.apiKey,
        voice: voice
      },
      transports: ['websocket']
    });

    this.setupEventHandlers();
  }

  setupEventHandlers() {
    this.socket.on('connected', (data) => {
      console.log('Connected:', data);
      this.startSession();
    });

    this.socket.on('session_started', (data) => {
      console.log('Session started:', data);
      this.startAudioCapture();
    });

    this.socket.on('transcription', (data) => {
      console.log('User:', data.text);
    });

    this.socket.on('response', (data) => {
      console.log('AI:', data.text);
    });

    this.socket.on('audio_chunk', (data) => {
      this.playAudioChunk(data.audio);
    });

    this.socket.on('error', (error) => {
      console.error('Error:', error);
    });
  }

  startSession() {
    this.socket.emit('start_session', {
      voice: 'mekdes',
      language: 'am',
      instructions: 'You are a helpful voice assistant.',
      vad_threshold: 600,
      interrupt_threshold: 8
      // Note: Use mekdes for Amharic, ubax for Somali, alloy for English
    });
  }

  async startAudioCapture() {
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        channelCount: 1,
        echoCancellation: true,
        noiseSuppression: true
      }
    });

    // MediaRecorder emits compressed audio (WebM/Opus), not PCM16,
    // so capture raw samples through the Web Audio API instead
    this.audioContext = new AudioContext({ sampleRate: 16000 });
    const source = this.audioContext.createMediaStreamSource(stream);
    this.processor = this.audioContext.createScriptProcessor(4096, 1, 1);

    this.processor.onaudioprocess = (event) => {
      const float32 = event.inputBuffer.getChannelData(0);
      // Convert Float32 samples in [-1, 1] to 16-bit PCM
      const pcm16 = new Int16Array(float32.length);
      for (let i = 0; i < float32.length; i++) {
        const s = Math.max(-1, Math.min(1, float32[i]));
        pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
      }
      const bytes = new Uint8Array(pcm16.buffer);
      let binary = '';
      for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);

      this.socket.emit('audio_chunk', {
        audio: btoa(binary),
        format: 'pcm16',
        sample_rate: 16000
      });
    };

    source.connect(this.processor);
    this.processor.connect(this.audioContext.destination);
  }

  playAudioChunk(base64Audio) {
    // Decode base64 and play (the MIME type depends on the server's output format)
    const audioData = atob(base64Audio);
    const audioArray = new Uint8Array(audioData.length);
    for (let i = 0; i < audioData.length; i++) {
      audioArray[i] = audioData.charCodeAt(i);
    }

    const blob = new Blob([audioArray], { type: 'audio/mp3' });
    const url = URL.createObjectURL(blob);
    const audio = new Audio(url);
    audio.play();
  }

  interrupt() {
    this.socket.emit('control', { action: 'interrupt' });
  }

  endSession() {
    if (this.processor) {
      this.processor.disconnect();
    }
    if (this.audioContext) {
      this.audioContext.close();
    }
    this.socket.emit('end_session');
  }

  disconnect() {
    if (this.socket) {
      this.socket.disconnect();
    }
  }
}

// Usage
const agent = new VoiceAgent('your-jwt-token', 'your-api-key');
agent.connect('mekdes');  // Use Amharic voice
```

Best Practices

Audio Quality

  • Use high-quality microphone with noise cancellation
  • Enable echo cancellation in audio capture settings
  • Send audio chunks at consistent intervals (100-200ms)
  • Maintain 16kHz sample rate for optimal recognition

Connection Management

  • Implement reconnection logic for network interruptions
  • Handle connection errors gracefully with user feedback
  • Always end sessions properly to free resources
  • Monitor connection status and notify users of issues
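socket.io-client ships built-in reconnection (the `reconnection`, `reconnectionDelay`, and `reconnectionDelayMax` options), but if you implement your own retry loop, exponential backoff with jitter avoids hammering the server after an outage. A minimal sketch; the function name and default values are illustrative:

```javascript
// Exponential backoff with full jitter: double the delay ceiling on each
// attempt (capped at capMs), then pick a random delay below it so many
// clients reconnecting after an outage do not retry in lockstep.
function reconnectDelay(attempt, baseMs = 500, capMs = 10000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}
```

Call this from your `disconnect` handler, incrementing `attempt` on each failure and resetting it to 0 once `connected` fires again.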

Conversation Design

  • Provide clear system instructions for consistent behavior
  • Use appropriate VAD threshold for your environment
  • Adjust interrupt threshold based on use case
  • Choose voices that match your target language and audience

Security

  • Never expose JWT tokens or API keys in client-side code
  • Use secure token exchange mechanisms
  • Implement proper session timeout handling
  • Validate all user inputs before sending to the agent

Performance

  • Buffer audio playback to prevent choppy output
  • Use Web Workers for audio processing to avoid blocking UI
  • Implement audio queue management for smooth playback
  • Monitor latency and adjust chunk sizes if needed
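The audio-queue point above can be sketched concretely. This minimal queue plays chunks strictly in arrival order; the playback function is injected so the sketch stays framework-agnostic (in a real page it would decode the chunk and resolve when playback ends):

```javascript
// Plays queued audio chunks one at a time, in arrival order.
// playFn(chunk) must return a Promise that resolves when the chunk finishes.
class AudioQueue {
  constructor(playFn) {
    this.playFn = playFn;
    this.queue = [];
    this.playing = false;
  }

  enqueue(chunk) {
    this.queue.push(chunk);
    if (!this.playing) this.drain();
  }

  async drain() {
    this.playing = true;
    while (this.queue.length > 0) {
      // Wait for the current chunk to finish before starting the next,
      // so playback never overlaps or reorders chunks
      await this.playFn(this.queue.shift());
    }
    this.playing = false;
  }
}
```

Feed it from the audio handler, e.g. `socket.on('audio_chunk', (data) => queue.enqueue(data.audio));`.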

Error Handling

Handle common errors gracefully to provide a smooth user experience:

  • authentication_error – Invalid or missing JWT token or API key. Verify credentials and retry.
  • authorization_error – Insufficient tier access. Voice Agent requires OWNER tier.
  • validation_error – Invalid parameters or audio format. Check request format and retry.
  • session_error – Session not found or failed to create. Start a new session.
  • audio_error – Audio processing failed. Check audio format and quality.
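One way to keep recovery logic in one place is to map each error type above to an action. This sketch assumes the error payload carries a `type` field with the names listed above; the action names themselves are illustrative, not part of the API:

```javascript
// Map Voice Agent error types (from the list above) to a recovery action.
function recoveryActionFor(error) {
  switch (error.type) {
    case 'authentication_error':
      return 'refresh_credentials'; // re-obtain the JWT / check the API key, then reconnect
    case 'authorization_error':
      return 'notify_user';         // access level cannot be fixed client-side
    case 'session_error':
      return 'restart_session';     // emit start_session again
    case 'validation_error':
    case 'audio_error':
      return 'fix_request';         // check parameters and audio format
    default:
      return 'notify_user';         // unknown error: surface it and stop
  }
}
```

Wire it into the error handler, e.g. `socket.on('error', (err) => dispatch(recoveryActionFor(err)));`, where `dispatch` is your own routing function.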
