Streaming
Learn how to use Server-Sent Events (SSE) to stream chat completions in real time, providing a better user experience for interactive applications.
What is Streaming?
Streaming allows you to receive the model's response incrementally as it's generated, rather than waiting for the entire response to complete. This creates a more responsive user experience, especially for longer responses.
Benefits of Streaming
- Better UX: Users see the response appear in real time, as if it were being typed
- Faster Perceived Response: Users get feedback immediately, not after completion
- Interruptible: Users can stop generation early if they have enough information
- Lower Memory: Process chunks as they arrive instead of buffering the entire response
Information
Streaming uses Server-Sent Events (SSE), a standard protocol for server-to-client streaming over HTTP. The response is sent as a series of data: events.
Basic Streaming
Enable streaming by setting stream=true in your request. The response will be delivered as a series of chunks.
Basic Streaming Example
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key-here",
    base_url="https://api.selamgpt.com/v1"
)

# Create a streaming request
stream = client.chat.completions.create(
    model="selam-turbo",
    messages=[
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    stream=True
)

# Process each chunk as it arrives
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

print()  # New line at the end
Stream Response Format
Each chunk in the stream is a JSON object prefixed with data:. The stream ends with a data: [DONE] message.
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"selam-turbo","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"selam-turbo","choices":[{"index":0,"delta":{"content":"Machine"},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"selam-turbo","choices":[{"index":0,"delta":{"content":" learning"},"finish_reason":null}]}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"selam-turbo","choices":[{"index":0,"delta":{"content":" is"},"finish_reason":null}]}
...
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1694268190,"model":"selam-turbo","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Information
The first chunk typically contains the role field. Subsequent chunks contain content deltas. The final chunk has a finish_reason.
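If you are not using an SDK, you can consume the raw SSE stream directly. The sketch below is a minimal example using the requests library (an assumption; any HTTP client that can stream a response body works) and the standard /chat/completions path under the base URL shown above. It reads each data: line, decodes the JSON chunk, and stops at [DONE].
Raw SSE Parsing Example (sketch)
import json
import requests

response = requests.post(
    "https://api.selamgpt.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer your-api-key-here",
        "Content-Type": "application/json"
    },
    json={
        "model": "selam-turbo",
        "messages": [{"role": "user", "content": "Explain machine learning in simple terms."}],
        "stream": True
    },
    stream=True  # read the response body incrementally instead of waiting for it all
)

for line in response.iter_lines():
    if not line:
        continue  # SSE events are separated by blank lines
    decoded = line.decode("utf-8")
    if not decoded.startswith("data: "):
        continue
    payload = decoded[len("data: "):]
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"].get("content")
    if delta:
        print(delta, end="", flush=True)

print()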
Handling Stream Chunks
Process each chunk to extract the content delta and handle completion signals.
Advanced Stream Handling
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key-here",
    base_url="https://api.selamgpt.com/v1"
)

stream = client.chat.completions.create(
    model="selam-turbo",
    messages=[
        {"role": "user", "content": "Write a short story."}
    ],
    stream=True
)

full_response = ""
finish_reason = None

for chunk in stream:
    # Extract delta content
    delta = chunk.choices[0].delta

    # Check for content
    if delta.content:
        content = delta.content
        full_response += content
        print(content, end="", flush=True)

    # Check for finish reason
    if chunk.choices[0].finish_reason:
        finish_reason = chunk.choices[0].finish_reason

print(f"\n\nFinish reason: {finish_reason}")
print(f"Approximate word count: {len(full_response.split())}")
Error Handling in Streams
Implement robust error handling to manage network issues, rate limits, and other errors during streaming.
Error Handling Example
from openai import OpenAI, APIError, RateLimitError, APIConnectionError
import time

client = OpenAI(
    api_key="your-api-key-here",
    base_url="https://api.selamgpt.com/v1"
)

def stream_with_retry(messages, max_retries=3):
    """Stream with automatic retry on transient errors."""
    for attempt in range(max_retries):
        try:
            stream = client.chat.completions.create(
                model="selam-turbo",
                messages=messages,
                stream=True
            )

            for chunk in stream:
                if chunk.choices[0].delta.content:
                    yield chunk.choices[0].delta.content

            return  # Success, exit function

        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"\nRate limit hit. Retrying in {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

        except APIConnectionError:
            if attempt < max_retries - 1:
                print("\nConnection error. Retrying...")
                time.sleep(1)
            else:
                raise

        except APIError as e:
            print(f"\nAPI error: {e}")
            raise

# Usage
try:
    for content in stream_with_retry([
        {"role": "user", "content": "Tell me a joke."}
    ]):
        print(content, end="", flush=True)
except Exception as e:
    print(f"\nFailed after retries: {e}")
Warning
Important: Always implement timeout handling for streams. A stuck connection can hang indefinitely. Set reasonable timeouts and implement retry logic with exponential backoff.
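As a minimal sketch of timeout configuration with the OpenAI Python SDK (option names and exact semantics may vary by SDK version), you can set a default timeout and retry count on the client and override the timeout for an individual streaming request:
Timeout Configuration Example (sketch)
from openai import OpenAI

# Defaults applied to every request made with this client
client = OpenAI(
    api_key="your-api-key-here",
    base_url="https://api.selamgpt.com/v1",
    timeout=30.0,      # seconds before a request (or a stalled stream) is aborted
    max_retries=2      # built-in retries for connection errors and certain 5xx responses
)

# Override the timeout for a single, longer streaming request
stream = client.chat.completions.create(
    model="selam-turbo",
    messages=[{"role": "user", "content": "Summarize the history of computing."}],
    stream=True,
    timeout=60.0
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

print()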
Frontend Integration
Here's how to integrate streaming into a React application for a chat interface.
React Streaming Example
import { useState } from 'react';
import OpenAI from 'openai';

function ChatComponent() {
  const [messages, setMessages] = useState([]);
  const [input, setInput] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);

  const client = new OpenAI({
    apiKey: process.env.NEXT_PUBLIC_SELAM_API_KEY,
    baseURL: "https://api.selamgpt.com/v1",
    dangerouslyAllowBrowser: true // Only for demo
  });

  const sendMessage = async () => {
    if (!input.trim() || isStreaming) return;

    const userMessage = { role: 'user', content: input };
    setMessages(prev => [...prev, userMessage]);
    setInput('');
    setIsStreaming(true);

    // Create assistant message placeholder
    const assistantMessage = { role: 'assistant', content: '' };
    setMessages(prev => [...prev, assistantMessage]);

    try {
      const stream = await client.chat.completions.create({
        model: 'selam-turbo',
        messages: [...messages, userMessage],
        stream: true
      });

      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content;
        if (content) {
          setMessages(prev => {
            const updated = [...prev];
            const last = updated[updated.length - 1];
            // Replace the last message with a new object instead of mutating state in place
            updated[updated.length - 1] = { ...last, content: last.content + content };
            return updated;
          });
        }
      }
    } catch (error) {
      console.error('Streaming error:', error);
      setMessages(prev => {
        const updated = [...prev];
        updated[updated.length - 1] = {
          ...updated[updated.length - 1],
          content: 'Error: Failed to get response'
        };
        return updated;
      });
    } finally {
      setIsStreaming(false);
    }
  };

  return (
    <div className="chat-container">
      <div className="messages">
        {messages.map((msg, idx) => (
          <div key={idx} className={`message ${msg.role}`}>
            {msg.content}
          </div>
        ))}
      </div>
      <div className="input-area">
        <input
          value={input}
          onChange={(e) => setInput(e.target.value)}
          onKeyDown={(e) => e.key === 'Enter' && sendMessage()}
          disabled={isStreaming}
          placeholder="Type a message..."
        />
        <button onClick={sendMessage} disabled={isStreaming}>
          {isStreaming ? 'Sending...' : 'Send'}
        </button>
      </div>
    </div>
  );
}
Warning
Security Note: Never expose your API key in client-side code in production. Use a backend proxy to make API calls and keep your key secure on the server.
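As a minimal sketch of that pattern, the example below uses Flask (an assumption; any backend framework works) to proxy the streaming request server-side and re-emit each delta to the browser as an SSE event, so the API key never leaves the server. The /api/chat route and the payload shape are hypothetical.
Backend Proxy Example (sketch)
import json
from flask import Flask, Response, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI(
    api_key="your-api-key-here",  # load from an environment variable in production
    base_url="https://api.selamgpt.com/v1"
)

@app.route("/api/chat", methods=["POST"])
def chat():
    messages = request.get_json().get("messages", [])

    def generate():
        stream = client.chat.completions.create(
            model="selam-turbo",
            messages=messages,
            stream=True
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                # JSON-encode each delta so newlines in the text don't break the SSE framing
                yield f"data: {json.dumps({'content': delta})}\n\n"
        yield "data: [DONE]\n\n"

    return Response(generate(), mimetype="text/event-stream")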
Best Practices
Use Streaming for Interactive UIs
Streaming is ideal for chatbots, writing assistants, and any application where users expect real-time feedback. It makes your application feel more responsive and engaging.
Implement Proper Error Handling
Network issues can interrupt streams. Always implement retry logic with exponential backoff and provide clear error messages to users.
Handle Timeouts
Set reasonable timeouts for streaming requests. A stuck connection can hang indefinitely without proper timeout handling.
Buffer Partial Content
For UI updates, consider buffering chunks and updating the display at regular intervals (e.g., every 50ms) to avoid excessive re-renders.
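A minimal sketch of this idea in Python, assuming the same client setup as the earlier examples: deltas are collected in a buffer and flushed at most every 50 ms rather than on every chunk.
import time
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key-here",
    base_url="https://api.selamgpt.com/v1"
)

stream = client.chat.completions.create(
    model="selam-turbo",
    messages=[{"role": "user", "content": "Write a short story."}],
    stream=True
)

FLUSH_INTERVAL = 0.05  # update the display at most every 50 ms
buffer = []
last_flush = time.monotonic()

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        buffer.append(delta)
    now = time.monotonic()
    if buffer and now - last_flush >= FLUSH_INTERVAL:
        print("".join(buffer), end="", flush=True)
        buffer.clear()
        last_flush = now

# Flush whatever is left when the stream ends
if buffer:
    print("".join(buffer), end="", flush=True)
print()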
Allow Interruption
Give users the ability to stop generation early. This is especially important for long responses where users may get the information they need before completion.
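For example, recent versions of the OpenAI Python SDK expose a close() method on the stream object, so you can stop reading as soon as you have what you need. The length cutoff below is a hypothetical stand-in for a user-triggered cancel.
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key-here",
    base_url="https://api.selamgpt.com/v1"
)

stream = client.chat.completions.create(
    model="selam-turbo",
    messages=[{"role": "user", "content": "List 100 facts about the ocean."}],
    stream=True
)

received = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        received += delta
        print(delta, end="", flush=True)
    # Hypothetical stop condition: in a real app this would be a cancel button or similar
    if len(received) > 500:
        stream.close()  # end the HTTP connection so no further chunks are read
        break

print()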
Monitor Performance
Track metrics like time-to-first-token and tokens-per-second to ensure good performance. Slow streaming can negatively impact user experience.
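A minimal sketch of measuring both metrics client-side, assuming the same client setup as above. Each chunk typically carries roughly one token, so chunks per second is a reasonable proxy for tokens per second.
import time
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key-here",
    base_url="https://api.selamgpt.com/v1"
)

start = time.monotonic()
first_token_time = None
chunk_count = 0

stream = client.chat.completions.create(
    model="selam-turbo",
    messages=[{"role": "user", "content": "Explain photosynthesis."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.monotonic()  # first content delta arrived
        chunk_count += 1

elapsed = time.monotonic() - start

if first_token_time is not None:
    print(f"Time to first token: {first_token_time - start:.2f}s")
    print(f"Chunks per second: {chunk_count / elapsed:.1f}")
else:
    print("No content received")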