Streaming Responses with ZSE: Real-time Token Generation
Enable real-time token streaming for responsive chat applications.
Why Streaming?
Without streaming, users wait for the entire response. With streaming:
• **~50ms** time-to-first-token
• Immediate visual feedback
• Better perceived performance
Enable Streaming
REST API
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-7b",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```
Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

stream = client.chat.completions.create(
    model="qwen-7b",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
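In a real chat app you will usually also want to stop consuming the stream when the user hits "stop" (see best practices below) rather than draining every remaining token. A minimal sketch using a `threading.Event` as the stop signal; the `consume` helper and the fake stream are illustrative, not part of the SDK:

```python
import threading

def consume(deltas, cancel: threading.Event) -> str:
    """Collect streamed content deltas until the stream ends or cancel is set."""
    parts = []
    for delta in deltas:
        if cancel.is_set():
            break  # user pressed "stop": abandon the rest of the stream
        parts.append(delta)
    return "".join(parts)

# Simulate a stream that gets cancelled partway through.
cancel = threading.Event()

def fake_stream():
    for i, tok in enumerate(["Once", " upon", " a", " time"]):
        if i == 2:
            cancel.set()  # pretend the user clicked stop mid-stream
        yield tok

print(consume(fake_stream(), cancel))  # -> "Once upon"
```

In production the `cancel` event would be set from your UI or request handler, and you would also close the underlying HTTP response so the server can stop generating.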
JavaScript/React
```javascript
async function streamChat(message, onToken) {
  const response = await fetch('/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen-7b',
      messages: [{ role: 'user', content: message }],
      stream: true
    })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // SSE events can be split across network chunks, so buffer the
    // decoded text and only process complete lines.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep the trailing partial line for next read
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const payload = line.slice('data: '.length);
      if (payload === '[DONE]') return;
      const content = JSON.parse(payload).choices[0]?.delta?.content;
      if (content) onToken(content); // e.g. append to React state
    }
  }
}
```
Server-Sent Events Format
Each chunk is a JSON payload prefixed with `data: `, and the stream ends with `data: [DONE]`:
```
data: {"choices":[{"delta":{"content":"Once"}}]}
data: {"choices":[{"delta":{"content":" upon"}}]}
data: {"choices":[{"delta":{"content":" a"}}]}
data: [DONE]
```
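If you are not using an SDK, parsing this format by hand is straightforward: keep only lines that start with `data: `, decode the JSON payload, and stop at `[DONE]`. A minimal sketch; the `iter_deltas` helper is illustrative, not a ZSE API:

```python
import json

def iter_deltas(sse_lines):
    """Yield content strings from raw SSE lines, stopping at [DONE]."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # server signals end of stream
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("content"):  # first chunk may carry only a role
            yield delta["content"]

lines = [
    'data: {"choices":[{"delta":{"content":"Once"}}]}',
    'data: {"choices":[{"delta":{"content":" upon"}}]}',
    'data: [DONE]',
]
print("".join(iter_deltas(lines)))  # -> "Once upon"
```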
Best Practices
1. **Handle backpressure** - Don't overwhelm slow clients
2. **Implement cancellation** - Let users stop generation
3. **Show typing indicator** - While waiting for first token
4. **Buffer intelligently** - Consider word-level chunks for smoother UX
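Point 4 can be sketched as a small re-buffering generator that holds back the trailing partial word until it is complete, so the UI never flashes half a word. The `word_chunks` helper is illustrative:

```python
def word_chunks(deltas):
    """Re-buffer token deltas so each emitted chunk ends on a word boundary."""
    buf = ""
    for delta in deltas:
        buf += delta
        # Flush everything up to the last space; keep the partial word buffered.
        cut = buf.rfind(" ")
        if cut != -1:
            yield buf[:cut + 1]
            buf = buf[cut + 1:]
    if buf:
        yield buf  # flush the trailing partial word at end of stream

tokens = ["On", "ce up", "on a ", "time"]
print(list(word_chunks(tokens)))  # -> ['Once ', 'upon a ', 'time']
```

The same idea extends to flushing on punctuation or on a timer, trading a little latency for smoother rendering.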