Streaming Responses with ZSE: Real-time Token Generation
Enable real-time token streaming for responsive chat applications.
Why Streaming?
Without streaming, users wait for the entire response. With streaming:
• **~50ms** time-to-first-token
• Immediate visual feedback
• Better perceived performance
Enable Streaming
REST API
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-7b",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'
```
Python (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

stream = client.chat.completions.create(
    model="qwen-7b",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
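In a real chat app you will usually also want to stop consuming the stream when the user hits "stop" (see best practices below) rather than draining every remaining token. A minimal sketch using a `threading.Event` as the stop signal; the `consume` helper and the fake stream are illustrative, not part of the SDK:

```python
import threading

def consume(deltas, cancel: threading.Event) -> str:
    """Collect streamed content deltas until the stream ends or cancel is set."""
    parts = []
    for delta in deltas:
        if cancel.is_set():
            break  # user pressed "stop": abandon the rest of the stream
        parts.append(delta)
    return "".join(parts)

# Simulate a stream that gets cancelled partway through.
cancel = threading.Event()

def fake_stream():
    for i, tok in enumerate(["Once", " upon", " a", " time"]):
        if i == 2:
            cancel.set()  # pretend the user clicked stop mid-stream
        yield tok

print(consume(fake_stream(), cancel))  # -> "Once upon"
```

In production the `cancel` event would be set from your UI or request handler, and you would also close the underlying HTTP response so the server can stop generating.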
JavaScript/React
```javascript
async function streamChat(message, onToken) {
  const response = await fetch('/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen-7b',
      messages: [{ role: 'user', content: message }],
      stream: true
    })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // SSE events can be split across network chunks, so buffer the
    // decoded text and only process complete lines.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep the trailing partial line for next read
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const payload = line.slice('data: '.length);
      if (payload === '[DONE]') return;
      const content = JSON.parse(payload).choices[0]?.delta?.content;
      if (content) onToken(content); // e.g. append to React state
    }
  }
}
```
Server-Sent Events Format
Each chunk is a JSON payload prefixed with `data: `, and the stream ends with `data: [DONE]`:
```
data: {"choices":[{"delta":{"content":"Once"}}]}
data: {"choices":[{"delta":{"content":" upon"}}]}
data: {"choices":[{"delta":{"content":" a"}}]}
data: [DONE]
```
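If you are not using an SDK, parsing this format by hand is straightforward: keep only lines that start with `data: `, decode the JSON payload, and stop at `[DONE]`. A minimal sketch; the `iter_deltas` helper is illustrative, not a ZSE API:

```python
import json

def iter_deltas(sse_lines):
    """Yield content strings from raw SSE lines, stopping at [DONE]."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # server signals end of stream
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("content"):  # first chunk may carry only a role
            yield delta["content"]

lines = [
    'data: {"choices":[{"delta":{"content":"Once"}}]}',
    'data: {"choices":[{"delta":{"content":" upon"}}]}',
    'data: [DONE]',
]
print("".join(iter_deltas(lines)))  # -> "Once upon"
```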
Best Practices
1. **Handle backpressure** - Don't overwhelm slow clients
2. **Implement cancellation** - Let users stop generation
3. **Show typing indicator** - While waiting for first token
4. **Buffer intelligently** - Consider word-level chunks for smoother UX
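Point 4 can be sketched as a small re-buffering generator that holds back the trailing partial word until it is complete, so the UI never flashes half a word. The `word_chunks` helper is illustrative:

```python
def word_chunks(deltas):
    """Re-buffer token deltas so each emitted chunk ends on a word boundary."""
    buf = ""
    for delta in deltas:
        buf += delta
        # Flush everything up to the last space; keep the partial word buffered.
        cut = buf.rfind(" ")
        if cut != -1:
            yield buf[:cut + 1]
            buf = buf[cut + 1:]
    if buf:
        yield buf  # flush the trailing partial word at end of stream

tokens = ["On", "ce up", "on a ", "time"]
print(list(word_chunks(tokens)))  # -> ['Once ', 'upon a ', 'time']
```

The same idea extends to flushing on punctuation or on a timer, trading a little latency for smoother rendering.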