Feature

zServe

Production-ready inference server with OpenAI-compatible API and sub-4-second cold start.

Overview

zServe is ZSE's high-performance inference server, designed for production workloads with minimal latency and maximum throughput.

  • 3.9s Cold Start: Fastest model loading in the industry
  • OpenAI Compatible: Drop-in replacement for the OpenAI API
  • Multi-GPU: Automatic model parallelism

  • OpenAI-compatible /v1/chat/completions endpoint
  • Streaming responses with SSE
  • Request batching and queuing
  • Built-in rate limiting
  • Prometheus metrics endpoint
  • Health checks and graceful shutdown
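Streamed responses arrive as SSE `data:` frames, each carrying an OpenAI-style chunk whose `choices[0].delta.content` holds the next piece of text. As a minimal sketch (the frame contents below are illustrative, not captured server output), a client can reassemble the text like this:

```python
import json

def parse_sse_line(line: str):
    """Extract the content delta from one SSE data frame, or None."""
    if not line.startswith("data: "):
        return None          # skip blank keep-alives and comment lines
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":  # OpenAI-style end-of-stream sentinel
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

# Example frames in the shape a server might emit them (illustrative):
frames = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
text = "".join(t for t in (parse_sse_line(f) for f in frames) if t)
print(text)  # Hello!
```

The official OpenAI client handles this framing for you when you pass `stream=True`; the parser above is only useful for raw HTTP clients.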

Basic Usage

1. Start the server

Launch with a model:

bash
zse serve qwen-7b.zse

2. Test the endpoint

Send a request with curl:

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello!"}]}'

3. Use with the OpenAI client

Point your existing code at the server:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="qwen-7b",
    messages=[{"role": "user", "content": "Hello!"}]
)

The server starts on port 8000 by default. Use --port to change it.

Configuration

CLI Options

bash
zse serve model.zse \
  --port 8000 \
  --host 0.0.0.0 \
  --workers 4 \
  --max-batch 32 \
  --max-concurrent 100 \
  --timeout 60 \
  --api-key "sk-xxx"

  • --port: server port
  • --host: bind address
  • --workers: worker processes
  • --max-batch: max batch size
  • --max-concurrent: max concurrent requests
  • --timeout: request timeout in seconds
  • --api-key: require an API key

Config File

Use a YAML config file for complex setups:

zse.yaml
server:
  host: 0.0.0.0
  port: 8000
  workers: 4

model:
  path: ./qwen-7b.zse
  max_batch_size: 32
  max_sequence_length: 4096

inference:
  temperature: 0.7
  top_p: 0.9
  max_tokens: 2048

limits:
  max_concurrent_requests: 100
  requests_per_minute: 1000
  timeout_seconds: 60

auth:
  api_keys:
    - sk-key-1
    - sk-key-2
bash
zse serve --config zse.yaml

Environment Variables

bash
# Model configuration
export ZSE_MODEL_PATH="./qwen-7b.zse"
export ZSE_MAX_BATCH_SIZE=32
# Server configuration
export ZSE_PORT=8000
export ZSE_HOST=0.0.0.0
# Auth
export ZSE_API_KEY="sk-xxx"
# Start server
zse serve
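With three configuration sources available, a natural question is which one wins. Assuming the usual precedence (an explicit CLI flag overrides the environment variable, which overrides the built-in default; this ordering is an assumption, not documented behavior), resolution can be sketched as:

```python
import os

# Built-in defaults (values taken from the examples in this page)
DEFAULTS = {"port": 8000, "host": "0.0.0.0", "max_batch_size": 32}

def resolve(name: str, env_var: str, cli_value=None):
    """CLI flag wins, then the environment variable, then the default."""
    if cli_value is not None:
        return cli_value
    env = os.environ.get(env_var)
    if env is not None:
        # Environment variables are strings; coerce to the default's type
        return type(DEFAULTS[name])(env)
    return DEFAULTS[name]

os.environ["ZSE_PORT"] = "9000"
print(resolve("port", "ZSE_PORT"))                   # 9000 (from env)
print(resolve("port", "ZSE_PORT", cli_value=8080))   # 8080 (CLI wins)
```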

Multi-Model Serving

Serve multiple models from a single server instance:

bash
# Serve multiple models from a directory
zse serve ./models/
# Or specify individually
zse serve model1.zse model2.zse model3.zse

Models are loaded on-demand with LRU caching:

zse.yaml
models:
  - name: qwen-7b
    path: ./qwen-7b.zse
    max_loaded: true        # Keep loaded
  - name: llama-8b
    path: ./llama-8b.zse
    max_loaded: false       # Load on demand
  - name: codellama-34b
    path: ./codellama-34b.zse
    gpu: [0, 1]             # Specific GPUs

cache:
  max_models: 3             # Max models in memory
  eviction: lru             # Eviction policy

Pin frequently-used models with max_loaded: true to avoid cold starts.
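The lru eviction policy keeps the most recently requested models resident and drops the one touched longest ago when the cache is full. A small sketch of that behavior (model loading is stubbed out; names and capacity are illustrative):

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most max_models loaded; evict the least recently used."""

    def __init__(self, max_models: int):
        self.max_models = max_models
        self.loaded = OrderedDict()  # name -> model handle, oldest first

    def get(self, name: str):
        if name in self.loaded:
            self.loaded.move_to_end(name)       # mark as recently used
            return self.loaded[name]
        if len(self.loaded) >= self.max_models:
            self.loaded.popitem(last=False)     # evict the oldest entry
        self.loaded[name] = f"<model {name}>"   # stand-in for a real load
        return self.loaded[name]

cache = ModelCache(max_models=2)
cache.get("qwen-7b")
cache.get("llama-8b")
cache.get("qwen-7b")         # touch: qwen-7b becomes most recent
cache.get("codellama-34b")   # evicts llama-8b, the least recently used
print(list(cache.loaded))    # ['qwen-7b', 'codellama-34b']
```

Pinning (max_loaded: true) would simply exempt an entry from the popitem step.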

Scaling

Multi-GPU

Automatically shard large models across GPUs:

bash
# Auto-detect and use all GPUs
zse serve model.zse --tensor-parallel auto
# Specify GPUs
zse serve model.zse --tensor-parallel 4 --gpus 0,1,2,3
# Pipeline parallelism for very large models
zse serve model.zse --pipeline-parallel 2

Load Balancing

Run multiple instances behind a load balancer:

nginx.conf
upstream zse {
    least_conn;
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://zse;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # For streaming
        proxy_buffering off;
        proxy_cache off;
    }
}

Kubernetes

deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zse
spec:
  replicas: 3
  selector:
    matchLabels:
      app: zse
  template:
    metadata:
      labels:
        app: zse          # must match the selector above
    spec:
      containers:
        - name: zse
          image: zllm/zse:latest
          command: ["zse", "serve", "/models/qwen-7b.zse"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /models
          readinessProbe:
            httpGet:
              path: /health
              port: 8000

Monitoring

Health Endpoint

bash
curl http://localhost:8000/health
json
{
  "status": "healthy",
  "model": "qwen-7b",
  "uptime": 3600,
  "requests_processed": 15420,
  "gpu_memory_used": "4.2 GB",
  "gpu_memory_total": "24 GB"
}

Prometheus Metrics

bash
curl http://localhost:8000/metrics
text
# TYPE zse_requests_total counter
zse_requests_total{model="qwen-7b",status="success"} 15420
zse_requests_total{model="qwen-7b",status="error"} 12
# TYPE zse_request_duration_seconds histogram
zse_request_duration_seconds_bucket{le="0.1"} 1000
zse_request_duration_seconds_bucket{le="0.5"} 12000
zse_request_duration_seconds_bucket{le="1.0"} 15000
# TYPE zse_tokens_generated_total counter
zse_tokens_generated_total{model="qwen-7b"} 2456789
# TYPE zse_gpu_memory_bytes gauge
zse_gpu_memory_bytes{gpu="0"} 4500000000
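To collect these metrics, point Prometheus at the /metrics endpoint. A minimal scrape config fragment (the job name and target are illustrative, not part of zServe):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: zse           # illustrative job name
    metrics_path: /metrics
    static_configs:
      - targets:
          - localhost:8000
```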

Logging

bash
# Verbose logging
zse serve model.zse --log-level debug
# JSON logs for production
zse serve model.zse --log-format json
# Log to file
zse serve model.zse --log-file /var/log/zse/server.log