# zServe
Production-ready inference server with OpenAI-compatible API and sub-4-second cold start.
## Overview
zServe is ZSE's high-performance inference server, designed for production workloads with minimal latency and maximum throughput.
- **3.9s Cold Start**: fastest model loading in the industry
- **OpenAI Compatible**: drop-in replacement for the OpenAI API
- **Multi-GPU**: automatic model parallelism
- OpenAI-compatible /v1/chat/completions endpoint
- Streaming responses with SSE
- Request batching and queuing
- Built-in rate limiting
- Prometheus metrics endpoint
- Health checks and graceful shutdown
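Streaming responses (listed above) arrive as server-sent events, so clients that skip the SDK must split the stream on `data:` lines themselves. A minimal parser sketch, assuming the usual OpenAI-style framing of `data: {json}` chunks terminated by a `data: [DONE]` sentinel; the function name is illustrative:

```python
import json

def parse_sse_chunks(raw: str):
    """Yield the JSON payload of each SSE data line, stopping at [DONE]."""
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used by OpenAI-style APIs
        yield json.loads(payload)

# Example: two content deltas followed by the end sentinel
stream = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo!"}}]}\n\n'
    "data: [DONE]\n\n"
)
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse_chunks(stream))
print(text)  # Hello!
```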
## Basic Usage
### 1. Start the server

Launch with a model:

```bash
zse serve qwen-7b.zse
```

### 2. Test the endpoint

Send a request with curl:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
### 3. Use with OpenAI client

Point your existing code to the server:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen-7b",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

The server starts on port 8000 by default; use `--port` to change it.

## Configuration
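zServe accepts settings from CLI flags, a YAML config file, and environment variables. The precedence among these sources isn't specified here, but a common layering (file overridden by environment, overridden by CLI flags) can be sketched as follows; `merge_config` is purely illustrative, not part of zServe:

```python
def merge_config(file_cfg: dict, env_cfg: dict, cli_cfg: dict) -> dict:
    """Later sources override earlier ones: file < env < CLI (assumed order)."""
    merged = dict(file_cfg)
    merged.update(env_cfg)
    merged.update(cli_cfg)
    return merged

cfg = merge_config(
    {"port": 8000, "workers": 4},   # from zse.yaml
    {"port": 9000},                 # from ZSE_PORT=9000
    {"workers": 8},                 # from --workers 8
)
print(cfg)  # {'port': 9000, 'workers': 8}
```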
### CLI Options

```bash
zse serve model.zse
    --port 8000            # Server port
    --host 0.0.0.0         # Bind address
    --workers 4            # Worker processes
    --max-batch 32         # Max batch size
    --max-concurrent 100   # Max concurrent requests
    --timeout 60           # Request timeout (seconds)
    --api-key "sk-xxx"     # Require API key
```

### Config File
Use a YAML config file for complex setups:
```yaml
# zse.yaml
server:
  host: 0.0.0.0
  port: 8000
  workers: 4

model:
  path: ./qwen-7b.zse
  max_batch_size: 32
  max_sequence_length: 4096

inference:
  temperature: 0.7
  top_p: 0.9
  max_tokens: 2048

limits:
  max_concurrent_requests: 100
  requests_per_minute: 1000
  timeout_seconds: 60

auth:
  api_keys:
    - sk-key-1
    - sk-key-2
```

```bash
zse serve --config zse.yaml
```

### Environment Variables
```bash
# Model configuration
export ZSE_MODEL_PATH="./qwen-7b.zse"
export ZSE_MAX_BATCH_SIZE=32

# Server configuration
export ZSE_PORT=8000
export ZSE_HOST=0.0.0.0

# Auth
export ZSE_API_KEY="sk-xxx"

# Start server
zse serve
```

## Multi-Model Serving
Serve multiple models from a single server instance:
```bash
# Serve multiple models from a directory
zse serve ./models/

# Or specify individually
zse serve model1.zse model2.zse model3.zse
```

Models are loaded on demand with LRU caching:
```yaml
# zse.yaml
models:
  - name: qwen-7b
    path: ./qwen-7b.zse
    max_loaded: true       # Keep loaded
  - name: llama-8b
    path: ./llama-8b.zse
    max_loaded: false      # Load on demand
  - name: codellama-34b
    path: ./codellama-34b.zse
    gpu: [0, 1]            # Specific GPUs

cache:
  max_models: 3            # Max models in memory
  eviction: lru            # Eviction policy
```

Pin frequently used models with `max_loaded: true` to avoid cold starts.

## Scaling
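Before sharding, it helps to estimate whether a model fits on one GPU at all. A back-of-the-envelope sketch of per-GPU weight memory under even tensor-parallel sharding (weights only; the KV cache and activations add more on top):

```python
def weight_bytes_per_gpu(n_params: float, bytes_per_param: int, tp: int) -> float:
    """Approximate per-GPU weight memory under even tensor-parallel sharding."""
    return n_params * bytes_per_param / tp

GIB = 1024 ** 3
for tp in (1, 2, 4):
    # 7B parameters in fp16 (2 bytes each)
    per_gpu = weight_bytes_per_gpu(7e9, 2, tp)
    print(f"tp={tp}: {per_gpu / GIB:.1f} GiB per GPU")
```

With 24 GB cards, the weights of a 7B fp16 model fit on a single GPU (~13 GiB), which is why tensor parallelism mostly matters for much larger checkpoints like the 34B example above.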
### Multi-GPU
Automatically shard large models across GPUs:
```bash
# Auto-detect and use all GPUs
zse serve model.zse --tensor-parallel auto

# Specify GPUs
zse serve model.zse --tensor-parallel 4 --gpus 0,1,2,3

# Pipeline parallelism for very large models
zse serve model.zse --pipeline-parallel 2
```

### Load Balancing
Run multiple instances behind a load balancer:
```nginx
# nginx.conf
upstream zse {
    least_conn;
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://zse;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # For streaming
        proxy_buffering off;
        proxy_cache off;
    }
}
```

### Kubernetes
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zse
spec:
  replicas: 3
  selector:
    matchLabels:
      app: zse
  template:
    metadata:
      labels:
        app: zse
    spec:
      containers:
        - name: zse
          image: zllm/zse:latest
          command: ["zse", "serve", "/models/qwen-7b.zse"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /models
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
      volumes:
        - name: models
          hostPath:
            path: /models   # illustrative; use a PVC in production
```

## Monitoring
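Both endpoints in this section return plain HTTP responses, so readiness checks are easy to script. A small sketch that classifies a `/health` payload, assuming the field names shown below; `is_ready` and the headroom threshold are illustrative, not part of zServe:

```python
def is_ready(health: dict, min_free_gb: float = 1.0) -> bool:
    """True when the server reports healthy and has GPU memory headroom."""
    if health.get("status") != "healthy":
        return False
    # Values arrive as strings like "4.2 GB"; take the numeric part
    used = float(health["gpu_memory_used"].split()[0])
    total = float(health["gpu_memory_total"].split()[0])
    return total - used >= min_free_gb

sample = {
    "status": "healthy",
    "gpu_memory_used": "4.2 GB",
    "gpu_memory_total": "24 GB",
}
print(is_ready(sample))  # True
```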
### Health Endpoint
```bash
curl http://localhost:8000/health
```

```json
{
  "status": "healthy",
  "model": "qwen-7b",
  "uptime": 3600,
  "requests_processed": 15420,
  "gpu_memory_used": "4.2 GB",
  "gpu_memory_total": "24 GB"
}
```

### Prometheus Metrics
```bash
curl http://localhost:8000/metrics
```

```text
# TYPE zse_requests_total counter
zse_requests_total{model="qwen-7b",status="success"} 15420
zse_requests_total{model="qwen-7b",status="error"} 12

# TYPE zse_request_duration_seconds histogram
zse_request_duration_seconds_bucket{le="0.1"} 1000
zse_request_duration_seconds_bucket{le="0.5"} 12000
zse_request_duration_seconds_bucket{le="1.0"} 15000

# TYPE zse_tokens_generated_total counter
zse_tokens_generated_total{model="qwen-7b"} 2456789

# TYPE zse_gpu_memory_bytes gauge
zse_gpu_memory_bytes{gpu="0"} 4500000000
```

### Logging
```bash
# Verbose logging
zse serve model.zse --log-level debug

# JSON logs for production
zse serve model.zse --log-format json

# Log to file
zse serve model.zse --log-file /var/log/zse/server.log
```
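With `--log-format json`, each line is a standalone JSON object, which makes ad-hoc analysis straightforward. A sketch that tallies log levels from captured output; the record fields (`level`, `msg`) are assumptions, so adjust them to zServe's actual log schema:

```python
import json
from collections import Counter

def count_levels(log_text: str) -> Counter:
    """Count occurrences of each log level in JSON-lines output."""
    levels = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # "level" is an assumed field name for this sketch
        levels[record.get("level", "unknown")] += 1
    return levels

sample = "\n".join([
    '{"level": "info", "msg": "server started"}',
    '{"level": "info", "msg": "model loaded"}',
    '{"level": "error", "msg": "request timeout"}',
])
print(count_levels(sample))  # Counter({'info': 2, 'error': 1})
```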