Feature

zServe

Production-ready inference server with OpenAI-compatible API and sub-4-second cold start.

Overview

zServe is ZSE's high-performance inference server, designed for production workloads with minimal latency and maximum throughput.

  • 3.9s Cold Start: Fastest model loading in the industry
  • OpenAI Compatible: Drop-in replacement for the OpenAI API
  • Multi-GPU: Automatic model parallelism

  • OpenAI-compatible /v1/chat/completions endpoint
  • Streaming responses with SSE
  • Request batching and queuing
  • Built-in rate limiting
  • Prometheus metrics endpoint
  • Health checks and graceful shutdown
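Streamed responses arrive as SSE `data:` frames, each carrying an OpenAI-style chunk whose `choices[0].delta.content` holds the next piece of text. As a minimal sketch (the frame contents below are illustrative, not captured server output), a client can reassemble the text like this:

```python
import json

def parse_sse_line(line: str):
    """Extract the content delta from one SSE data frame, or None."""
    if not line.startswith("data: "):
        return None          # skip blank keep-alives and comment lines
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":  # OpenAI-style end-of-stream sentinel
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

# Example frames in the shape a server might emit them (illustrative):
frames = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
text = "".join(t for t in (parse_sse_line(f) for f in frames) if t)
print(text)  # Hello!
```

The official OpenAI client handles this framing for you when you pass `stream=True`; the parser above is only useful for raw HTTP clients.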

Basic Usage

1. Start the server

Launch with a model:

bash
zse serve qwen-7b.zse

2. Test the endpoint

Send a request with curl:

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello!"}]}'

3. Use with the OpenAI client

Point your existing code at the server:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="qwen-7b",
    messages=[{"role": "user", "content": "Hello!"}]
)

The server starts on port 8000 by default. Use --port to change it.

Configuration

CLI Options

bash
zse serve model.zse \
  --port 8000 \
  --host 0.0.0.0 \
  --workers 4 \
  --max-batch 32 \
  --max-concurrent 100 \
  --timeout 60 \
  --api-key "sk-xxx"

  • --port: server port
  • --host: bind address
  • --workers: worker processes
  • --max-batch: max batch size
  • --max-concurrent: max concurrent requests
  • --timeout: request timeout in seconds
  • --api-key: require an API key

Config File

Use a YAML config file for complex setups:

zse.yaml
server:
  host: 0.0.0.0
  port: 8000
  workers: 4

model:
  path: ./qwen-7b.zse
  max_batch_size: 32
  max_sequence_length: 4096

inference:
  temperature: 0.7
  top_p: 0.9
  max_tokens: 2048

limits:
  max_concurrent_requests: 100
  requests_per_minute: 1000
  timeout_seconds: 60

auth:
  api_keys:
    - sk-key-1
    - sk-key-2
bash
zse serve --config zse.yaml

Environment Variables

bash
# Model configuration
export ZSE_MODEL_PATH="./qwen-7b.zse"
export ZSE_MAX_BATCH_SIZE=32
# Server configuration
export ZSE_PORT=8000
export ZSE_HOST=0.0.0.0
# Auth
export ZSE_API_KEY="sk-xxx"
# Start server
zse serve
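With three configuration sources available, a natural question is which one wins. Assuming the usual precedence (an explicit CLI flag overrides the environment variable, which overrides the built-in default; this ordering is an assumption, not documented behavior), resolution can be sketched as:

```python
import os

# Built-in defaults (values taken from the examples in this page)
DEFAULTS = {"port": 8000, "host": "0.0.0.0", "max_batch_size": 32}

def resolve(name: str, env_var: str, cli_value=None):
    """CLI flag wins, then the environment variable, then the default."""
    if cli_value is not None:
        return cli_value
    env = os.environ.get(env_var)
    if env is not None:
        # Environment variables are strings; coerce to the default's type
        return type(DEFAULTS[name])(env)
    return DEFAULTS[name]

os.environ["ZSE_PORT"] = "9000"
print(resolve("port", "ZSE_PORT"))                   # 9000 (from env)
print(resolve("port", "ZSE_PORT", cli_value=8080))   # 8080 (CLI wins)
```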

Multi-Model Serving

Serve multiple models from a single server instance:

bash
# Serve multiple models from a directory
zse serve ./models/
# Or specify individually
zse serve model1.zse model2.zse model3.zse

Models are loaded on-demand with LRU caching:

zse.yaml
models:
  - name: qwen-7b
    path: ./qwen-7b.zse
    max_loaded: true        # Keep loaded
  - name: llama-8b
    path: ./llama-8b.zse
    max_loaded: false       # Load on demand
  - name: codellama-34b
    path: ./codellama-34b.zse
    gpu: [0, 1]             # Specific GPUs

cache:
  max_models: 3             # Max models in memory
  eviction: lru             # Eviction policy

Pin frequently-used models with max_loaded: true to avoid cold starts.
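The lru eviction policy keeps the most recently requested models resident and drops the one touched longest ago when the cache is full. A small sketch of that behavior (model loading is stubbed out; names and capacity are illustrative):

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most max_models loaded; evict the least recently used."""

    def __init__(self, max_models: int):
        self.max_models = max_models
        self.loaded = OrderedDict()  # name -> model handle, oldest first

    def get(self, name: str):
        if name in self.loaded:
            self.loaded.move_to_end(name)       # mark as recently used
            return self.loaded[name]
        if len(self.loaded) >= self.max_models:
            self.loaded.popitem(last=False)     # evict the oldest entry
        self.loaded[name] = f"<model {name}>"   # stand-in for a real load
        return self.loaded[name]

cache = ModelCache(max_models=2)
cache.get("qwen-7b")
cache.get("llama-8b")
cache.get("qwen-7b")         # touch: qwen-7b becomes most recent
cache.get("codellama-34b")   # evicts llama-8b, the least recently used
print(list(cache.loaded))    # ['qwen-7b', 'codellama-34b']
```

Pinning (max_loaded: true) would simply exempt an entry from the popitem step.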

Scaling

Multi-GPU

Automatically shard large models across GPUs:

bash
# Auto-detect and use all GPUs
zse serve model.zse --tensor-parallel auto
# Specify GPUs
zse serve model.zse --tensor-parallel 4 --gpus 0,1,2,3
# Pipeline parallelism for very large models
zse serve model.zse --pipeline-parallel 2

Load Balancing

Run multiple instances behind a load balancer:

nginx.conf
upstream zse {
    least_conn;
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}

server {
    listen 80;

    location /v1/ {
        proxy_pass http://zse;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # For streaming
        proxy_buffering off;
        proxy_cache off;
    }
}

Kubernetes

deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zse
spec:
  replicas: 3
  selector:
    matchLabels:
      app: zse
  template:
    metadata:
      labels:
        app: zse          # must match the selector above
    spec:
      containers:
        - name: zse
          image: zllm/zse:latest
          command: ["zse", "serve", "/models/qwen-7b.zse"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /models
          readinessProbe:
            httpGet:
              path: /health
              port: 8000

Monitoring

Health Endpoint

bash
curl http://localhost:8000/health
json
{
  "status": "healthy",
  "model": "qwen-7b",
  "uptime": 3600,
  "requests_processed": 15420,
  "gpu_memory_used": "4.2 GB",
  "gpu_memory_total": "24 GB"
}

Prometheus Metrics

bash
curl http://localhost:8000/metrics
text
# TYPE zse_requests_total counter
zse_requests_total{model="qwen-7b",status="success"} 15420
zse_requests_total{model="qwen-7b",status="error"} 12
# TYPE zse_request_duration_seconds histogram
zse_request_duration_seconds_bucket{le="0.1"} 1000
zse_request_duration_seconds_bucket{le="0.5"} 12000
zse_request_duration_seconds_bucket{le="1.0"} 15000
# TYPE zse_tokens_generated_total counter
zse_tokens_generated_total{model="qwen-7b"} 2456789
# TYPE zse_gpu_memory_bytes gauge
zse_gpu_memory_bytes{gpu="0"} 4500000000
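To collect these metrics, point Prometheus at the /metrics endpoint. A minimal scrape config fragment (the job name and target are illustrative, not part of zServe):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: zse           # illustrative job name
    metrics_path: /metrics
    static_configs:
      - targets:
          - localhost:8000
```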

Logging

bash
# Verbose logging
zse serve model.zse --log-level debug
# JSON logs for production
zse serve model.zse --log-format json
# Log to file
zse serve model.zse --log-file /var/log/zse/server.log