Deploying ZSE to Production: Docker & Kubernetes
A complete guide to deploying ZSE in production with Docker, Kubernetes, and proper monitoring.
Docker Deployment
Dockerfile
```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# The CUDA runtime image does not ship Python or pip; install them first
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3-pip \
    && rm -rf /var/lib/apt/lists/* \
    && pip3 install zllm-zse

COPY models/qwen-7b.zse /models/
EXPOSE 8000
CMD ["zse", "serve", "/models/qwen-7b.zse", "--host", "0.0.0.0"]
```
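With the Dockerfile above, a typical build-and-run cycle looks like the following (the image name `zse-server` is illustrative; `--gpus all` assumes the NVIDIA Container Toolkit is installed on the host):

```shell
# Build the image from the Dockerfile above
docker build -t zse-server .

# Run with GPU access, mapping the serving port
docker run --gpus all -p 8000:8000 zse-server

# From another shell, smoke-test the health endpoint used by the probes below
curl http://localhost:8000/health
```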
Docker Compose
```yaml
version: '3.8'
services:
  zse:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    volumes:
      - ./models:/models
    environment:
      - ZSE_MAX_BATCH_SIZE=32
      - ZSE_LOG_LEVEL=info
```
Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zse-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: zse
  template:
    metadata:
      labels:
        app: zse  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: zse
          image: your-registry/zse:latest
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30
```
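The Deployment alone is not reachable from other pods; a Service is the usual next step. A minimal sketch (the Service name and port mapping are assumptions, with the selector matching the manifest above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: zse-inference
spec:
  selector:
    app: zse
  ports:
    - port: 80
      targetPort: 8000
```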
Health Checks & Monitoring
Prometheus Metrics
ZSE exposes Prometheus metrics at `/metrics`:
• `zse_requests_total`
• `zse_request_duration_seconds`
• `zse_tokens_generated_total`
• `zse_gpu_memory_bytes`
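To have Prometheus collect these, add a scrape job pointed at the serving port. The job name and static target below are placeholders; in Kubernetes you would typically use service discovery or a ServiceMonitor instead of `static_configs`:

```yaml
scrape_configs:
  - job_name: zse
    metrics_path: /metrics
    static_configs:
      - targets: ["zse-inference:8000"]
```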
Alerting Rules
```yaml
groups:
  - name: zse
    rules:
      - alert: HighLatency
        expr: zse_request_duration_seconds{quantile="0.99"} > 5
        for: 5m
      - alert: GPUMemoryHigh
        expr: zse_gpu_memory_bytes / zse_gpu_memory_total > 0.95
        for: 1m
```
Production Checklist
• [ ] Set appropriate `--max-batch` and `--max-concurrent`
• [ ] Enable request logging: `--log-format json`
• [ ] Configure rate limiting: `--rate-limit 100`
• [ ] Set up Prometheus scraping
• [ ] Test graceful shutdown handling
• [ ] Configure horizontal pod autoscaling
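For the autoscaling item, a standard HorizontalPodAutoscaler is a starting point. The sketch below scales on CPU utilization, which is rarely the right signal for GPU inference; in practice you would feed a queue-depth or latency metric through a Prometheus adapter instead. The target name matches the Deployment above; replica bounds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: zse-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: zse-inference
  minReplicas: 3
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```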