Deploying ZSE to Production: Docker & Kubernetes
A complete guide to deploying ZSE in production with Docker, Kubernetes, and proper monitoring.
Docker Deployment
Dockerfile
```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# The CUDA runtime image does not ship Python or pip; install them first
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3-pip \
    && rm -rf /var/lib/apt/lists/* \
    && pip3 install zllm-zse

COPY models/qwen-7b.zse /models/
EXPOSE 8000
CMD ["zse", "serve", "/models/qwen-7b.zse", "--host", "0.0.0.0"]
```
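With the Dockerfile above, a typical build-and-run cycle looks like the following (the image name `zse-server` is illustrative; `--gpus all` assumes the NVIDIA Container Toolkit is installed on the host):

```shell
# Build the image from the Dockerfile above
docker build -t zse-server .

# Run with GPU access, mapping the serving port
docker run --gpus all -p 8000:8000 zse-server

# From another shell, smoke-test the health endpoint used by the probes below
curl http://localhost:8000/health
```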
Docker Compose
```yaml
version: '3.8'
services:
  zse:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    volumes:
      - ./models:/models
    environment:
      - ZSE_MAX_BATCH_SIZE=32
      - ZSE_LOG_LEVEL=info
```
Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zse-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: zse
  template:
    metadata:
      labels:
        app: zse  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: zse
          image: your-registry/zse:latest
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30
```
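The Deployment alone is not reachable from other pods; a Service is the usual next step. A minimal sketch (the Service name and port mapping are assumptions, with the selector matching the manifest above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: zse-inference
spec:
  selector:
    app: zse
  ports:
    - port: 80
      targetPort: 8000
```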
Health Checks & Monitoring
Prometheus Metrics
ZSE exposes Prometheus metrics at `/metrics`:
• `zse_requests_total`
• `zse_request_duration_seconds`
• `zse_tokens_generated_total`
• `zse_gpu_memory_bytes`
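To have Prometheus collect these, add a scrape job pointed at the serving port. The job name and static target below are placeholders; in Kubernetes you would typically use service discovery or a ServiceMonitor instead of `static_configs`:

```yaml
scrape_configs:
  - job_name: zse
    metrics_path: /metrics
    static_configs:
      - targets: ["zse-inference:8000"]
```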
Alerting Rules
```yaml
groups:
  - name: zse
    rules:
      - alert: HighLatency
        expr: zse_request_duration_seconds{quantile="0.99"} > 5
        for: 5m
      - alert: GPUMemoryHigh
        expr: zse_gpu_memory_bytes / zse_gpu_memory_total > 0.95
        for: 1m
```
Production Checklist
• [ ] Set appropriate `--max-batch` and `--max-concurrent`
• [ ] Enable request logging: `--log-format json`
• [ ] Configure rate limiting: `--rate-limit 100`
• [ ] Set up Prometheus scraping
• [ ] Test graceful shutdown handling
• [ ] Configure horizontal pod autoscaling
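For the autoscaling item, a standard HorizontalPodAutoscaler is a starting point. The sketch below scales on CPU utilization, which is rarely the right signal for GPU inference; in practice you would feed a queue-depth or latency metric through a Prometheus adapter instead. The target name matches the Deployment above; replica bounds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: zse-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: zse-inference
  minReplicas: 3
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```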