Introduction
ZSE (Z Server Engine) is an LLM inference engine designed for fast cold starts and low memory usage.
What is ZSE?
ZSE is an inference engine that loads large language models in seconds, not minutes. It achieves this through pre-quantized model formats that skip runtime quantization entirely.
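To illustrate the idea (this is a toy sketch, not ZSE's actual `.zse` format): quantization maps each float weight to a 4-bit integer once, offline, so at serve time the loader only has to read and unpack bytes rather than transform the full-precision weights.

```python
# Toy INT4 pack/unpack, illustrative only. In a pre-quantized format, the
# expensive step (quantize_int4) runs once at export time; loading reduces
# to the cheap step (dequantize_int4, or even just a read for packed kernels).

def quantize_int4(weights, scale):
    """Map float weights to 4-bit signed integers (-8..7), two per byte."""
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    packed = bytearray()
    for i in range(0, len(q), 2):
        lo = q[i] & 0xF
        hi = (q[i + 1] & 0xF) if i + 1 < len(q) else 0
        packed.append(lo | (hi << 4))
    return bytes(packed)

def dequantize_int4(packed, n, scale):
    """Unpack nibbles and rescale; this is all a loader must do at startup."""
    out = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            out.append((nib - 16 if nib >= 8 else nib) * scale)
    return out[:n]

weights = [0.5, -0.25, 0.75, -1.0]
scale = 0.125
blob = quantize_int4(weights, scale)          # done once, stored in the file
restored = dequantize_int4(blob, len(weights), scale)
```

Storing `blob` (plus per-group scales) on disk is what lets a loader skip the quantization pass entirely.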
Why ZSE?
Traditional loaders quantize weights at startup, which dominates cold-start time for large models. The .zse format eliminates this overhead, enabling sub-4-second cold starts for 7B models. Whether you're building serverless AI endpoints, developing locally, or deploying to production, ZSE helps you iterate faster and reduce costs.
Key Features
- zQuantize: Pre-quantize models to INT4/NF4 format for instant loading
- zServe: OpenAI-compatible API server with streaming support
- zInfer: CLI tool for quick model testing and inference
- zStream: Layer streaming for running large models on limited VRAM
- zKV: Quantized KV cache for 4× memory savings
- OpenAI API: Drop-in replacement for OpenAI's chat completions API
- 3.9s cold start for 7B models (11.6× faster than bitsandbytes)
- 21.4s cold start for 32B models (5.6× faster)
- 63-72% memory savings with INT4 quantization
- GGUF model import support
- Multi-model management
- Streaming token generation
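The 4× KV-cache saving follows directly from the bit widths: a 16-bit cache entry shrinks to 4 bits. A back-of-envelope sizing, using an illustrative model config (the layer/head numbers below are assumptions, not ZSE's):

```python
# KV-cache size = 2 tensors (K and V) per layer, one head_dim-vector per
# KV head per position, at the given bit width. Config values are illustrative.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    return 2 * layers * kv_heads * head_dim * seq_len * bits // 8

fp16_bytes = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096, bits=16)
int4_bytes = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096, bits=4)

print(fp16_bytes // 2**20, "MiB ->", int4_bytes // 2**20, "MiB")
```

For this config the FP16 cache is 512 MiB at a 4096-token context; quantizing to 4 bits brings it to 128 MiB, the stated 4× saving (ignoring per-group scale overhead).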
Benchmarks
Cold start benchmarks on A100-80GB with Qwen 2.5 Coder models:
| Model | bitsandbytes | ZSE (.zse) | Speedup |
|---|---|---|---|
| 7B | 45.3s | 3.9s | 11.6× |
| 14B | 78.2s | 8.1s | 9.7× |
| 32B | 120.0s | 21.4s | 5.6× |
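The speedup column is simply the ratio of the two load times, which is easy to sanity-check:

```python
# Speedup = bitsandbytes cold start / ZSE cold start, from the table above.
rows = {"7B": (45.3, 3.9), "14B": (78.2, 8.1), "32B": (120.0, 21.4)}
speedups = {model: round(bnb / zse, 1) for model, (bnb, zse) in rows.items()}
```

Note the speedup shrinks as the model grows: the larger the model, the more of the remaining load time is raw I/O rather than avoided quantization.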
Quick Install
Install ZSE from PyPI:
```shell
pip install zllm-zse
```
For GGUF model support, install with the optional dependency:
```shell
pip install zllm-zse[gguf]
```
Start a server with a pre-trained model:
```shell
# Start the server
zse serve Qwen/Qwen2.5-7B-Instruct

# Or with custom settings
zse serve Qwen/Qwen2.5-7B-Instruct --port 8080 --host 0.0.0.0
```
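Since the server exposes an OpenAI-compatible chat completions API, any standard client works against it. A minimal stdlib-only sketch (the host/port match the custom-settings example above; the endpoint path follows the standard OpenAI convention, which this sketch assumes ZSE mirrors):

```python
# Build a chat-completions request for a locally running ZSE server.
import json
from urllib import request

payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about cold starts."}],
    "stream": False,
}
req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With a server running, send it and read the reply:
# resp = request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
```

Setting `"stream": True` instead would return server-sent events for token-by-token output.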