Quick Start
Get up and running with ZSE in under 5 minutes. This guide covers installation, running your first model, and converting to the fast .zse format.
Prerequisites
Before installing ZSE, ensure you have:
- Python 3.8+ — We recommend Python 3.10 or newer
- CUDA 11.8+ — For GPU acceleration (optional but recommended)
- 8GB+ VRAM — For 7B models; 24GB+ for 32B models
CPU-Only Mode
ZSE can run on CPU-only machines, but performance will be significantly slower. GPU acceleration is recommended for production use.
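You can sanity-check these prerequisites before installing. This is a minimal sketch using only the standard library; it treats `nvidia-smi` on the PATH as a rough proxy for a working NVIDIA driver, which is a heuristic rather than part of ZSE itself:

```python
import shutil
import sys

# Check the interpreter against ZSE's minimum requirement (3.8+, 3.10+ recommended).
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
if sys.version_info < (3, 8):
    print("Python too old: ZSE needs 3.8+")

# nvidia-smi on the PATH is a quick, imperfect proxy for a usable NVIDIA driver.
if shutil.which("nvidia-smi"):
    print("NVIDIA driver found: GPU acceleration should be available")
else:
    print("No NVIDIA driver detected: expect CPU-only mode")
```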
Installation
Install ZSE from PyPI with pip:
```bash
pip install zllm-zse
```
For GGUF model support (Ollama/llama.cpp models):
```bash
pip install "zllm-zse[gguf]"
```
ZSE will automatically detect and use your GPU if available. No additional configuration is required for CUDA.
Verify the installation:
```bash
zse --version
# Output: zse 0.1.2

zse hardware
# Shows detected GPUs and available memory
```
Start the Server
Serve any HuggingFace model with a single command:
```bash
zse serve Qwen/Qwen2.5-7B-Instruct
```
The server will:
1. Download the model from HuggingFace (cached for future use)
2. Quantize it to INT4 format (first run only)
3. Start an OpenAI-compatible API at http://localhost:8000
Once started, you'll see:
```
Server running at http://localhost:8000
```
Common server options:
```bash
# Custom port and host
zse serve Qwen/Qwen2.5-7B-Instruct --port 8080 --host 0.0.0.0

# With API key authentication
zse serve Qwen/Qwen2.5-7B-Instruct --api-key your-secret-key

# Specify quantization type
zse serve Qwen/Qwen2.5-7B-Instruct --quant nf4
```
Make API Calls
ZSE provides an OpenAI-compatible API. Use any OpenAI SDK or make direct HTTP requests:
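A quick way to confirm the server is reachable is the standard OpenAI model-listing endpoint. This is a sketch using only the standard library; it assumes ZSE exposes `/v1/models`, as most OpenAI-compatible servers do:

```python
import json
import urllib.request

# Query the standard OpenAI model-listing endpoint to confirm the server is up.
url = "http://localhost:8000/v1/models"
try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        models = json.load(resp)
        print([m["id"] for m in models["data"]])
except OSError:
    print("Server not reachable: start it with `zse serve <model>` first")
```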
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Write hello world in Python"}
    ]
  }'
```
Using the Python OpenAI SDK:
```python
# client.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # unless you set --api-key
)

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "Write hello world in Python"}
    ],
    stream=True,  # enable streaming
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
Convert to .zse Format
For maximum performance, pre-convert models to the .zse format. This eliminates runtime quantization and achieves 11× faster cold starts.
1. Convert the model (a one-time conversion that takes ~20 seconds):
```bash
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse
```
2. Serve the converted model. Cold starts now take only 3.9 seconds:
```bash
zse serve qwen-7b.zse
```
The .zse file is ~4GB for 7B models. Ensure you have sufficient disk space.
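Before converting, you can check that enough disk space is free. A minimal sketch with the standard library; the 4 GB figure is the approximate .zse size for 7B models quoted above, so adjust it for larger models:

```python
import shutil

# A 7B model converts to roughly 4 GB on disk; verify free space first.
required_gb = 4
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space: {free_gb:.1f} GB")
if free_gb < required_gb:
    print("Not enough space for a 7B .zse file")
```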