Getting Started

Quick Start

Get up and running with ZSE in under 5 minutes. This guide covers installation, running your first model, and converting to the fast .zse format.

Prerequisites

Before installing ZSE, ensure you have:

  • Python 3.8+ — We recommend Python 3.10 or newer
  • CUDA 11.8+ — For GPU acceleration (optional but recommended)
  • 8GB+ VRAM — For 7B models; 24GB+ for 32B models

CPU-Only Mode

ZSE can run on CPU-only machines, but performance will be significantly slower. GPU acceleration is recommended for production use.

Installation

Install ZSE from PyPI with pip:

bash
pip install zllm-zse

For GGUF model support (Ollama/llama.cpp models):

bash
pip install "zllm-zse[gguf]"

ZSE will automatically detect and use your GPU if available; no additional configuration is required for CUDA.

Verify the installation:

bash
zse --version
# Output: zse 0.1.2
zse hardware
# Shows detected GPUs and available memory

Start the Server

Serve any HuggingFace model with a single command:

bash
zse serve Qwen/Qwen2.5-7B-Instruct

The server will:

  1. Download the model from HuggingFace (cached for future use)
  2. Quantize it to INT4 format (first run only)
  3. Start an OpenAI-compatible API at http://localhost:8000

Once started, you'll see: Server running at http://localhost:8000
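Because the API is OpenAI-compatible, scripts can wait for the server to finish loading by polling an endpoint before sending requests. A minimal sketch, assuming the standard /v1/models listing route is exposed (adjust the path if your deployment differs):

```python
import time
import urllib.request
import urllib.error

def wait_for_server(base_url, timeout=60.0, interval=1.0):
    """Poll the server until it responds, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # server not up yet; retry
    return False

# Example: wait_for_server("http://localhost:8000")
```

This is handy in launch scripts, since the first start includes download and quantization time.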

Common server options:

bash
# Custom port and host
zse serve Qwen/Qwen2.5-7B-Instruct --port 8080 --host 0.0.0.0
# With API key authentication
zse serve Qwen/Qwen2.5-7B-Instruct --api-key your-secret-key
# Specify quantization type
zse serve Qwen/Qwen2.5-7B-Instruct --quant nf4

Make API Calls

ZSE provides an OpenAI-compatible API. Use any OpenAI SDK or make direct HTTP requests:

curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Write hello world in Python"}
    ]
  }'
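If you'd rather not depend on an SDK, the same request works with Python's standard library. A sketch against the /v1/chat/completions route shown above (the helper names here are illustrative, not part of ZSE):

```python
import json
import urllib.request

def build_chat_payload(prompt, model="default"):
    """Build the JSON body for a single-turn chat completion request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt, base_url="http://localhost:8000", model="default"):
    """Send the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running: print(chat("Write hello world in Python"))
```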

Using the Python OpenAI SDK:

client.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Unless you set --api-key
)

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "Write hello world in Python"}
    ],
    stream=True  # Enable streaming
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
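Chat completion requests are stateless: the server only sees the messages you send, so for multi-turn conversations you resend the full history each time. A small (hypothetical) helper for managing that list:

```python
class Conversation:
    """Accumulates chat history so each request carries full context."""

    def __init__(self, system_prompt=None):
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def add_user(self, content):
        self.messages.append({"role": "user", "content": content})
        return self.messages  # pass this as `messages=` in the API call

    def add_assistant(self, content):
        self.messages.append({"role": "assistant", "content": content})
```

After each completion, record the model's reply with add_assistant so the next turn includes it.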

Convert to .zse Format

For maximum performance, pre-convert models to the .zse format. This eliminates runtime quantization and achieves 11× faster cold starts.

1. Convert the model

One-time conversion; takes about 20 seconds:

bash
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse

2. Serve the converted model

Cold starts now take only 3.9 seconds:

bash
zse serve qwen-7b.zse

The .zse file is ~4GB for 7B models. Ensure you have sufficient disk space.
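The ~4GB figure follows from INT4 quantization: roughly half a byte per weight, plus some room for quantization scales and metadata. A back-of-envelope estimate (the 10% overhead factor is an assumption, not a measured value):

```python
def estimate_zse_size_gb(num_params, bits_per_weight=4, overhead=0.10):
    """Rough on-disk size of a quantized model, in GB (10^9 bytes)."""
    raw_bytes = num_params * bits_per_weight / 8
    return raw_bytes * (1 + overhead) / 1e9

# A 7B model at 4 bits per weight lands near the ~4GB noted above:
print(round(estimate_zse_size_gb(7e9), 1))
```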

Next Steps