Getting Started

Quick Start

Get up and running with ZSE in under 5 minutes. This guide covers installation, running your first model, and converting to the fast .zse format.

Prerequisites

Before installing ZSE, ensure you have:

  • Python 3.8+ — We recommend Python 3.10 or newer
  • CUDA 11.8+ — For GPU acceleration (optional but recommended)
  • 8GB+ VRAM — For 7B models; 24GB+ for 32B models

CPU-Only Mode

ZSE can run on CPU-only machines, but performance will be significantly slower. GPU acceleration is recommended for production use.

Installation

Install ZSE from PyPI with pip:

bash
pip install zllm-zse

For GGUF model support (Ollama/llama.cpp models):

bash
pip install "zllm-zse[gguf]"

ZSE will automatically detect and use your GPU if available; no additional configuration is required for CUDA.

Verify the installation:

bash
zse --version
# Output: zse 0.1.2
zse hardware
# Shows detected GPUs and available memory

Start the Server

Serve any HuggingFace model with a single command:

bash
zse serve Qwen/Qwen2.5-7B-Instruct

The server will:

  1. Download the model from HuggingFace (cached for future use)
  2. Quantize it to INT4 format (first run only)
  3. Start an OpenAI-compatible API at http://localhost:8000

Once started, you'll see: Server running at http://localhost:8000
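Because the API is OpenAI-compatible, scripts can wait for the server to finish loading by polling an endpoint before sending requests. A minimal sketch, assuming the standard /v1/models listing route is exposed (adjust the path if your deployment differs):

```python
import time
import urllib.request
import urllib.error

def wait_for_server(base_url, timeout=60.0, interval=1.0):
    """Poll the server until it responds, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # server not up yet; retry
    return False

# Example: wait_for_server("http://localhost:8000")
```

This is handy in launch scripts, since the first start includes download and quantization time.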

Common server options:

bash
# Custom port and host
zse serve Qwen/Qwen2.5-7B-Instruct --port 8080 --host 0.0.0.0
# With API key authentication
zse serve Qwen/Qwen2.5-7B-Instruct --api-key your-secret-key
# Specify quantization type
zse serve Qwen/Qwen2.5-7B-Instruct --quant nf4

Make API Calls

ZSE provides an OpenAI-compatible API. Use any OpenAI SDK or make direct HTTP requests:

curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "Write hello world in Python"}
    ]
  }'
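If you'd rather not depend on an SDK, the same request works with Python's standard library. A sketch against the /v1/chat/completions route shown above (the helper names here are illustrative, not part of ZSE):

```python
import json
import urllib.request

def build_chat_payload(prompt, model="default"):
    """Build the JSON body for a single-turn chat completion request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt, base_url="http://localhost:8000", model="default"):
    """Send the request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running: print(chat("Write hello world in Python"))
```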

Using the Python OpenAI SDK:

client.py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Unless you set --api-key
)

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "Write hello world in Python"}
    ],
    stream=True  # Enable streaming
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
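Chat completion requests are stateless: the server only sees the messages you send, so for multi-turn conversations you resend the full history each time. A small (hypothetical) helper for managing that list:

```python
class Conversation:
    """Accumulates chat history so each request carries full context."""

    def __init__(self, system_prompt=None):
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def add_user(self, content):
        self.messages.append({"role": "user", "content": content})
        return self.messages  # pass this as `messages=` in the API call

    def add_assistant(self, content):
        self.messages.append({"role": "assistant", "content": content})
```

After each completion, record the model's reply with add_assistant so the next turn includes it.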

Convert to .zse Format

For maximum performance, pre-convert models to the .zse format. This eliminates runtime quantization and achieves 11× faster cold starts.

1. Convert the model

One-time conversion; takes about 20 seconds:

bash
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse

2. Serve the converted model

Cold starts now take only 3.9 seconds:

bash
zse serve qwen-7b.zse

The .zse file is ~4GB for 7B models. Ensure you have sufficient disk space.
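The ~4GB figure follows from INT4 quantization: roughly half a byte per weight, plus some room for quantization scales and metadata. A back-of-envelope estimate (the 10% overhead factor is an assumption, not a measured value):

```python
def estimate_zse_size_gb(num_params, bits_per_weight=4, overhead=0.10):
    """Rough on-disk size of a quantized model, in GB (10^9 bytes)."""
    raw_bytes = num_params * bits_per_weight / 8
    return raw_bytes * (1 + overhead) / 1e9

# A 7B model at 4 bits per weight lands near the ~4GB noted above:
print(round(estimate_zse_size_gb(7e9), 1))
```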

Next Steps