# Complete Guide: Running Your First Model with ZSE
This guide walks you through installing ZSE and running your first LLM inference in under 5 minutes.
## Prerequisites

- Python 3.9+
- CUDA 11.8+ (for GPU) or CPU-only mode
- 8GB+ of VRAM for 7B models
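Before installing, it can be worth confirming the Python prerequisite programmatically; a minimal sketch (the `meets_requirement` helper is illustrative, not part of ZSE):

```python
# Quick sanity check for the Python prerequisite (3.9+) before installing.
import sys

def meets_requirement(version_info, minimum=(3, 9)):
    """Return True if the interpreter satisfies the minimum version."""
    return tuple(version_info[:2]) >= minimum

if not meets_requirement(sys.version_info):
    raise SystemExit("ZSE requires Python 3.9 or newer")
```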
## Step 1: Install ZSE

```bash
pip install zllm-zse
```
Verify the installation:

```bash
zse --version
zse hardware  # check GPU detection
```
## Step 2: Convert a Model

Convert a Hugging Face model to the `.zse` format:

```bash
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse
```
This downloads the model weights (~14GB), quantizes them to NF4, and writes a 4.2GB `.zse` file.
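The roughly 3× shrink is what 4-bit quantization predicts; a back-of-envelope check (the ~7.6B parameter count for Qwen2.5-7B and the 64-weight scale blocks are assumptions; exact overheads vary by format):

```python
# Back-of-envelope size estimate for NF4 quantization of a ~7.6B-param model.
params = 7.6e9
fp16_gb = params * 2 / 1e9    # 2 bytes/param in FP16/BF16 ≈ download size
nf4_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes/param for the weights
# NF4 also stores a scale per block of weights (assumed here: one FP16
# scale per 64 weights), adding a few percent on top.
scale_overhead_gb = params / 64 * 2 / 1e9
print(f"FP16: ~{fp16_gb:.1f} GB, NF4: ~{nf4_gb + scale_overhead_gb:.1f} GB")
```

This lands close to the observed ~14GB download and 4.2GB output; the remaining gap comes from embeddings, metadata, and per-format details.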
## Step 3: Start the Server

```bash
zse serve qwen-7b.zse --port 8000
```
## Step 4: Send a Request

Using curl:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
Or with Python, using the OpenAI client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="qwen-7b",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```
## Next Steps

- Enable streaming with `stream=True`
- Try different quantization levels: `--quant int4` or `--quant int8`
- Read the [Documentation](/docs) for advanced features
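Streaming is the most common next step. A minimal sketch, assuming the server from Step 3 is running on `localhost:8000` and honors the OpenAI-style `stream=True` flag; the `render_stream` helper is illustrative, not part of any library:

```python
def render_stream(chunks) -> str:
    """Print text deltas from a chat-completions stream as they arrive,
    and return the assembled reply."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk may carry no content
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)

if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    stream = client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": "Explain quantum computing"}],
        stream=True,
    )
    reply = render_stream(stream)
```

Printing each delta as it arrives is what gives the low time-to-first-token feel; the assembled string is still available afterward for logging or history.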