
Complete Guide: Running Your First Model with ZSE

ZSE Team · February 24, 2026 · 6 min read


This guide walks you through installing ZSE and running your first LLM inference in under 5 minutes.

Prerequisites

- Python 3.9+
- CUDA 11.8+ (for GPU) or CPU-only mode
- 8GB+ of VRAM for 7B models

Step 1: Install ZSE

```bash
pip install zllm-zse
```

Verify installation:

```bash
zse --version
zse hardware   # check GPU detection
```

Step 2: Convert a Model

Convert a HuggingFace model to .zse format:

```bash
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse
```

This downloads the model weights (~14GB), quantizes them to NF4, and saves a ~4.2GB .zse file.
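The ~4.2GB figure is consistent with back-of-the-envelope arithmetic: NF4 stores roughly 4 bits per weight plus a scale factor per block of weights. The block size and scale format below are generic assumptions for illustration, not ZSE internals:

```python
# Rough size estimate for a 7B-class model quantized to NF4.
params = 7.6e9           # Qwen2.5-7B has ~7.6B parameters
bytes_per_weight = 0.5   # NF4 = 4 bits per weight
block = 64               # assumed quantization block size
scale_bytes = 2          # assumed fp16 scale per block

size_gb = params * (bytes_per_weight + scale_bytes / block) / 1e9
print(f"~{size_gb:.1f} GB")  # → ~4.0 GB, close to the observed 4.2GB
```

The small gap to 4.2GB is plausibly embeddings, norms, and metadata kept at higher precision.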

Step 3: Start the Server

```bash
zse serve qwen-7b.zse --port 8000
```

Step 4: Send a Request

Using curl:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
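The server replies with a standard OpenAI-style chat completion object. As an illustration (the field values below are made up, not real server output), the assistant's text can be pulled out of the JSON like this:

```python
import json

# Abridged example of an OpenAI-compatible chat completion body.
body = '''{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "qwen-7b",
  "choices": [{"index": 0,
               "message": {"role": "assistant", "content": "Hello! How can I help?"},
               "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 9, "completion_tokens": 7, "total_tokens": 16}
}'''

# The reply text lives at choices[0].message.content.
reply = json.loads(body)["choices"][0]["message"]["content"]
print(reply)  # → Hello! How can I help?
```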

Or with Python:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen-7b",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)

print(response.choices[0].message.content)
```

Next Steps

- Enable streaming with `stream=True`
- Try different quantization: `--quant int4` or `--quant int8`
- Read the [Documentation](/docs) for advanced features
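As a sketch of the streaming option, a small helper (the name `stream_chat` is ours, not part of any SDK) can print tokens as they arrive and return the assembled reply, assuming the same OpenAI-compatible endpoint used above:

```python
def stream_chat(client, model: str, prompt: str) -> str:
    """Stream a chat completion, printing tokens as they arrive, and
    return the full reply. `client` is an OpenAI-compatible client,
    e.g. openai.OpenAI(base_url="http://localhost:8000/v1",
    api_key="not-needed")."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # deltas arrive chunk by chunk
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks (e.g. the role header) carry no text
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)
```

With the server from Step 3 running, call it as `stream_chat(client, "qwen-7b", "Hello!")`.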
