Getting Started

Your First Model

Download, convert, and run your first LLM with ZSE in under 5 minutes.

Choosing a Model

Choose a model based on your GPU memory:

| GPU VRAM | Recommended Model | HuggingFace ID |
| --- | --- | --- |
| 6-8 GB | Qwen 2.5 3B | Qwen/Qwen2.5-3B-Instruct |
| 8-12 GB | Qwen 2.5 7B ✓ | Qwen/Qwen2.5-7B-Instruct |
| 12-16 GB | Llama 3.1 8B | meta-llama/Llama-3.1-8B-Instruct |
| 24+ GB | Qwen 2.5 14B | Qwen/Qwen2.5-14B-Instruct |
Not sure? Check your GPU with zse hardware
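The recommendations above can be sketched as a small helper. This is a hypothetical convenience function, not part of the ZSE CLI or library; the thresholds simply mirror the table:

```python
def pick_model(vram_gb: float) -> str:
    """Return the HuggingFace model ID recommended for a given
    amount of GPU VRAM, following the table above."""
    if vram_gb >= 24:
        return "Qwen/Qwen2.5-14B-Instruct"
    if vram_gb >= 12:
        return "meta-llama/Llama-3.1-8B-Instruct"
    if vram_gb >= 8:
        return "Qwen/Qwen2.5-7B-Instruct"
    if vram_gb >= 6:
        return "Qwen/Qwen2.5-3B-Instruct"
    raise ValueError("Under 6 GB VRAM: consider a smaller or more heavily quantized model")

print(pick_model(10))  # Qwen/Qwen2.5-7B-Instruct
```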

Download & Convert

1. Check your hardware. Verify that your GPU is detected:

   ```bash
   zse hardware
   ```

2. Convert the model. Download it from HuggingFace and convert it to the .zse format:

   ```bash
   zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse
   ```

3. Verify the conversion. Check the model info:

   ```bash
   zse info qwen-7b.zse
   ```
The first conversion downloads ~7 GB and takes ~2 minutes on a GPU (longer on CPU). Downloaded files are cached, so repeat conversions skip the download.

Running Your Model

Interactive Chat

```bash
zse chat qwen-7b.zse
```

API Server

```bash
# Start the server
zse serve qwen-7b.zse

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
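The same request can be built from Python with only the standard library. This is a sketch: it constructs the exact payload the curl example sends, and the commented-out lines assume a server started with `zse serve` is listening on localhost:8000 and returns an OpenAI-style `choices` array (an assumption based on the `/v1/chat/completions` path, not confirmed by ZSE's docs):

```python
import json
import urllib.request

# Same payload as the curl example above.
payload = {
    "model": "qwen-7b",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Sending the request requires a running `zse serve` instance:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```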

Python Code

```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")
response = model.chat([
    {"role": "user", "content": "Hello!"}
])
print(response)
```

Verify Everything Works

Run a quick benchmark to verify your setup:

```bash
zse benchmark qwen-7b.zse
```

Expected results on consumer GPUs:

| GPU | Cold Start | Throughput |
| --- | --- | --- |
| RTX 3060 12GB | 4.2s | ~45 tok/s |
| RTX 4070 12GB | 3.9s | ~80 tok/s |
| RTX 4090 24GB | 3.5s | ~120 tok/s |
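To translate these numbers into wall-clock time, a reply of N tokens takes roughly cold start + N / throughput. A back-of-the-envelope sketch (not a ZSE API, just arithmetic over the table above):

```python
def generation_time(tokens: int, tok_per_s: float, cold_start_s: float = 0.0) -> float:
    """Estimated wall-clock seconds: cold start plus steady-state decoding."""
    return cold_start_s + tokens / tok_per_s

# A 512-token reply on an RTX 3060 (4.2s cold start, ~45 tok/s):
print(round(generation_time(512, 45, cold_start_s=4.2), 1))  # ~15.6 seconds
```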

Next Steps

  • Learn about model formats and quantization
  • Set up a production API server with zServe
  • Explore streaming responses with zStream
  • Optimize memory with KV cache compression
You are ready to use ZSE! Check out the features documentation to learn more.