Getting Started

Your First Model

Download, convert, and run your first LLM with ZSE in under 5 minutes.

Choosing a Model

Choose a model based on your GPU memory:

| GPU VRAM | Recommended Model | HuggingFace ID |
| --- | --- | --- |
| 6-8 GB | Qwen 2.5 3B | Qwen/Qwen2.5-3B-Instruct |
| 8-12 GB | Qwen 2.5 7B ✓ | Qwen/Qwen2.5-7B-Instruct |
| 12-16 GB | Llama 3.1 8B | meta-llama/Llama-3.1-8B-Instruct |
| 24+ GB | Qwen 2.5 14B | Qwen/Qwen2.5-14B-Instruct |
Not sure? Check your GPU with zse hardware
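The recommendations above can be sketched as a small helper. This is a hypothetical convenience function, not part of the ZSE CLI or library; the thresholds simply mirror the table:

```python
def pick_model(vram_gb: float) -> str:
    """Return the HuggingFace model ID recommended for a given
    amount of GPU VRAM, following the table above."""
    if vram_gb >= 24:
        return "Qwen/Qwen2.5-14B-Instruct"
    if vram_gb >= 12:
        return "meta-llama/Llama-3.1-8B-Instruct"
    if vram_gb >= 8:
        return "Qwen/Qwen2.5-7B-Instruct"
    if vram_gb >= 6:
        return "Qwen/Qwen2.5-3B-Instruct"
    raise ValueError("Under 6 GB VRAM: consider a smaller or more heavily quantized model")

print(pick_model(10))  # Qwen/Qwen2.5-7B-Instruct
```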

Download & Convert

1. Check your hardware. Verify that your GPU is detected:

   ```bash
   zse hardware
   ```

2. Convert the model. Download it from HuggingFace and convert it to the .zse format:

   ```bash
   zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse
   ```

3. Verify the conversion. Check the model info:

   ```bash
   zse info qwen-7b.zse
   ```
The first conversion downloads ~7 GB and takes ~2 minutes on a GPU (longer on CPU). Downloaded files are cached, so repeat conversions skip the download.

Running Your Model

Interactive Chat

```bash
zse chat qwen-7b.zse
```

API Server

```bash
# Start the server
zse serve qwen-7b.zse

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
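The same request can be built from Python with only the standard library. This is a sketch: it constructs the exact payload the curl example sends, and the commented-out lines assume a server started with `zse serve` is listening on localhost:8000 and returns an OpenAI-style `choices` array (an assumption based on the `/v1/chat/completions` path, not confirmed by ZSE's docs):

```python
import json
import urllib.request

# Same payload as the curl example above.
payload = {
    "model": "qwen-7b",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Sending the request requires a running `zse serve` instance:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```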

Python Code

```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")
response = model.chat([
    {"role": "user", "content": "Hello!"}
])
print(response)
```

Verify Everything Works

Run a quick benchmark to verify your setup:

```bash
zse benchmark qwen-7b.zse
```

Expected results on consumer GPUs:

| GPU | Cold Start | Throughput |
| --- | --- | --- |
| RTX 3060 12GB | 4.2s | ~45 tok/s |
| RTX 4070 12GB | 3.9s | ~80 tok/s |
| RTX 4090 24GB | 3.5s | ~120 tok/s |
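To translate these numbers into wall-clock time, a reply of N tokens takes roughly cold start + N / throughput. A back-of-the-envelope sketch (not a ZSE API, just arithmetic over the table above):

```python
def generation_time(tokens: int, tok_per_s: float, cold_start_s: float = 0.0) -> float:
    """Estimated wall-clock seconds: cold start plus steady-state decoding."""
    return cold_start_s + tokens / tok_per_s

# A 512-token reply on an RTX 3060 (4.2s cold start, ~45 tok/s):
print(round(generation_time(512, 45, cold_start_s=4.2), 1))  # ~15.6 seconds
```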

Next Steps

  • Learn about model formats and quantization
  • Set up a production API server with zServe
  • Explore streaming responses with zStream
  • Optimize memory with KV cache compression
You are ready to use ZSE! Check out the features documentation to learn more.