# Complete Guide: Running Your First Model with ZSE
This guide walks you through installing ZSE and running your first LLM inference in under 5 minutes.
## Prerequisites

- Python 3.9+
- CUDA 11.8+ (for GPU) or CPU-only mode
- 8GB+ of VRAM for 7B models
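Before installing, it can be worth confirming the Python prerequisite programmatically; a minimal sketch (the `meets_requirement` helper is illustrative, not part of ZSE):

```python
# Quick sanity check for the Python prerequisite (3.9+) before installing.
import sys

def meets_requirement(version_info, minimum=(3, 9)):
    """Return True if the interpreter satisfies the minimum version."""
    return tuple(version_info[:2]) >= minimum

if not meets_requirement(sys.version_info):
    raise SystemExit("ZSE requires Python 3.9 or newer")
```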
## Step 1: Install ZSE

```bash
pip install zllm-zse
```
Verify the installation:

```bash
zse --version
zse hardware  # check GPU detection
```
## Step 2: Convert a Model

Convert a Hugging Face model to the `.zse` format:

```bash
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse
```
This downloads the model weights (~14GB), quantizes them to NF4, and writes a 4.2GB `.zse` file.
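The roughly 3× shrink is what 4-bit quantization predicts; a back-of-envelope check (the ~7.6B parameter count for Qwen2.5-7B and the 64-weight scale blocks are assumptions; exact overheads vary by format):

```python
# Back-of-envelope size estimate for NF4 quantization of a ~7.6B-param model.
params = 7.6e9
fp16_gb = params * 2 / 1e9    # 2 bytes/param in FP16/BF16 ≈ download size
nf4_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes/param for the weights
# NF4 also stores a scale per block of weights (assumed here: one FP16
# scale per 64 weights), adding a few percent on top.
scale_overhead_gb = params / 64 * 2 / 1e9
print(f"FP16: ~{fp16_gb:.1f} GB, NF4: ~{nf4_gb + scale_overhead_gb:.1f} GB")
```

This lands close to the observed ~14GB download and 4.2GB output; the remaining gap comes from embeddings, metadata, and per-format details.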
## Step 3: Start the Server

```bash
zse serve qwen-7b.zse --port 8000
```
## Step 4: Send a Request

Using curl:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
Or with Python, using the OpenAI client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="qwen-7b",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```
## Next Steps

- Enable streaming with `stream=True`
- Try different quantization levels: `--quant int4` or `--quant int8`
- Read the [Documentation](/docs) for advanced features
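Streaming is the most common next step. A minimal sketch, assuming the server from Step 3 is running on `localhost:8000` and honors the OpenAI-style `stream=True` flag; the `render_stream` helper is illustrative, not part of any library:

```python
def render_stream(chunks) -> str:
    """Print text deltas from a chat-completions stream as they arrive,
    and return the assembled reply."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk may carry no content
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)

if __name__ == "__main__":
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    stream = client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": "Explain quantum computing"}],
        stream=True,
    )
    reply = render_stream(stream)
```

Printing each delta as it arrives is what gives the low time-to-first-token feel; the assembled string is still available afterward for logging or history.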