Your First Model
Download, convert, and run your first LLM with ZSE in under 5 minutes.
Choosing a Model
Choose a model based on your GPU memory:
| GPU VRAM | Recommended Model | HuggingFace ID |
|---|---|---|
| 6-8 GB | Qwen 2.5 3B | Qwen/Qwen2.5-3B-Instruct |
| 8-12 GB | Qwen 2.5 7B ✓ | Qwen/Qwen2.5-7B-Instruct |
| 12-16 GB | Llama 3.1 8B | meta-llama/Llama-3.1-8B-Instruct |
| 24+ GB | Qwen 2.5 14B | Qwen/Qwen2.5-14B-Instruct |
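The table's rule of thumb can be expressed in code. This is illustrative only: the model IDs come from the table above, and the function itself is not part of ZSE.

```python
# Illustrative helper matching the table above; thresholds are each row's
# lower VRAM bound. Not a ZSE API.
def recommend_model(vram_gb: float) -> str:
    """Return a HuggingFace model ID suited to the given GPU VRAM."""
    if vram_gb >= 24:
        return "Qwen/Qwen2.5-14B-Instruct"
    if vram_gb >= 12:
        return "meta-llama/Llama-3.1-8B-Instruct"
    if vram_gb >= 8:
        return "Qwen/Qwen2.5-7B-Instruct"
    if vram_gb >= 6:
        return "Qwen/Qwen2.5-3B-Instruct"
    raise ValueError("Under 6 GB VRAM: consider a smaller or quantized model")

print(recommend_model(10))  # -> Qwen/Qwen2.5-7B-Instruct
```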
Not sure? Check your GPU with `zse hardware`.

Download & Convert

1. Check your hardware. Verify the GPU is detected:

   ```bash
   zse hardware
   ```

2. Convert the model. Download it from HuggingFace and convert it to the .zse format:

   ```bash
   zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse
   ```

3. Verify the conversion. Check the model info:

   ```bash
   zse info qwen-7b.zse
   ```

The first conversion downloads ~7 GB and takes ~2 minutes on a GPU (longer on CPU). Files are cached for future conversions.
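If you want to script the pipeline, the three steps above can be driven from Python via `subprocess`. A sketch, assuming `zse` is on your PATH; the commands are exactly those shown above.

```python
import shutil
import subprocess

# The three quickstart steps as argument lists (taken from the steps above).
STEPS = [
    ["zse", "hardware"],
    ["zse", "convert", "Qwen/Qwen2.5-7B-Instruct", "-o", "qwen-7b.zse"],
    ["zse", "info", "qwen-7b.zse"],
]

def run_steps(dry_run: bool = False) -> list[str]:
    """Run each step in order, aborting on the first failure.

    Returns the commands that were (or would be) run, as strings.
    """
    executed = []
    for cmd in STEPS:
        executed.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return executed

if __name__ == "__main__":
    if shutil.which("zse") is None:
        # zse not installed here: just show what would run.
        for line in run_steps(dry_run=True):
            print("would run:", line)
    else:
        run_steps()
```

`check=True` makes the script stop at the first failing step, so a failed conversion is never followed by `zse info` on a missing file.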
Running Your Model
Interactive Chat

```bash
zse chat qwen-7b.zse
```

API Server
```bash
# Start the server
zse serve qwen-7b.zse

# Test with curl
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Python Code
```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")
response = model.chat([
    {"role": "user", "content": "Hello!"}
])
print(response)
```

Verify Everything Works
Run a quick benchmark to verify your setup:
```bash
zse benchmark qwen-7b.zse
```

Expected results on consumer GPUs:
| GPU | Cold Start | Throughput |
|---|---|---|
| RTX 3060 12GB | 4.2s | ~45 tok/s |
| RTX 4070 12GB | 3.9s | ~80 tok/s |
| RTX 4090 24GB | 3.5s | ~120 tok/s |
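These numbers give a quick way to sanity-check end-to-end latency. A back-of-the-envelope estimate (plain arithmetic using the table's figures, not a ZSE API):

```python
def generation_time_s(tokens: int, tok_per_s: float, cold_start_s: float = 0.0) -> float:
    """Estimate wall-clock seconds to generate `tokens` at a given throughput."""
    return cold_start_s + tokens / tok_per_s

# 512 tokens on an RTX 3060 (~45 tok/s, 4.2 s cold start):
print(round(generation_time_s(512, 45, 4.2), 1))  # -> 15.6
```

If your measured numbers are far below the table, check that the model actually loaded onto the GPU rather than falling back to CPU.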
Next Steps
- Learn about model formats and quantization
- Set up a production API server with zServe
- Explore streaming responses with zStream
- Optimize memory with KV cache compression
You are ready to use ZSE! Check out the features documentation to learn more.