Model Formats
Understanding different model formats supported by ZSE: .zse, GGUF, and safetensors.
Overview
ZSE supports multiple model formats, each with different tradeoffs for loading speed, memory usage, and compatibility.
- **.zse**: Native format — fastest loading, best optimization
- **GGUF**: Ollama/llama.cpp compatible — good portability
- **Safetensors**: HuggingFace format — universal compatibility
.zse Format
The native ZSE format offers the fastest loading times and best memory efficiency.
- Pre-quantized weights — no runtime quantization
- Memory-mapped loading — instant access
- Optimized tensor layout — sequential reads
- Built-in tokenizer and config
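Memory-mapped loading can be sketched in a few lines. This is an illustration of the general technique using `numpy.memmap`, not ZSE's actual loader; the file path, shape, and dtype are assumptions:

```python
import numpy as np

# Illustrative sketch of memory-mapped tensor loading (not ZSE's real
# loader). The OS maps the file into the address space without copying
# it, so "loading" returns almost immediately; pages are read from disk
# only when a tensor is actually touched.
def mmap_tensor(path: str, shape: tuple, dtype=np.uint8):
    # mode="r": read-only mapping, no upfront I/O beyond file metadata
    return np.memmap(path, dtype=dtype, mode="r", shape=shape)
```

Because nothing is copied, multiple processes serving the same model can share one set of physical pages.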
Structure:
```
model.zse
├── header.json      # Model metadata
├── config.json      # Model configuration
├── tokenizer/       # Tokenizer files
│   ├── vocab.json
│   └── merges.txt
└── tensors/         # Quantized weights
    ├── embed.bin
    ├── layer_0.bin
    ├── layer_1.bin
    └── ...
```

Creating .zse files:
```bash
# From a HuggingFace model
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse

# From local safetensors
zse convert ./my-model/ -o my-model.zse

# With specific quantization
zse convert model -o model.zse --quant nf4
```

GGUF Format
GGUF (GPT-Generated Unified Format) is used by llama.cpp and Ollama. ZSE can import GGUF files directly.
```bash
pip install zllm-zse[gguf]
```

Loading GGUF models:
```bash
# Serve GGUF directly
zse serve ./qwen-7b-q4_k_m.gguf

# Convert GGUF to .zse for faster loading
zse convert ./model.gguf -o model.zse
```

Supported GGUF quantization types:
| Type | Bits | Description |
|---|---|---|
| Q4_0 | 4 | Basic 4-bit quantization |
| Q4_K_M | 4 | K-quants medium (recommended) |
| Q5_K_M | 5 | Higher quality |
| Q8_0 | 8 | Best quality |
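The Bits column translates directly into approximate weight memory: size scales roughly linearly with bit width. A back-of-the-envelope estimator (illustrative only; it ignores the small per-block scale/zero-point overhead that K-quants actually carry):

```python
# Rough weight-memory estimate from parameter count and quant bit width.
# K-quants store extra per-block scales/mins, so real files run slightly
# larger; treat these figures as lower bounds.
def approx_weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

# e.g. a 7B model: 4-bit ≈ 3.5 GB, 8-bit ≈ 7.0 GB
```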
Safetensors
Safetensors is the standard format for HuggingFace models. ZSE can load safetensors directly with runtime quantization.
```bash
# Load from the HuggingFace Hub
zse serve Qwen/Qwen2.5-7B-Instruct

# Load local safetensors
zse serve ./my-local-model/
```

For faster startup on repeated runs, convert safetensors models to the .zse format.

Comparison
Choose the right format for your use case:
| Feature | .zse | GGUF | Safetensors |
|---|---|---|---|
| Cold Start (7B) | 3.9s | ~15s | ~45s |
| Pre-quantized | ✓ | ✓ | ✗ |
| Memory-mapped | ✓ | ✓ | ✓ |
| Portability | ZSE only | Ollama, llama.cpp | Universal |
| Best For | Production | Cross-platform | Development |
Recommendation
Use .zse for production deployments where cold start time matters. Use GGUF if you need compatibility with Ollama or llama.cpp. Use safetensors for quick experimentation with HuggingFace models.
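The recommendation reduces to a simple decision rule, sketched below as plain Python. The flags are ours for illustration and are not part of the ZSE CLI:

```python
# Hypothetical helper encoding the format recommendation above.
def pick_format(production: bool, needs_ollama: bool) -> str:
    # Cold-start-sensitive production serving: native format wins.
    if production:
        return ".zse"
    # Sharing models with Ollama / llama.cpp users.
    if needs_ollama:
        return "gguf"
    # Quick experimentation straight from HuggingFace.
    return "safetensors"
```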