# zQuantize
Convert models to optimized .zse format with extreme compression and minimal quality loss.
## Overview
zQuantize converts transformer models from HuggingFace, safetensors, or GGUF to ZSE's native format with configurable quantization.
- 4-bit and 8-bit quantization with calibration
- NF4 (NormalFloat4) for best quality
- Group quantization for accuracy preservation
- Mixed precision for critical layers
- CUDA-accelerated conversion
## Quick Start

**Convert a model**

Run the convert command with default settings (NF4):

```bash
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse
```

**Verify the conversion**

Check model info and size:

```bash
zse info qwen-7b.zse
```

**Test the model**

Run a quick inference test:

```bash
zse chat qwen-7b.zse -p "Hello, how are you?"
```

## Quantization Types
Choose the right quantization type for your memory/quality tradeoff:
- **NF4 (Default)**: best-quality 4-bit, normalized float distribution
- **INT4**: standard 4-bit, fastest conversion speed
- **INT8**: 8-bit integer, higher quality, larger files
- **FP16**: half precision, maximum quality, largest files
### NF4 Quantization

NormalFloat4 is optimized for the weight distribution of neural networks:

```bash
zse convert model-id -o model.zse --quant nf4
```

- Asymmetric quantization grid
- Optimal for normally-distributed weights
- Industry-leading 4-bit quality
- Default for most models
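For intuition, here is a minimal sketch of how an NF4-style codebook can be built: 16 levels placed at quantiles of a standard normal and rescaled to [-1, 1], so levels are dense where weights cluster near zero. This illustrates the general NormalFloat4 idea only; the `offset` value and exact grid below are assumptions, and zse's internal grid may differ.

```python
import numpy as np
from scipy.stats import norm

def nf4_codebook(offset: float = 0.9677) -> np.ndarray:
    """Illustrative NF4-style grid: 16 levels at normal quantiles.

    The asymmetric split (7 negative, zero, 8 positive) keeps an exact
    zero level; `offset` bounds the outermost quantile (assumed value).
    """
    pos = norm.ppf(np.linspace(0.5, offset, 9))[1:]          # 8 positive levels
    neg = -norm.ppf(np.linspace(0.5, offset, 8))[1:][::-1]   # 7 negative levels
    levels = np.concatenate([neg, [0.0], pos])
    return levels / np.abs(levels).max()                      # rescale to [-1, 1]

def quantize_nf4(w: np.ndarray) -> tuple[np.ndarray, float]:
    # Absmax-scale the tensor into [-1, 1], then snap each weight
    # to the nearest codebook level.
    levels = nf4_codebook()
    scale = float(np.abs(w).max())
    codes = np.abs(w.ravel()[:, None] / scale - levels).argmin(axis=1)
    return codes.astype(np.uint8), scale  # 4-bit codes plus one fp scale
```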
### INT4 Quantization

Standard symmetric 4-bit integer quantization:

```bash
zse convert model-id -o model.zse --quant int4
```

- Symmetric quantization grid
- Faster conversion than NF4
- Compatible with more hardware
- Slightly lower quality than NF4
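A minimal sketch of the symmetric scheme: the grid is the 16 integers -8..7, and a single absmax-derived step maps weights onto it. Function names here are illustrative, not zse API.

```python
import numpy as np

def quantize_int4(w: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric grid: one scale maps the largest magnitude onto +/-7.
    scale = float(np.abs(w).max()) / 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruction is a single multiply, which is why symmetric
    # int grids convert (and decode) faster than codebook schemes.
    return q.astype(np.float32) * scale
```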
### INT8 Quantization

8-bit quantization for higher quality:

```bash
zse convert model-id -o model.zse --quant int8
```

### FP16 (No Quantization)
Convert without quantization — useful for fine-tuning or maximum quality:
```bash
zse convert model-id -o model.zse --quant fp16
```

How the types compare on a 7B model:

| Type | 7B Size | Quality | Speed |
|---|---|---|---|
| nf4 | 4.2 GB | ★★★★☆ | ★★★★☆ |
| int4 | 4.0 GB | ★★★☆☆ | ★★★★★ |
| int8 | 7.5 GB | ★★★★★ | ★★★☆☆ |
| fp16 | 14 GB | ★★★★★ | ★★☆☆☆ |
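The 4-bit sizes exceed the raw code payload (7e9 parameters at 4 bits is about 3.5 GB) because per-group scales and any layers kept at higher precision add overhead. A rough back-of-the-envelope estimate; the breakdown below is an assumption about the file layout, not zse's exact format:

```python
params = 7e9                       # 7B parameters
codes = params * 4 / 8             # 4-bit codes: ~3.5 GB
scales = params / 128 * 2          # assumed fp16 scale per 128-weight group: ~0.11 GB
print(f"{(codes + scales) / 1e9:.2f} GB")  # ~3.61 GB before any fp16-kept layers
```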
## Advanced Options

### Group Size

Control quantization granularity. Smaller groups give better quality but larger files:

```bash
# Default group size (128)
zse convert model -o model.zse --quant nf4

# Smaller groups for better quality
zse convert model -o model.zse --quant nf4 --group-size 64

# Larger groups for smaller files
zse convert model -o model.zse --quant nf4 --group-size 256
```
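Why group size matters, in sketch form: each group of consecutive weights gets its own scale, so an outlier only degrades its own group rather than the whole tensor. Smaller groups mean more scales to store, hence larger files. Illustrative code, not zse internals:

```python
import numpy as np

def quantize_grouped_int4(w: np.ndarray, group_size: int = 128):
    # Assumes the weight count is divisible by group_size.
    groups = w.ravel().reshape(-1, group_size)
    # One absmax-derived scale per group of `group_size` weights.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)  # int4 codes + per-group scales
```

Halving the group size from 128 to 64 doubles the number of stored scales, which is exactly the quality-for-size trade the `--group-size` flag exposes.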
### Calibration Dataset

Use calibration data for optimal quantization ranges:

```bash
# Use built-in calibration (default)
zse convert model -o model.zse --calibrate

# Use custom calibration data
zse convert model -o model.zse --calibrate-data ./prompts.txt

# Skip calibration (faster, slightly lower quality)
zse convert model -o model.zse --no-calibrate
```
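One common calibration strategy, sketched here as an assumption about the general technique rather than zse's implementation, is percentile clipping: derive the quantization range from the distribution of values observed on calibration data instead of the raw absmax, trading a little clipping error on outliers for finer resolution inside the range.

```python
import numpy as np

def calibrated_scale(calib_values: np.ndarray,
                     percentile: float = 99.9, max_level: int = 7) -> float:
    # Clip at a high percentile of observed magnitudes rather than the
    # absolute max, so a handful of outliers cannot stretch the grid.
    clip = float(np.percentile(np.abs(calib_values), percentile))
    return clip / max_level  # step size for a symmetric int grid
```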
### Mixed Precision

Keep critical layers at higher precision:

```bash
# Keep embedding and output layers at FP16
zse convert model -o model.zse --quant nf4 --mixed-precision

# Keep specific layers at higher precision
zse convert model -o model.zse --quant nf4 --keep-fp16 "embed,lm_head"
```
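Conceptually, `--keep-fp16` amounts to a name filter over layers. A hypothetical sketch of the matching rule (`KEEP_FP16` and `should_quantize` are illustrative names, not zse API):

```python
KEEP_FP16 = ["embed", "lm_head"]  # patterns as passed via --keep-fp16

def should_quantize(layer_name: str) -> bool:
    # A layer stays at FP16 if its name contains any listed substring.
    return not any(pattern in layer_name for pattern in KEEP_FP16)

assert should_quantize("model.layers.0.self_attn.q_proj")
assert not should_quantize("model.embed_tokens")
```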
### Batch Conversion

Convert multiple models in a script:

```bash
#!/bin/bash

MODELS=(
  "Qwen/Qwen2.5-7B-Instruct"
  "Qwen/Qwen2.5-14B-Instruct"
  "meta-llama/Llama-3.1-8B-Instruct"
)

for model in "${MODELS[@]}"; do
  name=$(basename "$model" | tr '[:upper:]' '[:lower:]')
  zse convert "$model" -o "./models/$name.zse" --quant nf4
done
```

Python API for programmatic conversion:
```python
from zllm_zse import convert_model

# Convert with options
convert_model(
    source="Qwen/Qwen2.5-7B-Instruct",
    output="qwen-7b.zse",
    quant="nf4",
    group_size=128,
    calibrate=True,
)

# Convert multiple models
models = ["model-a", "model-b", "model-c"]
for model in models:
    convert_model(model, f"{model}.zse")
```

## Quality Validation

Verify quantized model quality with built-in benchmarks:
```bash
# Run perplexity benchmark
zse benchmark qwen-7b.zse --metric perplexity

# Compare with original
zse benchmark qwen-7b.zse --compare Qwen/Qwen2.5-7B-Instruct

# Full evaluation suite
zse benchmark qwen-7b.zse --eval mmlu,hellaswag,arc
```

Example output:

```
┌─────────────────────────────────────────────────────────┐
│ Model: qwen-7b.zse (NF4, 4.2 GB)                        │
├─────────────────────────────────────────────────────────┤
│ Perplexity:     5.42  (original: 5.38,  Δ +0.7%)        │
│ MMLU:           64.2% (original: 64.8%, Δ -0.6%)        │
│ HellaSwag:      78.1% (original: 78.9%, Δ -0.8%)        │
│ ARC-Challenge:  52.3% (original: 53.1%, Δ -0.8%)        │
└─────────────────────────────────────────────────────────┘
```
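For reference, perplexity is the exponential of the average negative log-likelihood per token, so the Δ +0.7% above means the quantized model assigns almost the same probabilities as the FP16 original. A minimal sketch of the metric itself; `token_logprobs` is a hypothetical list of per-token log-probabilities, not output of the zse CLI:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    # exp(mean negative log-likelihood) over a held-out corpus
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```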