Quantization
Understanding ZSE's quantization options and how they affect performance, memory, and output quality.
Overview
Quantization reduces model precision from 16-bit floats to smaller formats like 4-bit integers. This dramatically reduces memory usage and improves inference speed with minimal quality loss.
- 63-72% memory reduction with INT4
- 11.6× faster cold starts
- < 1% quality degradation
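As a back-of-envelope check on the memory figure, compare raw weight storage at 2 bytes (FP16) versus half a byte (INT4) per parameter. This is a sketch with illustrative numbers only; real savings land below the raw 75% because embeddings, norms, and scaling factors are kept at higher precision:

```python
# Rough memory estimate for a 7B-parameter model (illustrative only).
params = 7e9
fp16_gb = params * 2 / 1e9    # 2 bytes per weight -> 14.0 GB
int4_gb = params * 0.5 / 1e9  # 4 bits per weight -> 3.5 GB
reduction = 1 - int4_gb / fp16_gb
print(f"{reduction:.0%} reduction on raw weights")  # 75% reduction on raw weights
```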
How It Works
Traditional inference engines quantize models at runtime, which takes 30-60 seconds for 7B models. ZSE pre-quantizes models to the .zse format, eliminating this overhead.
The .zse format stores:
- Pre-quantized weights in INT4/NF4 format
- Per-tensor scaling factors for accuracy
- Model architecture and tokenizer
- Optimized memory layout for fast loading
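A minimal sketch of what per-tensor quantization with a scaling factor looks like for INT4. This is an illustration in NumPy, not ZSE's actual implementation, and the function names are hypothetical:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor INT4: map floats to integers in [-8, 7]."""
    scale = np.abs(w).max() / 7.0          # one scaling factor per tensor
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float):
    """Recover approximate floats from the stored integers."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
# Rounding error per weight is at most half a quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

Storing `q` plus one `scale` per tensor is what makes the format compact: the expensive float weights are replaced by 4-bit codes and a single float.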
Quantization Types
ZSE supports multiple quantization formats:
| Type | Bits | Memory | Quality | Use Case |
|---|---|---|---|---|
| int4 | 4 | Best | Good | Default, production |
| nf4 | 4 | Best | Better | Higher quality 4-bit |
| int8 | 8 | Good | Best | Quality-sensitive tasks |
| fp16 | 16 | Baseline | Perfect | Reference, unlimited VRAM |
Recommendation
Use nf4 (Normalized Float 4) for the best balance of quality and memory. It uses a non-linear quantization scheme that better preserves model accuracy.

Converting Models
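The non-linear scheme can be pictured as a small codebook: instead of 16 evenly spaced levels, NF4-style formats place the 16 representable values at quantiles of a normal distribution, matching the bell-shaped distribution of trained weights. A stdlib-only sketch (the codebook construction here is an illustrative approximation, not ZSE's exact table, and the function names are hypothetical):

```python
from statistics import NormalDist

# Build 16 levels at normal-distribution quantiles, normalized to [-1, 1].
nd = NormalDist()
levels = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
m = max(abs(v) for v in levels)
levels = [v / m for v in levels]

def nf4_encode(x: float, scale: float) -> int:
    """Index of the nearest codebook level for x (assumes |x| <= scale)."""
    return min(range(16), key=lambda i: abs(levels[i] - x / scale))

def nf4_decode(idx: int, scale: float) -> float:
    return levels[idx] * scale

# Levels cluster near zero, where most weights live: the gap between the
# two middle levels is much smaller than the gap at the edges.
assert (levels[8] - levels[7]) < (levels[15] - levels[14])
```

Because small weights get finer resolution than large outliers, NF4 loses less accuracy than a uniform 4-bit grid at the same memory cost.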
Convert any HuggingFace model to .zse format:
```bash
# INT4 (default, smallest size)
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse

# NF4 (recommended, better quality)
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b-nf4.zse --quant nf4

# INT8 (larger but higher quality)
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b-int8.zse --quant int8
```

Conversion time depends on model size:
| Model Size | Conversion Time |
|---|---|
| 7B | ~20 seconds |
| 14B | ~45 seconds |
| 32B | ~2 minutes |
Quality Comparison
Benchmark results on common evaluation tasks (higher is better):
| Benchmark | FP16 | INT8 | NF4 | INT4 |
|---|---|---|---|---|
| MMLU | 68.2 | 67.9 | 67.5 | 66.8 |
| HumanEval | 61.0 | 60.4 | 59.8 | 58.5 |
| GSM8K | 82.5 | 82.1 | 81.4 | 79.8 |
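The "< 1%" degradation figure can be checked against this table for INT8, where the relative drop versus FP16 stays under one percent on all three benchmarks (a quick calculation over the numbers above):

```python
fp16 = {"MMLU": 68.2, "HumanEval": 61.0, "GSM8K": 82.5}
int8 = {"MMLU": 67.9, "HumanEval": 60.4, "GSM8K": 82.1}

# Relative drop from FP16 on each benchmark.
drops = {k: (fp16[k] - int8[k]) / fp16[k] for k in fp16}
assert all(d < 0.01 for d in drops.values())  # every drop is under 1%
```

The 4-bit formats trade a little more accuracy for memory: NF4 sits between INT8 and INT4 on every benchmark in the table, consistent with the recommendation above.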