
zQuantize

Convert models to the optimized .zse format, shrinking a 7B model from 14 GB (FP16) to ~4.2 GB (NF4) with minimal quality loss.

Overview

zQuantize converts transformer models from HuggingFace, safetensors, or GGUF sources to ZSE's native format with configurable quantization.

  • 4-bit and 8-bit quantization with calibration
  • NF4 (NormalFloat4) for best quality
  • Group quantization for accuracy preservation
  • Mixed precision for critical layers
  • CUDA-accelerated conversion

Quick Start

1. Convert a model

Run the convert command with default settings (NF4):

bash
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse

2. Verify the conversion

Check model info and size:

bash
zse info qwen-7b.zse

3. Test the model

Run a quick inference test:

bash
zse chat qwen-7b.zse -p "Hello, how are you?"

Conversion time depends on model size and GPU availability. A 7B model takes ~2 minutes on GPU or ~15 minutes on CPU.

Quantization Types

Choose the right quantization type for your memory/quality tradeoff:

NF4 (Default)

Best quality 4-bit — normalized float distribution

INT4

Standard 4-bit — fastest conversion speed

INT8

8-bit integer — higher quality, larger files

FP16

Half precision — maximum quality, largest files

NF4 Quantization

NormalFloat4 is optimized for the weight distribution of neural networks:

bash
zse convert model-id -o model.zse --quant nf4

  • Asymmetric quantization grid
  • Optimal for normally-distributed weights
  • Industry-leading 4-bit quality
  • Default for most models
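
For intuition, NF4's code points can be sketched as quantiles of a standard normal distribution, in the spirit of the NormalFloat construction from the QLoRA paper. The sketch below is illustrative and simplified; real NF4 uses an asymmetric grid with an exact zero code point, and ZSE's exact grid may differ:

python
# Illustrative only: a 16-level NF4-style grid built from normal quantiles.
# Real NF4 is asymmetric and reserves an exact zero; ZSE's grid may differ.
import numpy as np
from scipy.stats import norm

def nf4_like_levels(k: int = 16) -> np.ndarray:
    # Evenly spaced quantiles of N(0, 1), avoiding the infinite tails,
    # rescaled so the grid spans [-1, 1].
    probs = np.linspace(0.01, 0.99, k)
    levels = norm.ppf(probs)
    return levels / np.abs(levels).max()

print(nf4_like_levels())  # dense near 0, sparse in the tails

Because most transformer weights cluster near zero, packing more code points there wastes fewer of the 16 available levels than a uniform integer grid does.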

INT4 Quantization

Standard symmetric 4-bit integer quantization:

bash
zse convert model-id -o model.zse --quant int4

  • Symmetric quantization grid
  • Faster conversion than NF4
  • Compatible with more hardware
  • Slightly lower quality than NF4
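
A minimal sketch of what symmetric 4-bit quantization does to a weight tensor (illustrative only, not ZSE's actual kernel):

python
# Illustrative symmetric int4 quantize/dequantize round trip.
import numpy as np

def quantize_int4(w: np.ndarray):
    # Symmetric grid: zero-point is always 0, one scale per tensor.
    scale = np.abs(w).max() / 7.0  # use the symmetric range [-7, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = (np.random.randn(4096) * 0.02).astype(np.float32)
q, s = quantize_int4(w)
print("max abs error:", float(np.abs(dequantize(q, s) - w).max()))

The uniform grid is cheap to encode and decode (a single multiply), which is where the conversion-speed and hardware-compatibility advantages come from.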

INT8 Quantization

8-bit quantization for higher quality:

bash
zse convert model-id -o model.zse --quant int8

FP16 (No Quantization)

Convert without quantization — useful for fine-tuning or maximum quality:

bash
zse convert model-id -o model.zse --quant fp16

Type   7B Size   Quality   Speed
nf4    4.2 GB    ★★★★☆     ★★★★☆
int4   4.0 GB    ★★★☆☆     ★★★★★
int8   7.5 GB    ★★★★★     ★★★☆☆
fp16   14 GB     ★★★★★     ★★☆☆☆

Advanced Options

Group Size

Control quantization granularity. Smaller groups = better quality, larger files:

bash
# Default group size (128)
zse convert model -o model.zse --quant nf4

# Smaller groups for better quality
zse convert model -o model.zse --quant nf4 --group-size 64

# Larger groups for smaller files
zse convert model -o model.zse --quant nf4 --group-size 256
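
The file-size effect comes from per-group metadata: each group carries its own scale, so smaller groups track local weight statistics better but store more overhead. A back-of-envelope calculation, assuming one fp16 scale per group (ZSE's exact metadata layout may differ):

python
# Effective bits per weight for grouped 4-bit quantization,
# assuming one 16-bit (fp16) scale per group. ZSE's metadata may differ.
for group_size in (64, 128, 256):
    bits = 4 + 16 / group_size
    print(f"group_size={group_size}: {bits:.3f} bits/weight")
# group_size=64: 4.250 bits/weight
# group_size=128: 4.125 bits/weight
# group_size=256: 4.062 bits/weight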

Calibration Dataset

Use calibration data for optimal quantization ranges:

bash
# Use built-in calibration (default)
zse convert model -o model.zse --calibrate

# Use custom calibration data
zse convert model -o model.zse --calibrate-data ./prompts.txt

# Skip calibration (faster, slightly lower quality)
zse convert model -o model.zse --no-calibrate

Custom calibration with domain-specific data can improve quality for specialized tasks.
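
Conceptually, calibration runs sample text through the model and records value statistics, so quantization ranges can be clipped to where the data actually lives rather than stretched by rare outliers. A simplified percentile-based sketch (ZSE's actual pipeline is more involved):

python
# Simplified percentile-based range calibration (not ZSE's actual pipeline).
# Clipping at a high percentile instead of the absolute max keeps the grid
# from being stretched by rare outlier values.
import numpy as np

def calibrated_scale(samples: np.ndarray, percentile: float = 99.9) -> float:
    clip = np.percentile(np.abs(samples), percentile)
    return clip / 7.0  # map the clipped range onto a 4-bit grid [-7, 7]

samples = np.concatenate([np.random.randn(10_000), [50.0]])  # one outlier
print("max-based scale: ", float(np.abs(samples).max() / 7.0))
print("calibrated scale:", calibrated_scale(samples))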

Mixed Precision

Keep critical layers at higher precision:

bash
# Keep embedding and output layers at FP16
zse convert model -o model.zse --quant nf4 --mixed-precision

# Keep specific layers at higher precision
zse convert model -o model.zse --quant nf4 --keep-fp16 "embed,lm_head"
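
Under the hood, --keep-fp16 amounts to matching layer names against the given patterns and skipping quantization for matches. A hypothetical sketch of that selection logic (the layer names below are typical HuggingFace-style names, used purely for illustration):

python
# Hypothetical sketch of --keep-fp16 pattern matching; the layer names are
# illustrative HuggingFace-style names, not ZSE internals.
def keep_fp16(layer_name: str, patterns: list[str]) -> bool:
    return any(p in layer_name for p in patterns)

patterns = ["embed", "lm_head"]
for name in [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "lm_head.weight",
]:
    print(name, "-> fp16" if keep_fp16(name, patterns) else "-> nf4")

Embeddings and the output head are small relative to the rest of the model, which is why they are the usual candidates for staying at FP16.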

Batch Conversion

Convert multiple models in a script:

bash
#!/bin/bash
MODELS=(
  "Qwen/Qwen2.5-7B-Instruct"
  "Qwen/Qwen2.5-14B-Instruct"
  "meta-llama/Llama-3.1-8B-Instruct"
)

for model in "${MODELS[@]}"; do
  name=$(basename "$model" | tr '[:upper:]' '[:lower:]')
  zse convert "$model" -o "./models/$name.zse" --quant nf4
done

Python API for programmatic conversion:

python
from zllm_zse import convert_model

# Convert with options
convert_model(
    source="Qwen/Qwen2.5-7B-Instruct",
    output="qwen-7b.zse",
    quant="nf4",
    group_size=128,
    calibrate=True,
)

# Convert multiple models
models = ["model-a", "model-b", "model-c"]
for model in models:
    convert_model(model, f"{model}.zse")

Quality Validation

Verify quantized model quality with built-in benchmarks:

bash
# Run perplexity benchmark
zse benchmark qwen-7b.zse --metric perplexity

# Compare with original
zse benchmark qwen-7b.zse --compare Qwen/Qwen2.5-7B-Instruct

# Full evaluation suite
zse benchmark qwen-7b.zse --eval mmlu,hellaswag,arc

Example output:

text
┌─────────────────────────────────────────────────────────┐
│ Model: qwen-7b.zse (NF4, 4.2 GB)                        │
├─────────────────────────────────────────────────────────┤
│ Perplexity:    5.42  (original: 5.38, Δ +0.7%)          │
│ MMLU:          64.2% (original: 64.8%, Δ -0.6%)         │
│ HellaSwag:     78.1% (original: 78.9%, Δ -0.8%)         │
│ ARC-Challenge: 52.3% (original: 53.1%, Δ -0.8%)         │
└─────────────────────────────────────────────────────────┘

Less than 1% quality loss is typical with NF4 quantization.
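
For reference, perplexity is the exponential of the average per-token negative log-likelihood, so the small absolute delta above corresponds to a small relative change. A minimal sketch of the relationship:

python
# Perplexity = exp(mean negative log-likelihood per token).
import math

def perplexity(token_log_probs: list[float]) -> float:
    # token_log_probs: natural-log probabilities the model assigned each token
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# The delta reported above: 5.42 vs 5.38 is a +0.7% relative increase.
print(f"{(5.42 - 5.38) / 5.38:+.1%}")  # +0.7%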