Core Concepts

Quantization

Understanding ZSE's quantization options and how they affect performance, memory, and output quality.

Overview

Quantization reduces model precision from 16-bit floats to smaller formats like 4-bit integers. This dramatically reduces memory usage and improves inference speed with minimal quality loss.

  • 63-72% memory reduction with INT4
  • 11.6× faster cold starts
  • < 1% quality degradation
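The memory-reduction figure follows from simple bit arithmetic. A minimal sketch (illustrative only): a raw FP16 → INT4 cut is 75%, and overhead on the quantized side, such as scaling factors and layers kept at higher precision, pulls the realized saving into the quoted 63-72% range.

```python
# Back-of-envelope arithmetic for quantization memory savings (illustrative).
def memory_reduction(orig_bits: int, quant_bits: int, overhead: float = 0.0) -> float:
    """Fraction of memory saved, with `overhead` as a relative surcharge
    on the quantized side (scale factors, unquantized layers, metadata)."""
    return 1 - (quant_bits / orig_bits) * (1 + overhead)

raw = memory_reduction(16, 4)             # 0.75 with no overhead
realistic = memory_reduction(16, 4, 0.2)  # ~0.70 with ~20% overhead
```

With ~20% overhead this lands at 70%, matching the 14 GB → 4.2 GB example below.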

How It Works

Traditional inference engines quantize models at runtime, which takes 30-60 seconds for 7B models. ZSE pre-quantizes models to the .zse format, eliminating this overhead.

Original Model (14 GB, FP16) → zQuantize (pre-quantize) → .zse File (4.2 GB, INT4)

The .zse format stores:

  • Pre-quantized weights in INT4/NF4 format
  • Per-tensor scaling factors for accuracy
  • Model architecture and tokenizer
  • Optimized memory layout for fast loading
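Per-tensor scaling factors let integer codes reconstruct the original float range. A minimal sketch of symmetric per-tensor INT4 quantization, for intuition only (the actual .zse encoding is internal to ZSE):

```python
# Symmetric per-tensor INT4 quantization sketch (illustrative, not ZSE internals).
def quantize_int4(weights):
    # One scale per tensor maps the largest magnitude onto the INT4 range [-8, 7].
    scale = max(abs(w) for w in weights) / 7.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_int4(codes, scale):
    # Reconstruction: integer code times the stored per-tensor scale.
    return [c * scale for c in codes]

w = [0.91, -1.40, 0.02, 0.35]
codes, scale = quantize_int4(w)
w_hat = dequantize_int4(codes, scale)
```

Rounding error is bounded by half the scale, which is why a well-chosen per-tensor scale preserves accuracy.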

Quantization Types

ZSE supports multiple quantization formats:

| Type | Bits | Memory   | Quality | Use Case                 |
|------|------|----------|---------|--------------------------|
| int4 | 4    | Best     | Good    | Default, production      |
| nf4  | 4    | Best     | Better  | Higher quality 4-bit     |
| int8 | 8    | Good     | Best    | Quality-sensitive tasks  |
| fp16 | 16   | Baseline | Perfect | Reference, unlimited VRAM |

Recommendation

Use nf4 (4-bit NormalFloat) for the best balance of quality and memory. It uses a non-linear quantization scheme whose levels follow a normal distribution, which better preserves model accuracy because weight values cluster near zero.
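The idea behind non-linear 4-bit quantization can be sketched as a nearest-level lookup against 16 unevenly spaced levels. Note the level spacing below is a stand-in (a cubed uniform grid), not the actual NF4 codebook; like NF4, it concentrates levels near zero where most weight values live.

```python
# Conceptual sketch of non-uniform 4-bit quantization in the spirit of NF4.
# LEVELS is illustrative: cubing a uniform grid clusters levels near zero,
# loosely mimicking NF4's normal-distribution-quantile spacing.
LEVELS = [(-1 + 2 * i / 15) ** 3 for i in range(16)]

def quantize_nonuniform(weights):
    # Normalize by the tensor's absmax, then snap each value to the nearest level.
    absmax = max(abs(w) for w in weights)
    codes = [min(range(16), key=lambda k: abs(w / absmax - LEVELS[k]))
             for w in weights]
    return codes, absmax

def dequantize_nonuniform(codes, absmax):
    return [LEVELS[c] * absmax for c in codes]
```

Because levels are dense near zero, small weights (the common case) are reconstructed more precisely than under uniform INT4 spacing.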

Converting Models

Convert any HuggingFace model to .zse format:

```bash
# INT4 (default, smallest size)
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse

# NF4 (recommended, better quality)
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b-nf4.zse --quant nf4

# INT8 (larger but higher quality)
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b-int8.zse --quant int8
```
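To anticipate disk usage before converting, a rough estimate follows from parameter count and bit width. This heuristic is derived from the 14 GB → 4.2 GB example above (7B params at INT4 plus ~20% overhead for scales, tokenizer, and metadata); it is a ballpark, not a guarantee of actual .zse output size.

```python
# Rough .zse output-size estimate (heuristic, assumes ~20% format overhead).
BITS = {"int4": 4, "nf4": 4, "int8": 8, "fp16": 16}

def estimate_zse_gb(n_params_billion: float, quant: str = "int4") -> float:
    raw = n_params_billion * BITS[quant] / 8  # 1B params at 8 bits ~ 1 GB
    return round(raw * 1.2, 1)                # ~20% overhead, rounded to 0.1 GB
```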

Conversion time depends on model size:

| Model Size | Conversion Time |
|------------|-----------------|
| 7B         | ~20 seconds     |
| 14B        | ~45 seconds     |
| 32B        | ~2 minutes      |

Quality Comparison

Benchmark results on common evaluation tasks (higher is better):

| Benchmark | FP16 | INT8 | NF4  | INT4 |
|-----------|------|------|------|------|
| MMLU      | 68.2 | 67.9 | 67.5 | 66.8 |
| HumanEval | 61.0 | 60.4 | 59.8 | 58.5 |
| GSM8K     | 82.5 | 82.1 | 81.4 | 79.8 |

These benchmarks are for Qwen 2.5 7B. Quality retention varies by model architecture, but most modern LLMs maintain 95%+ of their original capability with NF4 quantization.
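The retention figures implied by the table above can be computed directly (quantized score as a percentage of the FP16 baseline):

```python
# Quality retention for NF4, computed from the benchmark table above.
fp16 = {"MMLU": 68.2, "HumanEval": 61.0, "GSM8K": 82.5}
nf4 = {"MMLU": 67.5, "HumanEval": 59.8, "GSM8K": 81.4}

retention = {k: round(100 * nf4[k] / fp16[k], 1) for k in fp16}
# Every benchmark retains ~98-99% of the FP16 score, comfortably above 95%.
```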