Quantization
Understanding ZSE's quantization options and how they affect performance, memory, and output quality.
Overview
Quantization reduces model precision from 16-bit floats to smaller formats like 4-bit integers. This dramatically reduces memory usage and improves inference speed with minimal quality loss.
- 63-72% memory reduction with INT4
- 11.6× faster cold starts
- < 1% quality degradation
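As a back-of-envelope check on the memory figure, compare raw weight storage at 2 bytes (FP16) versus half a byte (INT4) per parameter. This is a sketch with illustrative numbers only; real savings land below the raw 75% because embeddings, norms, and scaling factors are kept at higher precision:

```python
# Rough memory estimate for a 7B-parameter model (illustrative only).
params = 7e9
fp16_gb = params * 2 / 1e9    # 2 bytes per weight -> 14.0 GB
int4_gb = params * 0.5 / 1e9  # 4 bits per weight -> 3.5 GB
reduction = 1 - int4_gb / fp16_gb
print(f"{reduction:.0%} reduction on raw weights")  # 75% reduction on raw weights
```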
How It Works
Traditional inference engines quantize models at runtime, which takes 30-60 seconds for 7B models. ZSE pre-quantizes models to the .zse format, eliminating this overhead.
The .zse format stores:
- Pre-quantized weights in INT4/NF4 format
- Per-tensor scaling factors for accuracy
- Model architecture and tokenizer
- Optimized memory layout for fast loading
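A minimal sketch of what per-tensor quantization with a scaling factor looks like for INT4. This is an illustration in NumPy, not ZSE's actual implementation, and the function names are hypothetical:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor INT4: map floats to integers in [-8, 7]."""
    scale = np.abs(w).max() / 7.0          # one scaling factor per tensor
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float):
    """Recover approximate floats from the stored integers."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
# Rounding error per weight is at most half a quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

Storing `q` plus one `scale` per tensor is what makes the format compact: the expensive float weights are replaced by 4-bit codes and a single float.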
Quantization Types
ZSE supports multiple quantization formats:
| Type | Bits | Memory | Quality | Use Case |
|---|---|---|---|---|
| int4 | 4 | Best | Good | Default, production |
| nf4 | 4 | Best | Better | Higher quality 4-bit |
| int8 | 8 | Good | Best | Quality-sensitive tasks |
| fp16 | 16 | Baseline | Perfect | Reference, unlimited VRAM |
Recommendation
Use nf4 (Normalized Float 4) for the best balance of quality and memory. It uses a non-linear quantization scheme that better preserves model accuracy.

Converting Models
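The non-linear scheme can be pictured as a small codebook: instead of 16 evenly spaced levels, NF4-style formats place the 16 representable values at quantiles of a normal distribution, matching the bell-shaped distribution of trained weights. A stdlib-only sketch (the codebook construction here is an illustrative approximation, not ZSE's exact table, and the function names are hypothetical):

```python
from statistics import NormalDist

# Build 16 levels at normal-distribution quantiles, normalized to [-1, 1].
nd = NormalDist()
levels = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
m = max(abs(v) for v in levels)
levels = [v / m for v in levels]

def nf4_encode(x: float, scale: float) -> int:
    """Index of the nearest codebook level for x (assumes |x| <= scale)."""
    return min(range(16), key=lambda i: abs(levels[i] - x / scale))

def nf4_decode(idx: int, scale: float) -> float:
    return levels[idx] * scale

# Levels cluster near zero, where most weights live: the gap between the
# two middle levels is much smaller than the gap at the edges.
assert (levels[8] - levels[7]) < (levels[15] - levels[14])
```

Because small weights get finer resolution than large outliers, NF4 loses less accuracy than a uniform 4-bit grid at the same memory cost.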
Convert any HuggingFace model to .zse format:
```bash
# INT4 (default, smallest size)
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse

# NF4 (recommended, better quality)
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b-nf4.zse --quant nf4

# INT8 (larger but higher quality)
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b-int8.zse --quant int8
```

Conversion time depends on model size:
| Model Size | Conversion Time |
|---|---|
| 7B | ~20 seconds |
| 14B | ~45 seconds |
| 32B | ~2 minutes |
Quality Comparison
Benchmark results on common evaluation tasks (higher is better):
| Benchmark | FP16 | INT8 | NF4 | INT4 |
|---|---|---|---|---|
| MMLU | 68.2 | 67.9 | 67.5 | 66.8 |
| HumanEval | 61.0 | 60.4 | 59.8 | 58.5 |
| GSM8K | 82.5 | 82.1 | 81.4 | 79.8 |
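The "< 1%" degradation figure can be checked against this table for INT8, where the relative drop versus FP16 stays under one percent on all three benchmarks (a quick calculation over the numbers above):

```python
fp16 = {"MMLU": 68.2, "HumanEval": 61.0, "GSM8K": 82.5}
int8 = {"MMLU": 67.9, "HumanEval": 60.4, "GSM8K": 82.1}

# Relative drop from FP16 on each benchmark.
drops = {k: (fp16[k] - int8[k]) / fp16[k] for k in fp16}
assert all(d < 0.01 for d in drops.values())  # every drop is under 1%
```

The 4-bit formats trade a little more accuracy for memory: NF4 sits between INT8 and INT4 on every benchmark in the table, consistent with the recommendation above.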