Core Concepts

Model Formats

Understanding different model formats supported by ZSE: .zse, GGUF, and safetensors.

Overview

ZSE supports multiple model formats, each with different tradeoffs for loading speed, memory usage, and compatibility.

.zse

Native format — fastest loading, best optimization

GGUF

Ollama/llama.cpp compatible — good portability

Safetensors

HuggingFace format — universal compatibility

.zse Format

The native ZSE format offers the fastest loading times and best memory efficiency.

  • Pre-quantized weights — no runtime quantization
  • Memory-mapped loading — instant access
  • Optimized tensor layout — sequential reads
  • Built-in tokenizer and config
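Memory-mapped loading is what makes the cold start near-instant: the OS maps the weight file into the process's address space, and pages are read from disk only when a tensor is actually touched. A minimal sketch of the idea using numpy's `memmap` (illustrative only, not ZSE's actual loader; the file name and dtype are assumptions):

```python
import numpy as np

# Write a small "weight shard" to disk (stand-in for tensors/layer_0.bin).
weights = np.arange(1024, dtype=np.float32)
weights.tofile("layer_0.bin")

# Memory-map the file: no bulk read happens here, so the mapping is instant
# regardless of file size.
mapped = np.memmap("layer_0.bin", dtype=np.float32, mode="r")

# Only the pages backing this slice are faulted in from disk on access.
row = mapped[256:260]
print(row)  # [256. 257. 258. 259.]
```

The same principle is why a multi-gigabyte model can begin serving before most of its weights have ever been read.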

Structure:

```text
model.zse
├── header.json      # Model metadata
├── config.json      # Model configuration
├── tokenizer/       # Tokenizer files
│   ├── vocab.json
│   └── merges.txt
└── tensors/         # Quantized weights
    ├── embed.bin
    ├── layer_0.bin
    ├── layer_1.bin
    └── ...
```

Creating .zse files:

```bash
# From a HuggingFace model
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse

# From local safetensors
zse convert ./my-model/ -o my-model.zse

# With specific quantization
zse convert model -o model.zse --quant nf4
```
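Pre-quantization means the expensive compression step runs once at convert time instead of on every load. ZSE's actual nf4 scheme is not shown here; as a hedged illustration of the general idea, a simple per-tensor absmax 4-bit quantizer (all names are ours, not part of ZSE):

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Absmax-scale weights into signed 4-bit integer levels [-7, 7]."""
    scale = float(np.abs(w).max()) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the 4-bit levels."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1024).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)

# Rounding error per weight is at most half a quantization step.
max_err = float(np.abs(w - w_hat).max())
assert max_err <= scale / 2 + 1e-6
```

Storing `q` and `scale` on disk (as the `tensors/` directory does) is what lets the loader skip this work entirely at serve time.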

GGUF Format

GGUF (GPT-Generated Unified Format) is used by llama.cpp and Ollama. ZSE can import GGUF files directly.

Install GGUF support: pip install zllm-zse[gguf]

Loading GGUF models:

```bash
# Serve GGUF directly
zse serve ./qwen-7b-q4_k_m.gguf

# Convert GGUF to .zse for faster loading
zse convert ./model.gguf -o model.zse
```

Supported GGUF quantization types:

| Type   | Bits | Description                   |
|--------|------|-------------------------------|
| Q4_0   | 4    | Basic 4-bit quantization      |
| Q4_K_M | 4    | K-quants medium (recommended) |
| Q5_K_M | 5    | Higher quality                |
| Q8_0   | 8    | Best quality                  |
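Before serving, it can be useful to sanity-check that a file really is GGUF. Every GGUF file begins with the 4-byte ASCII magic `GGUF`, followed by a little-endian uint32 format version (3 for current files). A small Python sketch (the helper name is ours, not part of ZSE):

```python
import struct

def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise if the magic is wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        return version
```

For example, `read_gguf_version("./qwen-7b-q4_k_m.gguf")` should return `3` for a recent quantization.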

Safetensors

Safetensors is the standard format for HuggingFace models. ZSE can load safetensors directly with runtime quantization.

```bash
# Load from HuggingFace Hub
zse serve Qwen/Qwen2.5-7B-Instruct

# Load local safetensors
zse serve ./my-local-model/
```

Loading safetensors requires runtime quantization, which adds 30-60 seconds to cold start time. For production, convert to .zse format.
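Part of why safetensors is so portable is its simple layout: the first 8 bytes are a little-endian uint64 giving the length of a JSON header, which maps tensor names to their dtype, shape, and byte offsets. That makes it possible to list a checkpoint's tensors without reading any weight data; a sketch (in practice the `safetensors` library does this for you):

```python
import json
import struct

def safetensors_shapes(path: str) -> dict:
    """Read only the JSON header of a .safetensors file and return
    a mapping of tensor name -> shape."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata block
    return {name: t["shape"] for name, t in header.items()}
```

Because the header is plain JSON, this works on any safetensors file regardless of which framework produced it.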

Comparison

Choose the right format for your use case:

| Feature         | .zse       | GGUF              | Safetensors |
|-----------------|------------|-------------------|-------------|
| Cold Start (7B) | 3.9s       | ~15s              | ~45s        |
| Pre-quantized   | ✓          | ✓                 | ✗           |
| Memory-mapped   | ✓          | ✓                 | ✗           |
| Portability     | ZSE only   | Ollama, llama.cpp | Universal   |
| Best For        | Production | Cross-platform    | Development |

Recommendation

Use .zse for production deployments where cold start time matters. Use GGUF if you need compatibility with Ollama. Use safetensors for quick experimentation.