Core Concepts

Model Formats

Understanding different model formats supported by ZSE: .zse, GGUF, and safetensors.

Overview

ZSE supports multiple model formats, each with different tradeoffs for loading speed, memory usage, and compatibility.

.zse

Native format — fastest loading, best optimization

GGUF

Ollama/llama.cpp compatible — good portability

Safetensors

HuggingFace format — universal compatibility

.zse Format

The native ZSE format offers the fastest loading times and best memory efficiency.

  • Pre-quantized weights — no runtime quantization
  • Memory-mapped loading — instant access
  • Optimized tensor layout — sequential reads
  • Built-in tokenizer and config
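Memory-mapped loading is what makes the cold start near-instant: the OS maps the weight file into the process's address space, and pages are read from disk only when a tensor is actually touched. A minimal sketch of the idea using numpy's `memmap` (illustrative only, not ZSE's actual loader; the file name and dtype are assumptions):

```python
import numpy as np

# Write a small "weight shard" to disk (stand-in for tensors/layer_0.bin).
weights = np.arange(1024, dtype=np.float32)
weights.tofile("layer_0.bin")

# Memory-map the file: no bulk read happens here, so the mapping is instant
# regardless of file size.
mapped = np.memmap("layer_0.bin", dtype=np.float32, mode="r")

# Only the pages backing this slice are faulted in from disk on access.
row = mapped[256:260]
print(row)  # [256. 257. 258. 259.]
```

The same principle is why a multi-gigabyte model can begin serving before most of its weights have ever been read.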

Structure:

```text
model.zse
├── header.json      # Model metadata
├── config.json      # Model configuration
├── tokenizer/       # Tokenizer files
│   ├── vocab.json
│   └── merges.txt
└── tensors/         # Quantized weights
    ├── embed.bin
    ├── layer_0.bin
    ├── layer_1.bin
    └── ...
```

Creating .zse files:

```bash
# From a HuggingFace model
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse

# From local safetensors
zse convert ./my-model/ -o my-model.zse

# With specific quantization
zse convert model -o model.zse --quant nf4
```
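Pre-quantization means the expensive compression step runs once at convert time instead of on every load. ZSE's actual nf4 scheme is not shown here; as a hedged illustration of the general idea, a simple per-tensor absmax 4-bit quantizer (all names are ours, not part of ZSE):

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Absmax-scale weights into signed 4-bit integer levels [-7, 7]."""
    scale = float(np.abs(w).max()) / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the 4-bit levels."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1024).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale)

# Rounding error per weight is at most half a quantization step.
max_err = float(np.abs(w - w_hat).max())
assert max_err <= scale / 2 + 1e-6
```

Storing `q` and `scale` on disk (as the `tensors/` directory does) is what lets the loader skip this work entirely at serve time.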

GGUF Format

GGUF (GPT-Generated Unified Format) is used by llama.cpp and Ollama. ZSE can import GGUF files directly.

Install GGUF support: pip install zllm-zse[gguf]

Loading GGUF models:

```bash
# Serve GGUF directly
zse serve ./qwen-7b-q4_k_m.gguf

# Convert GGUF to .zse for faster loading
zse convert ./model.gguf -o model.zse
```

Supported GGUF quantization types:

| Type   | Bits | Description                   |
|--------|------|-------------------------------|
| Q4_0   | 4    | Basic 4-bit quantization      |
| Q4_K_M | 4    | K-quants medium (recommended) |
| Q5_K_M | 5    | Higher quality                |
| Q8_0   | 8    | Best quality                  |
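Before serving, it can be useful to sanity-check that a file really is GGUF. Every GGUF file begins with the 4-byte ASCII magic `GGUF`, followed by a little-endian uint32 format version (3 for current files). A small Python sketch (the helper name is ours, not part of ZSE):

```python
import struct

def read_gguf_version(path: str) -> int:
    """Return the GGUF format version, or raise if the magic is wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        return version
```

For example, `read_gguf_version("./qwen-7b-q4_k_m.gguf")` should return `3` for a recent quantization.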

Safetensors

Safetensors is the standard format for HuggingFace models. ZSE can load safetensors directly with runtime quantization.

```bash
# Load from HuggingFace Hub
zse serve Qwen/Qwen2.5-7B-Instruct

# Load local safetensors
zse serve ./my-local-model/
```

Loading safetensors requires runtime quantization, which adds 30-60 seconds to cold start time. For production, convert to .zse format.
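Part of why safetensors is so portable is its simple layout: the first 8 bytes are a little-endian uint64 giving the length of a JSON header, which maps tensor names to their dtype, shape, and byte offsets. That makes it possible to list a checkpoint's tensors without reading any weight data; a sketch (in practice the `safetensors` library does this for you):

```python
import json
import struct

def safetensors_shapes(path: str) -> dict:
    """Read only the JSON header of a .safetensors file and return
    a mapping of tensor name -> shape."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata block
    return {name: t["shape"] for name, t in header.items()}
```

Because the header is plain JSON, this works on any safetensors file regardless of which framework produced it.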

Comparison

Choose the right format for your use case:

| Feature         | .zse       | GGUF              | Safetensors |
|-----------------|------------|-------------------|-------------|
| Cold Start (7B) | 3.9s       | ~15s              | ~45s        |
| Pre-quantized   | ✓          | ✓                 | ✗           |
| Memory-mapped   | ✓          | ✓                 | ✗           |
| Portability     | ZSE only   | Ollama, llama.cpp | Universal   |
| Best For        | Production | Cross-platform    | Development |

Recommendation

Use .zse for production deployments where cold start time matters. Use GGUF if you need compatibility with Ollama. Use safetensors for quick experimentation.