GGUF Compatibility

Run GGUF models via llama.cpp backend for compatibility with existing model files.

Overview

ZSE includes full support for GGUF models via the llama-cpp-python backend. This lets you run existing GGUF models from Hugging Face or other sources while maintaining API compatibility with the rest of ZSE.

GGUF v2/v3

Full support for modern GGUF format versions

All Quant Types

Q4_K_M, Q5_K_M, Q8_0, and more

GPU Offloading

Configurable layer offloading to GPU

Key capabilities:

  • Parse GGUF v2/v3 format metadata
  • Support all GGML quantization types
  • Streaming and non-streaming generation
  • Chat completion support
  • GPU layer offloading configuration
  • Seamless integration with ZSE server

Supported Formats

ZSE supports all standard GGML quantization types found in GGUF files:

Format    Bits    Use Case
Q4_K_M    4-bit   Best balance of size and quality
Q5_K_M    5-bit   Higher quality, slightly larger
Q8_0      8-bit   Near-lossless quality
Q2_K      2-bit   Maximum compression
Q6_K      6-bit   High quality
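As a rule of thumb, file size scales with bits per weight. The sketch below is an approximation only: real K-quant files store per-block scales, so the effective bits per weight run slightly higher than the nominal number (roughly 4.5 for Q4_K_M).

```python
def approx_gguf_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size in GB: parameters x bits / 8, ignoring metadata overhead."""
    return n_params_billion * bits_per_weight / 8

# A 7B model at ~4.5 effective bits per weight (roughly Q4_K_M):
print(f"{approx_gguf_size_gb(7, 4.5):.1f} GB")  # 3.9 GB
```

This makes it easy to compare formats: the same 7B model at Q8_0 (~8.5 effective bits) roughly doubles the file size.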

Installation

CPU Only

bash
pip install llama-cpp-python

With CUDA (Recommended)

bash
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

With Metal (macOS)

bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

GPU acceleration is highly recommended. CPU-only inference is 10-50x slower.

Usage

Python API

python
from zse.gguf import GGUFWrapper, is_gguf_file

# Check if the file is GGUF format
if is_gguf_file("model-Q4_K_M.gguf"):
    # Create wrapper (matches IntelligenceOrchestrator API)
    wrapper = GGUFWrapper("model-Q4_K_M.gguf")
    wrapper.load()

    # Streaming generation
    for text in wrapper.generate("Hello, how are you?"):
        print(text, end="")

    # Chat completion
    response = wrapper.chat([
        {"role": "user", "content": "Write a haiku about coding"},
    ])
    print(response)
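Since generate yields text chunks, a non-streaming call is just a join over the stream. A small helper along these lines (hypothetical; not part of the ZSE API) can be handy:

```python
from typing import Iterable

def generate_text(chunks: Iterable[str]) -> str:
    """Collect streamed text chunks (e.g. from wrapper.generate(...)) into one string."""
    return "".join(chunks)

# Stand-in for a wrapper.generate(...) stream:
print(generate_text(iter(["Hello", ", ", "world!"])))  # Hello, world!
```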

CLI

bash
# Auto-detect GGUF format and serve
zse serve model-Q4_K_M.gguf
# Show GGUF metadata
zse info model-Q4_K_M.gguf
# Run inference directly
zse infer model-Q4_K_M.gguf --prompt "Hello, world!"

Reading Metadata

python
from zse.gguf import GGUFReader
reader = GGUFReader("model-Q4_K_M.gguf")
metadata = reader.read_metadata()
print(f"Architecture: {metadata['architecture']}")
print(f"Context Length: {metadata['context_length']}")
print(f"Layers: {metadata['num_layers']}")
print(f"Quantization: {metadata['quantization_type']}")

GPU Offloading

Configure how many layers to offload to GPU for faster inference:

python
from zse.gguf import GGUFWrapper

# Offload all layers to GPU (fastest, requires the most VRAM)
wrapper = GGUFWrapper("model.gguf", n_gpu_layers=-1)

# Offload the first 20 layers to GPU
wrapper = GGUFWrapper("model.gguf", n_gpu_layers=20)

# CPU only (no GPU offloading)
wrapper = GGUFWrapper("model.gguf", n_gpu_layers=0)

# Auto-detect optimal layers based on available VRAM (default behavior)
wrapper = GGUFWrapper("model.gguf")

The more layers you offload to GPU, the faster inference runs, but each offloaded layer consumes VRAM. If you run out of memory, reduce n_gpu_layers.
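One way to reason about auto-detection is a simple heuristic: divide the model size by the layer count and offload as many layers as fit in free VRAM. The sketch below illustrates that idea; it is not ZSE's actual algorithm, and the numbers are examples only.

```python
def pick_gpu_layers(model_size_gb: float, num_layers: int, free_vram_gb: float) -> int:
    """Heuristic: offload as many layers as fit, assuming uniform per-layer size."""
    if free_vram_gb <= 0:
        return 0
    per_layer_gb = model_size_gb / num_layers
    return min(num_layers, int(free_vram_gb // per_layer_gb))

# e.g. a 4 GB model with 32 layers and 3 GB of free VRAM:
print(pick_gpu_layers(4.0, 32, 3.0))  # 24
```

Note this ignores the KV cache and activation memory, which also live in VRAM, so a real heuristic would leave headroom.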

.zse vs GGUF

While GGUF is widely supported, native .zse format offers significant advantages:

Feature              .zse                       GGUF
Memory Allocation    Streaming (on-demand)      Static (all at once)
Cold Start           3.9s (7B model)            10-30s typical
Memory Efficiency    Load only needed layers    Full model in RAM+VRAM
Quantization         INT4 @ full precision      Various (Q4_K_M, etc.)
Use GGUF for compatibility with existing model files. Convert to .zse for optimal performance with ZSE's streaming inference engine.

bash
# Convert GGUF to .zse for better performance
zse convert model-Q4_K_M.gguf -o model.zse