GGUF Compatibility

Run GGUF models via llama.cpp backend for compatibility with existing model files.

Overview

ZSE includes full support for GGUF models via the llama-cpp-python backend. This lets you run existing GGUF models from Hugging Face or other sources while maintaining API compatibility with the rest of ZSE.

GGUF v2/v3

Full support for modern GGUF format versions

All Quant Types

Q4_K_M, Q5_K_M, Q8_0, and more

GPU Offloading

Configurable layer offloading to GPU

Key capabilities:

  • Parse GGUF v2/v3 format metadata
  • Support all GGML quantization types
  • Streaming and non-streaming generation
  • Chat completion support
  • GPU layer offloading configuration
  • Seamless integration with ZSE server

Supported Formats

ZSE supports all standard GGML quantization types found in GGUF files:

Format    Bits    Use Case
Q4_K_M    4-bit   Best balance of size and quality
Q5_K_M    5-bit   Higher quality, slightly larger
Q8_0      8-bit   Near-lossless quality
Q2_K      2-bit   Maximum compression
Q6_K      6-bit   High quality
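As a rule of thumb, file size scales with bits per weight. The sketch below is an approximation only: real K-quant files store per-block scales, so the effective bits per weight run slightly higher than the nominal number (roughly 4.5 for Q4_K_M).

```python
def approx_gguf_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size in GB: parameters x bits / 8, ignoring metadata overhead."""
    return n_params_billion * bits_per_weight / 8

# A 7B model at ~4.5 effective bits per weight (roughly Q4_K_M):
print(f"{approx_gguf_size_gb(7, 4.5):.1f} GB")  # 3.9 GB
```

This makes it easy to compare formats: the same 7B model at Q8_0 (~8.5 effective bits) roughly doubles the file size.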

Installation

CPU Only

bash
pip install llama-cpp-python

With CUDA (Recommended)

bash
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python

With Metal (macOS)

bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

GPU acceleration is highly recommended. CPU-only inference is 10-50x slower.

Usage

Python API

python
from zse.gguf import GGUFWrapper, is_gguf_file

# Check if the file is GGUF format
if is_gguf_file("model-Q4_K_M.gguf"):
    # Create wrapper (matches IntelligenceOrchestrator API)
    wrapper = GGUFWrapper("model-Q4_K_M.gguf")
    wrapper.load()

    # Streaming generation
    for text in wrapper.generate("Hello, how are you?"):
        print(text, end="")

    # Chat completion
    response = wrapper.chat([
        {"role": "user", "content": "Write a haiku about coding"},
    ])
    print(response)
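Since generate yields text chunks, a non-streaming call is just a join over the stream. A small helper along these lines (hypothetical; not part of the ZSE API) can be handy:

```python
from typing import Iterable

def generate_text(chunks: Iterable[str]) -> str:
    """Collect streamed text chunks (e.g. from wrapper.generate(...)) into one string."""
    return "".join(chunks)

# Stand-in for a wrapper.generate(...) stream:
print(generate_text(iter(["Hello", ", ", "world!"])))  # Hello, world!
```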

CLI

bash
# Auto-detect GGUF format and serve
zse serve model-Q4_K_M.gguf
# Show GGUF metadata
zse info model-Q4_K_M.gguf
# Run inference directly
zse infer model-Q4_K_M.gguf --prompt "Hello, world!"

Reading Metadata

python
from zse.gguf import GGUFReader
reader = GGUFReader("model-Q4_K_M.gguf")
metadata = reader.read_metadata()
print(f"Architecture: {metadata['architecture']}")
print(f"Context Length: {metadata['context_length']}")
print(f"Layers: {metadata['num_layers']}")
print(f"Quantization: {metadata['quantization_type']}")

GPU Offloading

Configure how many layers to offload to GPU for faster inference:

python
from zse.gguf import GGUFWrapper

# Offload all layers to GPU (fastest, requires the most VRAM)
wrapper = GGUFWrapper("model.gguf", n_gpu_layers=-1)

# Offload the first 20 layers to GPU
wrapper = GGUFWrapper("model.gguf", n_gpu_layers=20)

# CPU only (no GPU offloading)
wrapper = GGUFWrapper("model.gguf", n_gpu_layers=0)

# Auto-detect optimal layers based on available VRAM (default behavior)
wrapper = GGUFWrapper("model.gguf")

The more layers you offload to GPU, the faster inference runs, but each offloaded layer consumes VRAM. If you run out of memory, reduce n_gpu_layers.
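One way to reason about auto-detection is a simple heuristic: divide the model size by the layer count and offload as many layers as fit in free VRAM. The sketch below illustrates that idea; it is not ZSE's actual algorithm, and the numbers are examples only.

```python
def pick_gpu_layers(model_size_gb: float, num_layers: int, free_vram_gb: float) -> int:
    """Heuristic: offload as many layers as fit, assuming uniform per-layer size."""
    if free_vram_gb <= 0:
        return 0
    per_layer_gb = model_size_gb / num_layers
    return min(num_layers, int(free_vram_gb // per_layer_gb))

# e.g. a 4 GB model with 32 layers and 3 GB of free VRAM:
print(pick_gpu_layers(4.0, 32, 3.0))  # 24
```

Note this ignores the KV cache and activation memory, which also live in VRAM, so a real heuristic would leave headroom.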

.zse vs GGUF

While GGUF is widely supported, native .zse format offers significant advantages:

Feature              .zse                       GGUF
Memory Allocation    Streaming (on-demand)      Static (all at once)
Cold Start           3.9s (7B model)            10-30s typical
Memory Efficiency    Load only needed layers    Full model in RAM+VRAM
Quantization         INT4 @ full precision      Various (Q4_K_M, etc.)
Use GGUF for compatibility with existing model files. Convert to .zse for optimal performance with ZSE's streaming inference engine.

bash
# Convert GGUF to .zse for better performance
zse convert model-Q4_K_M.gguf -o model.zse