GGUF Compatibility
Run GGUF models through the llama.cpp backend for compatibility with existing model files.
Overview
ZSE includes full support for GGUF models via the llama-cpp-python backend. This lets you run existing GGUF models from Hugging Face or other sources while maintaining API compatibility with the rest of ZSE.
- **GGUF v2/v3**: full support for modern GGUF format versions
- **All Quant Types**: Q4_K_M, Q5_K_M, Q8_0, and more
- **GPU Offloading**: configurable layer offloading to GPU
- Parse GGUF v2/v3 format metadata
- Support all GGML quantization types
- Streaming and non-streaming generation
- Chat completion support
- GPU layer offloading configuration
- Seamless integration with ZSE server
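To make the format-detection step concrete, here is a minimal sketch of reading a GGUF header. Per the GGUF specification, a file starts with the 4-byte magic `GGUF`, a little-endian uint32 version, then uint64 tensor and metadata-KV counts (uint64 applies to GGUF v2/v3). The `parse_gguf_header` helper is illustrative and not part of the ZSE API:

```python
import struct

def parse_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header: magic, version, tensor/KV counts."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"Not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}
```

ZSE's own `is_gguf_file` presumably performs a similar magic-byte check before handing the file to the llama.cpp backend.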
Supported Formats
ZSE supports all standard GGML quantization types found in GGUF files:
| Format | Bits | Use Case |
|---|---|---|
| Q4_K_M | 4-bit | Best balance of size/quality |
| Q5_K_M | 5-bit | Higher quality, slightly larger |
| Q8_0 | 8-bit | Near-lossless quality |
| Q2_K | 2-bit | Maximum compression |
| Q6_K | 6-bit | High quality |
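As a rough rule of thumb for the table above, on-disk size is approximately parameter count times effective bits per weight, divided by eight. The helper below is illustrative (not part of ZSE), and the effective bits-per-weight figures are approximations, since block scales and metadata add some overhead:

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough on-disk size in GB: parameters * bits per weight / 8."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate sizes for a 7B-parameter model (effective bpw is inexact)
for name, bpw in [("Q2_K", 2.6), ("Q4_K_M", 4.8), ("Q8_0", 8.5)]:
    print(f"{name}: ~{approx_size_gb(7e9, bpw):.1f} GB")
```

This is why a 7B model that needs ~14 GB in FP16 fits comfortably on consumer hardware at Q4_K_M.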
Installation
CPU Only
```bash
pip install llama-cpp-python
```

With CUDA (Recommended)

```bash
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
```

With Metal (macOS)

```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```

GPU acceleration is highly recommended. CPU-only inference is 10-50x slower.
Usage
Python API
```python
from zse.gguf import GGUFWrapper, is_gguf_file

# Check if file is GGUF format
if is_gguf_file("model-Q4_K_M.gguf"):
    # Create wrapper (matches IntelligenceOrchestrator API)
    wrapper = GGUFWrapper("model-Q4_K_M.gguf")
    wrapper.load()

    # Streaming generation
    for text in wrapper.generate("Hello, how are you?"):
        print(text, end="")

    # Chat completion
    response = wrapper.chat([
        {"role": "user", "content": "Write a haiku about coding"}
    ])
    print(response)
```

CLI

```bash
# Auto-detect GGUF format and serve
zse serve model-Q4_K_M.gguf

# Show GGUF metadata
zse info model-Q4_K_M.gguf

# Run inference directly
zse infer model-Q4_K_M.gguf --prompt "Hello, world!"
```

Reading Metadata

```python
from zse.gguf import GGUFReader

reader = GGUFReader("model-Q4_K_M.gguf")
metadata = reader.read_metadata()

print(f"Architecture: {metadata['architecture']}")
print(f"Context Length: {metadata['context_length']}")
print(f"Layers: {metadata['num_layers']}")
print(f"Quantization: {metadata['quantization_type']}")
```
Configure how many layers to offload to GPU for faster inference:
```python
from zse.gguf import GGUFWrapper

# Offload all layers to GPU (fastest, requires most VRAM)
wrapper = GGUFWrapper("model.gguf", n_gpu_layers=-1)

# Offload first 20 layers to GPU
wrapper = GGUFWrapper("model.gguf", n_gpu_layers=20)

# CPU only (no GPU offloading)
wrapper = GGUFWrapper("model.gguf", n_gpu_layers=0)

# Auto-detect optimal layers based on available VRAM (default behavior)
wrapper = GGUFWrapper("model.gguf")
```

The more layers offloaded to GPU, the faster inference will be; however, each offloaded layer requires VRAM. If you run out of memory, reduce `n_gpu_layers`.
.zse vs GGUF
While GGUF is widely supported, native .zse format offers significant advantages:
| Feature | .zse | GGUF |
|---|---|---|
| Memory Allocation | Streaming (on-demand) | Static (all at once) |
| Cold Start | 3.9s (7B model) | 10-30s typical |
| Memory Efficiency | Load only needed layers | Full model in RAM+VRAM |
| Quantization | INT4 @ full precision | Various (Q4_K_M, etc.) |
Use GGUF for compatibility with existing model files. Convert to .zse for optimal performance with ZSE's streaming inference engine.
```bash
# Convert GGUF to .zse for better performance
zse convert model-Q4_K_M.gguf -o model.zse
```