Getting Started

Introduction

ZSE (Z Server Engine) is a memory-efficient LLM inference engine designed for fast cold starts and a low runtime footprint.

What is ZSE?

ZSE is an inference engine that loads large language models in seconds, not minutes. It achieves this through pre-quantized model formats that skip runtime quantization entirely.

Why ZSE?

Traditional inference engines like vLLM and transformers spend 30-60 seconds quantizing models at startup. ZSE's .zse format eliminates this overhead, enabling sub-4-second cold starts for 7B models.

Whether you're building serverless AI endpoints, developing locally, or deploying to production, ZSE helps you iterate faster and reduce costs.

Key Features

zQuantize

Pre-quantize models to INT4/NF4 format for instant loading

zServe

OpenAI-compatible API server with streaming support

zInfer

CLI tool for quick model testing and inference

zStream

Layer streaming for running large models on limited VRAM
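
Conceptually, layer streaming bounds memory by keeping only a small window of layers resident at once: load a layer's weights, run it, and evict the oldest before loading more. A minimal sketch of that loop (the `load_layer`/`run_layer` callbacks are illustrative stand-ins, not ZSE's actual API):

```python
from collections import OrderedDict

def stream_forward(num_layers, load_layer, run_layer, x, window=2):
    """Run a forward pass while keeping at most `window` layers in memory."""
    resident = OrderedDict()  # layer index -> loaded weights
    peak = 0
    for i in range(num_layers):
        if len(resident) >= window:       # evict the oldest layer first
            resident.popitem(last=False)
        resident[i] = load_layer(i)       # e.g. disk/CPU -> GPU transfer
        peak = max(peak, len(resident))
        x = run_layer(resident[i], x)     # compute with the resident layer
    return x, peak
```

Peak residency stays at `window` regardless of model depth, which is how a model too large for VRAM can still run end to end.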

zKV

Quantized KV cache for 4× memory savings
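
The 4× figure follows directly from element width: FP16 keys and values take 2 bytes per element, while a 4-bit cache takes 0.5. Plugging in typical 7B-class dimensions (illustrative numbers, not measurements from ZSE):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # 2x for keys and values; one entry per layer/head/position/channel
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

fp16 = kv_cache_bytes(28, 4, 128, 32768, 2)    # FP16 cache: 2 bytes/element
int4 = kv_cache_bytes(28, 4, 128, 32768, 0.5)  # 4-bit cache: 0.5 bytes/element
print(f"FP16: {fp16 / 2**30:.2f} GiB, INT4: {int4 / 2**30:.2f} GiB")
```

In practice a quantized cache also stores per-group scale factors, so real savings land slightly under the ideal 4×.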

OpenAI API

Drop-in replacement for OpenAI's chat completions API

  • 3.9s cold start for 7B models (11.6× faster than bitsandbytes)
  • 21.4s cold start for 32B models (5.6× faster)
  • 63-72% memory savings with INT4 quantization
  • GGUF model import support
  • Multi-model management
  • Streaming token generation
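
The 63-72% savings range is roughly what 4-bit storage predicts: quantized weights shrink to a quarter of their FP16 size, while per-group scale factors and tensors kept in FP16 (embeddings, norms) pull the end-to-end figure below the ideal 75%. A back-of-envelope estimate (the FP16 fraction and group size here are illustrative assumptions, not ZSE internals):

```python
def int4_savings(params_b, fp16_fraction=0.1, group_size=64):
    """Estimated memory savings of INT4 vs. FP16 weight storage."""
    fp16_total = params_b * 2.0                      # GB at 2 bytes/param
    quantized = params_b * (1 - fp16_fraction)       # params stored in 4-bit
    int4_total = (quantized * 0.5                    # 4-bit weights
                  + (quantized / group_size) * 2.0   # FP16 scale per group
                  + params_b * fp16_fraction * 2.0)  # tensors left in FP16
    return 1 - int4_total / fp16_total

print(f"7B model: ~{int4_savings(7):.0%} savings")
```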

Benchmarks

Cold start benchmarks on A100-80GB with Qwen 2.5 Coder models:

| Model | bitsandbytes | ZSE (.zse) | Speedup |
|-------|--------------|------------|---------|
| 7B    | 45.3s        | 3.9s       | 11.6×   |
| 14B   | 78.2s        | 8.1s       | 9.7×    |
| 32B   | 120.0s       | 21.4s      | 5.6×    |

Note: 32B models require 24GB+ VRAM. Use zStream layer offloading for GPUs with less memory.
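
The 24GB floor is what the weights alone dictate: 32 billion parameters at 4 bits is 16GB before scales, KV cache, and activations. A quick sanity check (the group size is an illustrative assumption):

```python
params = 32e9
weights_gb = params * 0.5 / 1e9    # 4-bit weights: 0.5 bytes/param
scales_gb = params / 64 * 2 / 1e9  # one FP16 scale per 64-weight group
total_gb = weights_gb + scales_gb
print(f"~{total_gb:.0f} GB for weights alone")
```

On a 24GB card that leaves only a few GB of headroom for the KV cache and activations, which is why smaller GPUs need zStream offloading.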

Quick Install

Install ZSE from PyPI:

```bash
pip install zllm-zse
```

For GGUF model support, install with the optional dependency:

```bash
# quotes keep shells like zsh from expanding the brackets
pip install "zllm-zse[gguf]"
```

Start a server with a pre-trained model:

```bash
# Start the server
zse serve Qwen/Qwen2.5-7B-Instruct

# Or with custom settings
zse serve Qwen/Qwen2.5-7B-Instruct --port 8080 --host 0.0.0.0
```
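
Once the server is up it speaks the OpenAI chat completions wire format, so any OpenAI-style client works by swapping the base URL. A minimal stdlib client sketch (the `http://localhost:8000/v1` base URL assumes a default host, port, and `/v1` prefix; adjust to match your `--host`/`--port` settings):

```python
import json
import urllib.request

def build_request(prompt, model="Qwen/Qwen2.5-7B-Instruct"):
    """Standard OpenAI chat completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt, base_url="http://localhost:8000/v1"):
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

For example, `chat("Write a haiku about cold starts")` returns the assistant's reply as a string.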

Next Steps