
Introducing ZSE: 3.9s Cold Starts for LLM Inference

ZSE Team · February 25, 2026 · 5 min read

Introducing ZSE

Today we're releasing ZSE (Z Server Engine), an ultra memory-efficient LLM inference engine that achieves **3.9 second cold starts** for 7B models.

The Problem

Loading large language models is slow. A typical 7B model with bitsandbytes takes 45+ seconds to load. This makes serverless deployments expensive and development iteration slow.
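To see why load times land in that range, here is a back-of-envelope estimate. The disk and quantization throughput numbers are assumptions for illustration, not measured benchmarks; the point is that runtime quantization, not disk I/O, dominates.

```python
# Back-of-envelope estimate of a 7B-model load (illustrative numbers,
# not measured benchmarks): reading FP16 weights is fast; runtime
# quantization dominates the 45+ second figure.
params = 7e9                # 7B parameters
fp16_bytes = params * 2     # 14 GB of FP16 weights on disk
disk_gbps = 2.0             # assumed NVMe read throughput, GB/s
quant_gbps = 0.4            # assumed runtime-quantization throughput, GB/s

read_s = fp16_bytes / (disk_gbps * 1e9)
quant_s = fp16_bytes / (quant_gbps * 1e9)
print(f"read ~{read_s:.0f}s + quantize ~{quant_s:.0f}s = ~{read_s + quant_s:.0f}s")
```

Skipping the quantization term entirely is what makes a single-digit-second cold start plausible.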

Our Solution

ZSE introduces the `.zse` format: pre-quantized model files that skip runtime quantization entirely. The result:

**7B models**: 3.9s cold start (11.6× faster)

**32B models**: 21.4s cold start (5.6× faster)

**63-72% memory savings** compared to FP16
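The memory numbers are consistent with 4-bit weight storage. As a rough check, here is the arithmetic assuming 4-bit weights with one FP16 scale per 32-weight group (an assumed layout for illustration; the actual `.zse` layout is not documented here):

```python
# Rough check of the savings claim, assuming 4-bit weights plus an
# amortized FP16 scale per 32-weight group (assumed layout).
fp16_bytes_per_param = 2.0
q4_bytes_per_param = 0.5 + 2.0 / 32   # 4-bit weight + scale overhead

savings = 1 - q4_bytes_per_param / fp16_bytes_per_param
print(f"{savings:.1%} memory saved vs FP16")  # 71.9%, the top of the range
```

Coarser quantization or extra per-tensor metadata would land lower in the 63-72% band.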

How It Works

1. **Pre-quantization**: Convert once, load fast forever

2. **Memory mapping**: Direct tensor loading from disk

3. **Lazy initialization**: Only load what's needed

4. **OpenAI-compatible API**: Drop-in replacement
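Steps 1-3 can be sketched with NumPy's `memmap`: weights are converted once to a raw on-disk layout, then mapped directly into the address space so the OS pages tensors in lazily on first access. This is an illustrative stand-in, not ZSE's actual loader:

```python
# Minimal sketch of pre-quantize + memory-map + lazy load using NumPy.
# (Illustrative only; the real .zse format and loader are not shown.)
import numpy as np
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "layer0.bin")

# "Convert once": write a pre-quantized tensor to disk (int8 stand-in).
weights = np.arange(-64, 64, dtype=np.int8)
weights.tofile(path)

# "Load fast": map the file instead of reading and dequantizing it.
# No data is copied here; pages are faulted in on first access.
mapped = np.memmap(path, dtype=np.int8, mode="r", shape=(128,))
print(int(mapped[0]), int(mapped[-1]))
```

Because the mapping itself is nearly free, cold-start cost shifts from "read and quantize everything" to "touch only the pages you need".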

Try It Now

```bash
pip install zllm-zse

# Convert once to the pre-quantized .zse format
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen-7b.zse

# Serve it
zse serve qwen-7b.zse
```
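Since the server exposes an OpenAI-compatible API, a request is just a standard chat-completions call. The port and endpoint path below are assumptions (adjust to whatever `zse serve` prints on startup), and the network call is left as a comment so the sketch stands alone:

```python
# Sketch of a chat-completions request against a running `zse serve`
# instance. Port 8000 and the model name are assumptions.
import json

payload = {
    "model": "qwen-7b.zse",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
body = json.dumps(payload)

# With the server running locally, send it with any HTTP client, e.g.:
#   import requests
#   r = requests.post("http://localhost:8000/v1/chat/completions",
#                     data=body,
#                     headers={"Content-Type": "application/json"})
#   print(r.json()["choices"][0]["message"]["content"])
print(body)
```

Existing OpenAI SDK clients should work unchanged by pointing their base URL at the local server.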

We're just getting started. Follow us for updates on zStream, zKV, and more features.
