Ultra Memory-Efficient
LLM Inference

Load 7B models in 3.9 seconds. Run 32B models in 19GB VRAM. OpenAI-compatible API out of the box.

$ pip install zllm-zse
3.9s

Cold Start Time

70%

Memory Saved

11.6×

Faster Loading

100%

API Compatible

Get Running in 4 Steps

From zero to serving models in under a minute

01

Install ZSE

One pip command to get started. No complex dependencies or configurations.

02

Convert to .zse

Convert any HuggingFace model to optimized .zse format with 11× faster loading.

03

Serve Your Model

Start the OpenAI-compatible API server with instant cold starts.

04

Query the API

Use the OpenAI-compatible API with any client library or framework.
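The last step needs nothing beyond a standard chat-completions payload. Here's a minimal client sketch using only the Python standard library, assuming the server from `zse serve` is listening on localhost:8000 (the actual request is left commented out so the snippet stands alone without a running server):

```python
import json
import urllib.request

# Chat-completions payload in the OpenAI-compatible schema the server expects.
payload = {
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once `zse serve` is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Any OpenAI client library works the same way: point its base URL at http://localhost:8000/v1 and keep the rest of your code unchanged.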

Why ZSE?

ZSE solves a real problem: loading large models with bitsandbytes is slow because it quantizes on every load.

bitsandbytes (Standard)

Every time you load a model:

  1. Download FP16 weights (14GB for 7B model)
  2. Quantize to INT4 (takes 40+ seconds)
  3. Finally ready to use
Qwen 7B Load Time: 45.4s

.zse Format (Pre-quantized)

With ZSE, you quantize once, load instantly:

  1. One-time: zse quantize → .zse file
  2. Every load: Read pre-quantized weights (instant)
  3. Ready in seconds, not minutes
Qwen 7B Load Time: 3.9s
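The trade-off above amortizes quickly. A back-of-envelope sketch with the per-load times from the benchmarks below; note that QUANTIZE_ONCE_S is an assumed placeholder for the one-time `zse quantize` step, not a measured number:

```python
# Per-load times are from the benchmark table; the one-time conversion
# cost is a hypothetical placeholder for illustration.
BNB_LOAD_S = 45.4        # bitsandbytes: quantizes on every load
ZSE_LOAD_S = 3.9         # .zse: reads pre-quantized weights
QUANTIZE_ONCE_S = 60.0   # hypothetical one-time `zse quantize` cost

saved_per_load = BNB_LOAD_S - ZSE_LOAD_S
break_even_loads = QUANTIZE_ONCE_S / saved_per_load

print(f"Each load saves {saved_per_load:.1f}s; "
      f"conversion pays for itself after {break_even_loads:.1f} loads")
```

Under these assumptions the conversion pays for itself on roughly the second load, which is why the format targets workloads that load the same model repeatedly.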

For Developers

When to use bitsandbytes: Quick experiments, testing different models, one-off runs.

When to use .zse: Production deployments, serverless functions, CI/CD pipelines, anywhere you load the same model repeatedly and need fast cold starts.
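The "quantize once" step maps naturally onto CI caching. A hypothetical GitHub Actions fragment (the cache key and model path are illustrative, not part of ZSE):

```yaml
# Sketch: quantize once, cache the .zse artifact, and every later run
# takes only the fast pre-quantized load path.
- uses: actions/cache@v4
  with:
    path: ./model.zse
    key: zse-qwen2.5-7b-int4
- run: |
    [ -f ./model.zse ] || zse quantize Qwen/Qwen2.5-7B-Instruct -o ./model.zse
```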

Verified Benchmarks

.zse format vs bitsandbytes on-the-fly quantization. Tested on A100-80GB.

Model      bitsandbytes   ZSE (.zse)   VRAM      Speedup
Qwen 7B    45.4s          3.9s         5.2 GB    11.6×
Qwen 32B   120.0s         21.4s        19.3 GB   5.6×
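The speedup column follows directly from the two load-time columns; a quick sanity check:

```python
# Speedup = bitsandbytes load time / ZSE load time, using the
# benchmark table's numbers.
benchmarks = {
    "Qwen 7B":  (45.4, 3.9),
    "Qwen 32B": (120.0, 21.4),
}

for model, (bnb_s, zse_s) in benchmarks.items():
    print(f"{model}: {bnb_s / zse_s:.1f}x faster")
```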

Load Time Comparison (Qwen 7B)

bitsandbytes: 45.4s
ZSE: 3.9s

Built for Efficiency

Every feature designed for memory efficiency and fast cold starts

11.6×
Faster

3.9s Cold Start

Load Qwen 7B in under 4 seconds. 11.6× faster than bitsandbytes. No more waiting.

70%
Less VRAM

63-70% Memory Savings

Run 32B models in 19GB VRAM. Fit larger models on your existing hardware.
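The 63-70% range is consistent with simple arithmetic against FP16 baselines. A sketch, assuming the usual 2-bytes-per-parameter estimate for FP16 size (the 14 GB figure for 7B appears earlier in this page; 64 GB for 32B is that rule of thumb, not a measurement):

```python
# Savings = 1 - (INT4 VRAM / FP16 size). INT4 numbers are from the
# benchmark table; FP16 sizes are 2-bytes-per-parameter estimates.
models = {
    "Qwen 7B":  {"fp16_gb": 14.0, "zse_gb": 5.2},
    "Qwen 32B": {"fp16_gb": 64.0, "zse_gb": 19.3},
}

for name, m in models.items():
    saving = 1 - m["zse_gb"] / m["fp16_gb"]
    print(f"{name}: {saving:.0%} less VRAM")
```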

3
Platforms

GPU + CPU Support

Auto-detect hardware. Run on CUDA GPUs, Apple Silicon, or CPU-only setups.

100%
Compatible

OpenAI Compatible

Drop-in replacement API. Works with LangChain, OpenAI SDK, and your existing code.

Perfect For

From local development to production deployments

Serverless Inference

Sub-5s cold starts make ZSE perfect for serverless deployments where every millisecond of startup time costs money.

Local AI Development

Run large models on your laptop. Test and iterate without cloud costs or API rate limits.

Edge Deployment

Memory-efficient enough for edge devices. Deploy AI at the edge without expensive hardware.

Cost Optimization

Fit larger models on smaller GPUs. Cut your cloud compute bills by up to 70%.

Verified Models

Tested and optimized for ZSE. VRAM shown for INT4 quantization.

Model                 Provider     Category    VRAM (INT4)   .zse Ready
Qwen 2.5 7B           Alibaba      Chat/Code   4.5 GB        ✓
Qwen 2.5 32B          Alibaba      Chat/Code   19 GB         ✓
Mistral 7B v0.3       Mistral AI   Chat        4.5 GB        ✓
DeepSeek Coder 6.7B   DeepSeek     Code        4 GB          ✓
Llama 3.2 3B          Meta         Chat        2 GB          ✓
Gemma 2 9B            Google       Reasoning   5.5 GB        ✓
Phi-3 Mini            Microsoft    Reasoning   2.4 GB        ✓
TinyLlama 1.1B        TinyLlama    Testing     0.7 GB        ✓

Simple, Powerful API

Start serving models with just a few lines

terminal
# Install ZSE
$ pip install zllm-zse

# Convert to optimized .zse format (11× faster loading)
$ zse quantize Qwen/Qwen2.5-7B-Instruct -o ./model.zse

# Serve your model with instant cold starts
$ zse serve ./model.zse --port 8000

# OpenAI-compatible API is ready!
$ curl localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"default","messages":[{"role":"user","content":"Hello!"}]}'

Ready to Try ZSE?

Get memory-efficient LLM inference with fast cold starts. Install and start serving in minutes.

Apache 2.0 Licensed
Open Source
PyPI Published