Ultra Memory-Efficient
LLM Inference
Load 7B models in 3.9 seconds. Run 32B models in 19GB VRAM. OpenAI-compatible API out of the box.
Cold Start Time
Memory Saved
Faster Loading
API Compatible
Get Running in 4 Steps
From zero to serving models in under a minute
Install ZSE
One pip command to get started. No complex dependencies or configurations.
Convert to .zse
Convert any HuggingFace model to optimized .zse format with 11× faster loading.
Serve Your Model
Start the OpenAI-compatible API server with instant cold starts.
Query the API
Use the OpenAI-compatible API with any client library or framework.
Why ZSE?
ZSE solves a real problem: loading large models with bitsandbytes is slow because it quantizes on every load.
bitsandbytes (Standard)
Every time you load a model:
- Download FP16 weights (14GB for 7B model)
- Quantize to INT4 (takes 40+ seconds)
- Finally ready to use
.zse Format (Pre-quantized)
With ZSE, you quantize once, load instantly:
- One-time: `zse quantize` → .zse file
- Every load: read pre-quantized weights (instant)
- Ready in seconds, not minutes
For Developers
When to use bitsandbytes: Quick experiments, testing different models, one-off runs.
When to use .zse: Production deployments, serverless functions, CI/CD pipelines, anywhere you load the same model repeatedly and need fast cold starts.
Verified Benchmarks
.zse format vs bitsandbytes on-the-fly quantization. Tested on A100-80GB.
| Model | bitsandbytes | ZSE (.zse) | VRAM | Speedup |
|---|---|---|---|---|
| Qwen 7B | 45.4s | 3.9s | 5.2GB | 11.6× |
| Qwen 32B | 120.0s | 21.4s | 19.3GB | 5.6× |
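The Speedup column follows directly from the two load-time columns; a quick sanity check (pure arithmetic, no ZSE install required):

```python
# The Speedup column is the bitsandbytes load time divided by the
# .zse load time, using the measurements from the table above.
speedup_7b = 45.4 / 3.9     # Qwen 7B
speedup_32b = 120.0 / 21.4  # Qwen 32B

print(f"{speedup_7b:.1f}x, {speedup_32b:.1f}x")  # 11.6x, 5.6x
```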
Load Time Comparison (Qwen 7B)
Built for Efficiency
Every feature designed for memory efficiency and fast cold starts
3.9s Cold Start
Load Qwen 7B in under 4 seconds. 11.6× faster than bitsandbytes. No more waiting.
63-70% Memory Savings
Run 32B models in 19GB VRAM. Fit larger models on your existing hardware.
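Where the 63-70% range comes from, as a back-of-the-envelope sketch: the FP16 baseline assumes the usual 2 bytes per parameter, and the INT4 VRAM figures are the measurements from the benchmark table.

```python
# Back-of-the-envelope memory savings: FP16 weights need ~2 bytes per
# parameter, so a 7B model is ~14 GB and a 32B model is ~64 GB.
def fp16_gb(params_billions):
    return params_billions * 2  # 2 bytes/param -> ~GB per billion params

savings_7b = 1 - 5.2 / fp16_gb(7)     # measured INT4 VRAM: 5.2 GB
savings_32b = 1 - 19.3 / fp16_gb(32)  # measured INT4 VRAM: 19.3 GB

print(f"{savings_7b:.0%}, {savings_32b:.0%}")  # 63%, 70%
```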
GPU + CPU Support
Auto-detect hardware. Run on CUDA GPUs, Apple Silicon, or CPU-only setups.
OpenAI Compatible
Drop-in replacement API. Works with LangChain, OpenAI SDK, and your existing code.
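Because the server speaks the OpenAI chat-completions protocol, any HTTP client works. A minimal sketch using only the Python standard library, assuming a server started with `zse serve ./model.zse --port 8000` (the `localhost:8000` URL and `"default"` model name mirror the curl example further down):

```python
import json
import urllib.request

def chat_request(prompt, base_url="http://localhost:8000"):
    """Build an OpenAI-style /v1/chat/completions request (not yet sent)."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("Hello!")
# With `zse serve` running, urllib.request.urlopen(req) returns the
# standard OpenAI-format JSON completion.
```

The same endpoint also works with the official OpenAI SDK by pointing its `base_url` at `http://localhost:8000/v1`.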
Perfect For
From local development to production deployments
Serverless Inference
Sub-5s cold starts make ZSE perfect for serverless deployments where every millisecond of startup time costs money.
Local AI Development
Run large models on your laptop. Test and iterate without cloud costs or API rate limits.
Edge Deployment
Memory-efficient enough for edge devices. Deploy AI at the edge without expensive hardware.
Cost Optimization
Fit larger models on smaller GPUs. Cut your cloud compute bills by up to 70%.
Verified Models
Tested and optimized for ZSE. VRAM shown for INT4 quantization.
| Model | Provider | Category | VRAM (INT4) | .zse Ready |
|---|---|---|---|---|
| Qwen 2.5 7B | Alibaba | Chat/Code | 4.5 GB | ✓ |
| Qwen 2.5 32B | Alibaba | Chat/Code | 19 GB | ✓ |
| Mistral 7B v0.3 | Mistral AI | Chat | 4.5 GB | ✓ |
| DeepSeek Coder 6.7B | DeepSeek | Code | 4 GB | ✓ |
| Llama 3.2 3B | Meta | Chat | 2 GB | — |
| Gemma 2 9B | Google | Reasoning | 5.5 GB | — |
| Phi-3 Mini | Microsoft | Reasoning | 2.4 GB | — |
| TinyLlama 1.1B | TinyLlama | Testing | 0.7 GB | — |
Simple, Powerful API
Start serving models with just a few lines
# Install ZSE
$ pip install zllm-zse
# Convert to optimized .zse format (11× faster loading)
$ zse quantize Qwen/Qwen2.5-7B-Instruct -o ./model.zse
# Serve your model with instant cold starts
$ zse serve ./model.zse --port 8000
# OpenAI-compatible API is ready!
$ curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello!"}]}'
Ready to Try ZSE?
Get memory-efficient LLM inference with fast cold starts. Install and start serving in minutes.