Benchmarks
Real measurements on A100-80GB. No inflated claims, just verified data.
All tests conducted February 2026 on Modal infrastructure. Results may vary based on hardware.
Cold Start Performance
Time from process start to the first generated token. Critical for serverless and auto-scaling deployments.
Qwen 2.5 Coder 7B (A100-80GB)
| Method | Cold Start | Speedup | Notes |
|---|---|---|---|
| bitsandbytes NF4 (first run) | 216.7s | — | Downloads + quantizes |
| bitsandbytes NF4 (warm cache) | 45.4s | baseline | Weights cached on disk |
| .zse pre-quantized | 3.9s | 11.6× | Full cold start |
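The cold-start metric above can be measured with a simple wall-clock timer around model load plus first-token generation. A minimal sketch (the `load_model` and `generate_first_token` callables are hypothetical stand-ins, not part of any real API, so the sketch runs anywhere):

```python
import time

def time_to_first_token(load_model, generate_first_token):
    """Cold start = model load time + latency of the first generated token."""
    start = time.perf_counter()
    model = load_model()
    token = generate_first_token(model)
    return time.perf_counter() - start, token

# Stand-in callables so the sketch runs without a GPU; a real run would
# load the quantized weights and call the model's generate method here.
elapsed, token = time_to_first_token(lambda: object(), lambda m: "def")
```

In a real benchmark the timer starts at process launch, so weight download, deserialization, and quantization (where applicable) are all inside the measured window.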
Qwen 2.5 Coder 32B (A100-80GB)
| Method | Cold Start | VRAM | Speedup |
|---|---|---|---|
| bitsandbytes NF4 | 120.0s | 19.25 GB | baseline |
| .zse pre-quantized | 21.4s | 35.39 GB | 5.6× |
Note: 32B .zse requires 35+ GB VRAM. Use NF4 (19.3 GB) on GPUs with less than 36 GB.
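The VRAM trade-off in the note can be encoded as a simple selection rule. A sketch using the figures from the 32B table (the `pick_format` helper and its headroom margin are illustrative assumptions, not part of any real loader):

```python
def pick_format(free_vram_gb, zse_vram_gb=35.39, nf4_vram_gb=19.25):
    """Choose a quantization format from available VRAM.

    Weight footprints come from the 32B table above; 1 GB of headroom
    is left for KV cache and activations (an assumed margin).
    """
    if free_vram_gb >= zse_vram_gb + 1.0:
        return "zse"   # fastest cold start, largest footprint
    if free_vram_gb >= nf4_vram_gb + 1.0:
        return "nf4"   # slower cold start, fits in ~20 GB
    raise ValueError("insufficient VRAM for either format")

# e.g. an A100-80GB takes .zse; a 24 GB A10G falls back to NF4
```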
Methodology
Hardware
- NVIDIA A100-80GB (primary)
- NVIDIA A10G (24 GB, Modal cloud)
- CPU: AMD EPYC / Intel Xeon
- CUDA 12.1+, PyTorch 2.1+
Test Conditions
- Cold start: fresh process, no cached weights
- Warm cache: model weights on disk, GPU free
- Memory: PyTorch memory profiler
- Throughput: average of 5 runs, 256 output tokens
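The throughput condition above (average of 5 runs, 256 output tokens) can be sketched as a small harness; on a real run you would also record peak VRAM with `torch.cuda.max_memory_allocated()`. The `generate` callable is a hypothetical stand-in for the model's generation call:

```python
import time
from statistics import mean

def benchmark_throughput(generate, n_runs=5, n_tokens=256):
    """Average tokens/sec over n_runs, matching the conditions above."""
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate(n_tokens)  # assumed to emit n_tokens output tokens
        rates.append(n_tokens / (time.perf_counter() - start))
    return mean(rates)

# Stand-in generate so the sketch runs anywhere; a real benchmark would
# call model.generate(..., max_new_tokens=n_tokens) here.
avg_tps = benchmark_throughput(lambda n: time.sleep(0.001))
```

Averaging the per-run rates (rather than one long run) smooths out clock-frequency and scheduler jitter between runs.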
Run Your Own Benchmarks
Reproduce these results on your hardware with our benchmark suite.