Benchmarks

Real measurements on an NVIDIA A100-80GB. No inflated claims, just verified data.

All tests conducted February 2026 on Modal infrastructure. Results may vary based on hardware.

Cold Start Performance

Time from process start to the first generated token. Critical for serverless and auto-scaling deployments.
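Concretely, "cold start" here is the wall-clock span from beginning to load until the first token comes back. A minimal sketch of that measurement, where `load_model` and `generate_first_token` are placeholders for the real loader and decode call:

```python
import time

def measure_time_to_first_token(load_model, generate_first_token):
    """Wall-clock seconds from the start of loading until the first
    token is produced. Both arguments are stand-ins for the real
    loader (download/dequantize/move to GPU) and decode call."""
    t0 = time.perf_counter()
    model = load_model()               # load (and possibly quantize) weights
    generate_first_token(model)        # decode until one token is emitted
    return time.perf_counter() - t0
```

Using `time.perf_counter` rather than `time.time` avoids clock-adjustment artifacts during long loads.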

Qwen 2.5 Coder 7B (A100-80GB)

| Method | Cold Start | Speedup | Notes |
|---|---|---|---|
| bitsandbytes NF4 (first run) | 216.7s | n/a | Downloads + quantizes |
| bitsandbytes NF4 (warm cache) | 45.4s | baseline | Weights cached on disk |
| .zse pre-quantized | 3.9s | 11.6× | Full cold start |

Qwen 2.5 Coder 32B (A100-80GB)

| Method | Cold Start | VRAM | Speedup |
|---|---|---|---|
| bitsandbytes NF4 | 120.0s | 19.25 GB | baseline |
| .zse pre-quantized | 21.4s | 35.39 GB | 5.6× |

Note: 32B .zse requires 35+ GB VRAM. Use NF4 (19.3 GB) on GPUs with less than 36 GB.
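The VRAM footprints above can drive format selection at load time. A hypothetical helper using the thresholds from the table (in a real deployment the free-VRAM figure would come from e.g. `torch.cuda.mem_get_info`):

```python
def choose_format(free_vram_gb: float) -> str:
    """Pick a quantization format for the 32B model from available VRAM,
    using the measured footprints: .zse needs ~35.4 GB, NF4 ~19.3 GB."""
    if free_vram_gb >= 36.0:
        return ".zse"    # 5.6x faster cold start, larger footprint
    if free_vram_gb >= 20.0:
        return "nf4"     # fits 24 GB cards such as the A10G
    raise MemoryError(f"{free_vram_gb:.1f} GB is not enough for the 32B model")
```

For example, `choose_format(80.0)` selects `.zse` on an A100-80GB, while `choose_format(24.0)` falls back to NF4 on an A10G.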

Methodology

Hardware

- NVIDIA A100-80GB (primary)
- NVIDIA A10G (24 GB, Modal cloud)
- CPU: AMD EPYC / Intel Xeon
- CUDA 12.1+, PyTorch 2.1+

Test Conditions

- Cold start: fresh process, no cached weights
- Warm cache: model weights on disk, GPU free
- Memory: PyTorch memory profiler
- Throughput: average of 5 runs, 256 output tokens
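The throughput protocol above (mean of five timed generations of 256 tokens each) can be sketched as follows, with `generate_fn` standing in for the real decode call:

```python
import time

def benchmark_throughput(generate_fn, runs: int = 5, tokens: int = 256) -> float:
    """Average tokens/second over `runs` generations of `tokens` tokens each.
    `generate_fn` is a placeholder for the real model's decode call."""
    rates = []
    for _ in range(runs):
        t0 = time.perf_counter()
        generate_fn(tokens)            # produce exactly `tokens` output tokens
        rates.append(tokens / (time.perf_counter() - t0))
    return sum(rates) / len(rates)
```

Averaging the per-run rates (rather than timing one long run) smooths over transient GPU clock and scheduler noise; a warm-up generation before the timed runs is also common.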

Run Your Own Benchmarks

Reproduce these results on your hardware with our benchmark suite.