Benchmarking Your ZSE Setup

Measure cold start, throughput, latency, and GPU memory use on your own hardware.

Built-in Benchmark Command

zse benchmark qwen-7b.zse

Output:

┌────────────────────────────────────────────────┐
│ Benchmark Results: qwen-7b.zse                 │
├────────────────────────────────────────────────┤
│ Cold Start:     3.9s                           │
│ Throughput:     87.3 tok/s                     │
│ Time-to-First:  52ms                           │
│ Latency (p50):  11.4ms/tok                     │
│ Latency (p99):  18.2ms/tok                     │
│ GPU Memory:     5.2 GB                         │
└────────────────────────────────────────────────┘
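As a rough sanity check on a report like this, per-token latency and single-stream throughput should be close to reciprocal. The sketch below does that arithmetic in plain Python, using the p50 latency and throughput from the sample output. The helper name `tokens_per_second` is illustrative and not part of ZSE.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    if ms_per_token <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / ms_per_token


# Values copied from the sample benchmark output.
p50_ms = 11.4
reported_throughput = 87.3

implied = tokens_per_second(p50_ms)
print(f"Implied by p50 latency: {implied:.1f} tok/s")
print(f"Reported throughput:    {reported_throughput:.1f} tok/s")
```

The p50 latency implies about 87.7 tok/s, which agrees with the reported 87.3 tok/s. A large gap between the two on your own run would suggest the metrics were measured under different conditions, such as different batch sizes.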

Specific Benchmarks

Cold Start Only

zse benchmark model.zse --metric cold-start --runs 5
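If you want to time loads yourself rather than rely on the CLI, the idea behind repeated runs can be sketched as follows. This is a minimal illustration, not how `zse benchmark` is implemented: `time_loads` and `fake_load` are hypothetical names, and `fake_load` is a stand-in you would replace with your real loader. Repeated loads inside one process are usually warm (the OS has cached the file), so a true cold-start number generally needs a fresh process per run.

```python
import statistics
import time
from typing import Callable, List


def time_loads(load: Callable[[], object], runs: int = 5) -> List[float]:
    """Call `load` `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        load()
        durations.append(time.perf_counter() - start)
    return durations


def fake_load() -> None:
    # Hypothetical stand-in for loading a model; swap in a real loader.
    time.sleep(0.01)


durations = time_loads(fake_load, runs=5)
print(
    f"mean: {statistics.mean(durations):.3f}s  "
    f"min: {min(durations):.3f}s  max: {max(durations):.3f}s"
)
```

Reporting the minimum and maximum alongside the mean makes outliers visible, for example a first run that is much slower because the file was read from disk.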

Throughput Test

zse benchmark model.zse --metric throughput \
    --prompt-length 512 \
    --output-length 256 \
    --batch-sizes 1,4,8,16

Memory Profiling

zse benchmark model.zse --metric memory \
    --context-lengths 1024,4096,8192,16384

Compare Configurations

Compare quantization types

zse benchmark model-nf4.zse model-int4.zse model-int8.zse

Compare context lengths

zse benchmark model.zse --sweep max-context 1024:16384:2x
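The sweep argument `1024:16384:2x` reads as "start at 1024, stop at 16384, double each step". That interpretation is an assumption based on the syntax alone; the sketch below shows which values it would produce. `expand_sweep` is a hypothetical helper, not a ZSE function.

```python
from typing import List


def expand_sweep(spec: str) -> List[int]:
    """Expand 'start:stop:Nx' into a geometric list of values, stop included."""
    start_s, stop_s, step_s = spec.split(":")
    start, stop = int(start_s), int(stop_s)
    if not step_s.endswith("x"):
        raise ValueError(f"expected a multiplier like '2x', got {step_s!r}")
    factor = int(step_s[:-1])
    if start <= 0 or factor < 2:
        raise ValueError("start must be positive and the multiplier at least 2")

    values = []
    current = start
    while current <= stop:
        values.append(current)
        current *= factor
    return values


print(expand_sweep("1024:16384:2x"))
# [1024, 2048, 4096, 8192, 16384]
```

Under this reading the sweep benchmarks five context lengths, so expect it to take roughly five times as long as a single run.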

Python Benchmarking

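from zllm_zse import ZSE, benchmark

# Load the model once; every prompt below reuses it.
model = ZSE("qwen-7b.zse")

# One sentence repeated 10 times; the trailing space keeps the words separated.
prompt = "Explain quantum computing " * 10

results = benchmark(
    model,
    prompts=[prompt for _ in range(100)],  # 100 identical prompts
    max_tokens=256,
)

print(f"Mean throughput: {results.throughput_mean:.1f} tok/s")
print(f"p99 latency: {results.latency_p99:.1f} ms/tok")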
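To make the p50 and p99 figures concrete, here is one common way to derive percentiles from raw per-token latencies, the nearest-rank method. How the library computes `latency_p99` internally is not stated here, so treat this as an illustration only. The `percentile` helper and the sample latencies are made up for the example.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Illustrative per-token latencies in milliseconds (not measured data).
latencies_ms = [10.9, 11.2, 11.4, 11.5, 11.8, 12.1, 12.6, 13.0, 14.7, 18.2]

print(f"p50: {percentile(latencies_ms, 50):.1f} ms/tok")
print(f"p99: {percentile(latencies_ms, 99):.1f} ms/tok")
```

With only ten samples, p99 is simply the slowest token, which is why tail percentiles need many samples to be meaningful. The 100 prompts of up to 256 tokens in the benchmark call give far more data points than this toy list.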

Hardware-Specific Expectations

GPU              7B Throughput    7B Cold Start
RTX 3060 12GB    ~45 tok/s        ~4.5s
RTX 4070 12GB    ~80 tok/s        ~4.0s
RTX 4090 24GB    ~120 tok/s       ~3.5s
A100 80GB        ~180 tok/s       ~3.2s

These figures are approximate. Actual results vary with PCIe bandwidth, CPU, and storage speed.
