Architecture
Deep dive into ZSE's internal architecture, component design, and how the pieces fit together.
Overview
ZSE is designed with one goal: minimize time-to-first-token while maintaining high throughput. The architecture consists of five core components that work together to achieve this.
Core Components
zQuantize
The quantization engine that converts models to the optimized .zse format.
- INT4, INT8, NF4 quantization schemes
- Per-tensor scaling for minimal quality loss
- Optimized memory layout for sequential loading
- Preserves tokenizer and model config
zServe
OpenAI-compatible HTTP server for production deployments.
- Full OpenAI API compatibility
- Server-Sent Events streaming
- Batch processing support
- API key authentication
zInfer
Low-level inference engine powering both CLI and API.
- Direct tensor operations
- Async token generation
- Multi-GPU support
- Dynamic batching
zStream
Layer streaming for running models larger than available VRAM.
- GPU ↔ CPU layer offloading
- Automatic memory management
- Run 70B models on 24GB GPUs
- Minimal latency overhead
zKV
Quantized KV cache for 4× memory savings during generation.
- 8-bit KV cache quantization
- Paged attention support
- Longer context windows
- Sub-1% quality impact
Inference Flow
When a request comes in, here's what happens:
Request Received
zServe validates the request and adds it to the batch queue
Tokenization
Input text is converted to token IDs using the model's tokenizer
Forward Pass
zInfer runs the model layers, storing KV cache with zKV
Token Sampling
Apply temperature, top-p, and sample next token
Streaming Response
Tokens are decoded and streamed back via SSE
Memory Architecture
ZSE uses a three-tier memory hierarchy:
| Tier | Storage | Content | Access Speed |
|---|---|---|---|
| Hot | GPU VRAM | Active layers, KV cache | ~1 TB/s |
| Warm | System RAM | Offloaded layers (zStream) | ~50 GB/s |
| Cold | Disk (.zse) | Model weights | ~3 GB/s (NVMe) |
.zse format is designed for memory-mapped loading, which means the OS kernel efficiently pages weights from disk as needed.