Core Concepts

Architecture

Deep dive into ZSE's internal architecture, component design, and how the pieces fit together.

Overview

ZSE is designed with one goal: minimize time-to-first-token while maintaining high throughput. The architecture consists of five core components that work together to achieve this.

zQuantize

→

.zse File

zServe

↔

zInfer

zStream

zKV

Core Components

zQuantize

The quantization engine that converts models to the optimized .zse format.

INT4, INT8, NF4 quantization schemes
Per-tensor scaling for minimal quality loss
Optimized memory layout for sequential loading
Preserves tokenizer and model config

zServe

OpenAI-compatible HTTP server for production deployments.

Full OpenAI API compatibility
Server-Sent Events streaming
Batch processing support
API key authentication

zInfer

Low-level inference engine powering both CLI and API.

Direct tensor operations
Async token generation
Multi-GPU support
Dynamic batching

zStream

Layer streaming for running models larger than available VRAM.

GPU ↔ CPU layer offloading
Automatic memory management
Run 70B models on 24GB GPUs
Minimal latency overhead

zKV

Quantized KV cache for 4× memory savings during generation.

8-bit KV cache quantization
Paged attention support
Longer context windows
Sub-1% quality impact

Inference Flow

When a request comes in, here's what happens:

Request Received

zServe validates the request and adds it to the batch queue

Tokenization

Input text is converted to token IDs using the model's tokenizer

Forward Pass

zInfer runs the model layers, storing KV cache with zKV

Token Sampling

Apply temperature, top-p, and sample next token

Streaming Response

Tokens are decoded and streamed back via SSE

Memory Architecture

ZSE uses a three-tier memory hierarchy:

Tier	Storage	Content	Access Speed
Hot	GPU VRAM	Active layers, KV cache	~1 TB/s
Warm	System RAM	Offloaded layers (zStream)	~50 GB/s
Cold	Disk (.zse)	Model weights	~3 GB/s (NVMe)

The .zse format is designed for memory-mapped loading, which means the OS kernel efficiently pages weights from disk as needed.

← First Model

Model Formats →