Core Concepts

Architecture

Deep dive into ZSE's internal architecture, component design, and how the pieces fit together.

Overview

ZSE is designed with one goal: minimize time-to-first-token while maintaining high throughput. The architecture consists of five core components that work together to achieve this.

zQuantize
.zse File
zServe
zInfer
zStream
+
zKV

Core Components

zQuantize

The quantization engine that converts models to the optimized .zse format.

  • INT4, INT8, NF4 quantization schemes
  • Per-tensor scaling for minimal quality loss
  • Optimized memory layout for sequential loading
  • Preserves tokenizer and model config

zServe

OpenAI-compatible HTTP server for production deployments.

  • Full OpenAI API compatibility
  • Server-Sent Events streaming
  • Batch processing support
  • API key authentication

zInfer

Low-level inference engine powering both CLI and API.

  • Direct tensor operations
  • Async token generation
  • Multi-GPU support
  • Dynamic batching

zStream

Layer streaming for running models larger than available VRAM.

  • GPU ↔ CPU layer offloading
  • Automatic memory management
  • Run 70B models on 24GB GPUs
  • Minimal latency overhead

zKV

Quantized KV cache for 4× memory savings during generation.

  • 8-bit KV cache quantization
  • Paged attention support
  • Longer context windows
  • Sub-1% quality impact

Inference Flow

When a request comes in, here's what happens:

1

Request Received

zServe validates the request and adds it to the batch queue

2

Tokenization

Input text is converted to token IDs using the model's tokenizer

3

Forward Pass

zInfer runs the model layers, storing KV cache with zKV

4

Token Sampling

Apply temperature, top-p, and sample next token

5

Streaming Response

Tokens are decoded and streamed back via SSE

Memory Architecture

ZSE uses a three-tier memory hierarchy:

TierStorageContentAccess Speed
HotGPU VRAMActive layers, KV cache~1 TB/s
WarmSystem RAMOffloaded layers (zStream)~50 GB/s
ColdDisk (.zse)Model weights~3 GB/s (NVMe)
The .zse format is designed for memory-mapped loading, which means the OS kernel efficiently pages weights from disk as needed.