Core Concepts

Memory Management

Understanding and optimizing memory usage for model inference.

Overview

Memory management is critical for running large language models. ZSE provides multiple strategies for fitting larger models and longer contexts into the memory you have.

4-bit Weights

4x reduction in model memory

4-bit KV Cache

4x longer context windows

Memory Tiers

GPU → CPU → Disk overflow

GPU Memory

GPU memory is used for three main components:

```text
Total GPU Memory
├── Model Weights (~60-80%)
│   └── Quantized weights + embedding layers
├── KV Cache (~15-30%)
│   └── Attention key-value pairs for context
├── Activations (~5-10%)
│   └── Intermediate computations during inference
└── CUDA Overhead (~2-5%)
    └── CUDA context, kernels, buffers
```

Memory Estimation

Estimate memory requirements:

```text
Model Memory = (Parameters × Bits per Param) / 8

Examples (4-bit quantized):
  7B params  × 4 bits / 8 = 3.5 GB
  14B params × 4 bits / 8 = 7.0 GB
  70B params × 4 bits / 8 = 35 GB

KV Cache Memory = 2 × Layers × Hidden × Context × Bytes per KV

Example (7B model, 8K context, FP16 KV):
  2 × 32 × 4096 × 8192 × 2 = 4.3 GB
```
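The arithmetic above is easy to script. A minimal sketch that reproduces it (the function names here are illustrative, not part of the ZSE API):

```python
def model_memory_gb(params_billion, bits_per_param=4):
    """Weight memory: parameters × bits per parameter / 8 bits per byte."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

def kv_cache_gb(layers, hidden, context, bytes_per_kv=2):
    """KV cache: 2 (key + value) × layers × hidden × context × bytes per element."""
    return 2 * layers * hidden * context * bytes_per_kv / 1e9

print(model_memory_gb(7))           # 3.5 GB for a 4-bit 7B model
print(kv_cache_gb(32, 4096, 8192))  # ~4.3 GB for 8K context, FP16 KV
```

Summing the two gives a lower bound; leave headroom for activations and CUDA overhead (roughly 10-15% per the breakdown above).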

Checking Memory

```bash
# Check available GPU memory
zse hardware

# Check model memory requirements
zse info model.zse

# Monitor during inference
watch -n1 nvidia-smi
```

```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# Get memory info
mem = model.memory_info()
print(f"Model: {mem['model_size'] / 1e9:.1f} GB")
print(f"KV Cache: {mem['kv_cache_size'] / 1e9:.1f} GB")
print(f"Free GPU: {mem['gpu_free'] / 1e9:.1f} GB")
```

Model Sizing

Choose the right model size for your hardware:

| GPU VRAM | Max Model (NF4) | Context (FP16 KV) |
|----------|-----------------|-------------------|
| 8 GB     | 7B              | 4K                |
| 12 GB    | 14B             | 8K                |
| 16 GB    | 14B             | 16K               |
| 24 GB    | 34B             | 32K               |
| 48 GB    | 70B             | 32K               |
| 80 GB    | 70B             | 128K              |
With 4-bit KV cache, you can double the context length in the table above.
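For automation, the sizing table can be encoded as data and queried. The tuples below simply mirror the rows above; `recommend` is an illustrative helper, not a ZSE function:

```python
# (VRAM in GB, max model at NF4, context at FP16 KV) — rows from the table above
SIZING = [
    (8, "7B", "4K"),
    (12, "14B", "8K"),
    (16, "14B", "16K"),
    (24, "34B", "32K"),
    (48, "70B", "32K"),
    (80, "70B", "128K"),
]

def recommend(vram_gb):
    """Return the largest table row whose VRAM requirement fits, or None."""
    fits = [row for row in SIZING if row[0] <= vram_gb]
    return fits[-1] if fits else None

print(recommend(24))  # -> (24, '34B', '32K')
print(recommend(10))  # -> (8, '7B', '4K'): rounds down to the nearest row
```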

Memory Tiers

ZSE supports tiered memory for running models larger than GPU memory:

```text
┌─────────────────────────────────────────────────────────┐
│ Tier 1: GPU Memory (fastest)                            │
│   - Model weights (always)                              │
│   - Active KV cache                                     │
│   - Current activations                                 │
├─────────────────────────────────────────────────────────┤
│ Tier 2: CPU Memory (slower)                             │
│   - Overflow KV cache                                   │
│   - Inactive model layers (offloading)                  │
├─────────────────────────────────────────────────────────┤
│ Tier 3: Disk (slowest)                                  │
│   - Cold KV cache                                       │
│   - Very long context overflow                          │
└─────────────────────────────────────────────────────────┘
```
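The spill behavior can be modeled as a simple waterfall: each tier absorbs what it can and passes the remainder down. A minimal sketch of that logic (not ZSE internals):

```python
def place_kv(total_gb, tiers):
    """Fill tiers in order (GPU → CPU → disk); return GB placed in each."""
    placement = {}
    remaining = total_gb
    for name, capacity_gb in tiers:
        used = min(remaining, capacity_gb)
        placement[name] = used
        remaining -= used
    if remaining > 0:
        raise MemoryError("KV cache exceeds all configured tiers")
    return placement

# A 10 GB cache against a 6 GB GPU tier spills 4 GB to CPU
print(place_kv(10, [("gpu", 6), ("cpu", 8), ("disk", 100)]))
# -> {'gpu': 6, 'cpu': 4, 'disk': 0}
```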

CPU Offloading

```bash
# Offload some layers to CPU
zse serve model.zse --offload-layers 8

# Auto-detect optimal offload
zse serve model.zse --offload auto
```

CPU offloading reduces throughput by 2-5x. Use it only when GPU memory is insufficient.

Disk-Based KV Cache

zse.yaml

```yaml
kv_cache:
  tiers:
    - type: gpu
      size: auto       # Use available GPU memory
    - type: cpu
      size: 32GB       # Overflow to system RAM
    - type: disk
      path: /tmp/zse_cache
      size: 100GB      # For very long contexts
```

Optimization Tips

  • Use NF4 quantization for best quality/size ratio
  • Enable 4-bit KV cache for long contexts
  • Set max_context to your actual needs
  • Use prompt caching for repeated prefixes
  • Monitor with nvidia-smi during development

Handling OOM Errors

```bash
# If you get CUDA OOM errors:

# 1. Reduce context length
zse serve model.zse --max-context 2048

# 2. Enable KV cache compression
zse serve model.zse --kv-quant int4

# 3. Reduce batch size
zse serve model.zse --max-batch 8

# 4. Try a smaller model
zse convert Qwen/Qwen2.5-3B-Instruct -o qwen-3b.zse
```
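The same ladder can be expressed as a rough planning sketch: apply mitigations in order until the memory estimate fits. The shrink factors below are illustrative assumptions, not measured numbers (and the `--kv-quant` factor really applies only to the KV-cache share of memory):

```python
# Illustrative shrink factors for each mitigation step above
MITIGATIONS = [
    ("--max-context 2048", 0.5),   # assume halving context roughly halves KV cache
    ("--kv-quant int4", 0.25),     # 4-bit KV vs FP16
    ("--max-batch 8", 0.5),        # smaller batch, fewer concurrent caches
]

def plan(required_gb, available_gb):
    """Apply mitigations in order until the estimate fits; None if it never does."""
    applied = []
    for flag, factor in MITIGATIONS:
        if required_gb <= available_gb:
            break
        required_gb *= factor
        applied.append(flag)
    return applied if required_gb <= available_gb else None

print(plan(8, 6))  # -> ['--max-context 2048']: one step is enough here
```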

Memory Benchmarking

```bash
# Benchmark memory usage
zse benchmark model.zse --metric memory

# Profile memory over time
zse serve model.zse --profile-memory --profile-output memory.json
```