Core Concepts

Memory Management

Understanding and optimizing memory usage for model inference.

Overview

Memory management is critical for running large language models. ZSE provides multiple strategies for fitting larger models and longer contexts into the memory you have.

4-bit Weights

4x reduction in model memory

4-bit KV Cache

4x longer context windows

Memory Tiers

GPU → CPU → Disk overflow

GPU Memory

GPU memory is used for three main components:

```text
Total GPU Memory
├── Model Weights (~60-80%)
│   └── Quantized weights + embedding layers
├── KV Cache (~15-30%)
│   └── Attention key-value pairs for context
├── Activations (~5-10%)
│   └── Intermediate computations during inference
└── CUDA Overhead (~2-5%)
    └── CUDA context, kernels, buffers
```

Memory Estimation

Estimate memory requirements:

```text
Model Memory = (Parameters × Bits per Param) / 8

Examples (4-bit quantized):
  7B params  × 4 bits / 8 = 3.5 GB
  14B params × 4 bits / 8 = 7.0 GB
  70B params × 4 bits / 8 = 35 GB

KV Cache Memory = 2 × Layers × Hidden × Context × Bytes per KV

Example (7B model, 8K context, FP16 KV):
  2 × 32 × 4096 × 8192 × 2 = 4.3 GB
```
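The arithmetic above is easy to script. A minimal sketch that reproduces it (the function names here are illustrative, not part of the ZSE API):

```python
def model_memory_gb(params_billion, bits_per_param=4):
    """Weight memory: parameters × bits per parameter / 8 bits per byte."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

def kv_cache_gb(layers, hidden, context, bytes_per_kv=2):
    """KV cache: 2 (key + value) × layers × hidden × context × bytes per element."""
    return 2 * layers * hidden * context * bytes_per_kv / 1e9

print(model_memory_gb(7))           # 3.5 GB for a 4-bit 7B model
print(kv_cache_gb(32, 4096, 8192))  # ~4.3 GB for 8K context, FP16 KV
```

Summing the two gives a lower bound; leave headroom for activations and CUDA overhead (roughly 10-15% per the breakdown above).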

Checking Memory

```bash
# Check available GPU memory
zse hardware

# Check model memory requirements
zse info model.zse

# Monitor during inference
watch -n1 nvidia-smi
```

```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# Get memory info
mem = model.memory_info()
print(f"Model: {mem['model_size'] / 1e9:.1f} GB")
print(f"KV Cache: {mem['kv_cache_size'] / 1e9:.1f} GB")
print(f"Free GPU: {mem['gpu_free'] / 1e9:.1f} GB")
```

Model Sizing

Choose the right model size for your hardware:

| GPU VRAM | Max Model (NF4) | Context (FP16 KV) |
|----------|-----------------|-------------------|
| 8 GB     | 7B              | 4K                |
| 12 GB    | 14B             | 8K                |
| 16 GB    | 14B             | 16K               |
| 24 GB    | 34B             | 32K               |
| 48 GB    | 70B             | 32K               |
| 80 GB    | 70B             | 128K              |
With 4-bit KV cache, you can double the context length in the table above.
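For automation, the sizing table can be encoded as data and queried. The tuples below simply mirror the rows above; `recommend` is an illustrative helper, not a ZSE function:

```python
# (VRAM in GB, max model at NF4, context at FP16 KV) — rows from the table above
SIZING = [
    (8, "7B", "4K"),
    (12, "14B", "8K"),
    (16, "14B", "16K"),
    (24, "34B", "32K"),
    (48, "70B", "32K"),
    (80, "70B", "128K"),
]

def recommend(vram_gb):
    """Return the largest table row whose VRAM requirement fits, or None."""
    fits = [row for row in SIZING if row[0] <= vram_gb]
    return fits[-1] if fits else None

print(recommend(24))  # -> (24, '34B', '32K')
print(recommend(10))  # -> (8, '7B', '4K'): rounds down to the nearest row
```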

Memory Tiers

ZSE supports tiered memory for running models larger than GPU memory:

```text
┌─────────────────────────────────────────────────────────┐
│ Tier 1: GPU Memory (fastest)                            │
│   - Model weights (always)                              │
│   - Active KV cache                                     │
│   - Current activations                                 │
├─────────────────────────────────────────────────────────┤
│ Tier 2: CPU Memory (slower)                             │
│   - Overflow KV cache                                   │
│   - Inactive model layers (offloading)                  │
├─────────────────────────────────────────────────────────┤
│ Tier 3: Disk (slowest)                                  │
│   - Cold KV cache                                       │
│   - Very long context overflow                          │
└─────────────────────────────────────────────────────────┘
```
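The spill behavior can be modeled as a simple waterfall: each tier absorbs what it can and passes the remainder down. A minimal sketch of that logic (not ZSE internals):

```python
def place_kv(total_gb, tiers):
    """Fill tiers in order (GPU → CPU → disk); return GB placed in each."""
    placement = {}
    remaining = total_gb
    for name, capacity_gb in tiers:
        used = min(remaining, capacity_gb)
        placement[name] = used
        remaining -= used
    if remaining > 0:
        raise MemoryError("KV cache exceeds all configured tiers")
    return placement

# A 10 GB cache against a 6 GB GPU tier spills 4 GB to CPU
print(place_kv(10, [("gpu", 6), ("cpu", 8), ("disk", 100)]))
# -> {'gpu': 6, 'cpu': 4, 'disk': 0}
```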

CPU Offloading

```bash
# Offload some layers to CPU
zse serve model.zse --offload-layers 8

# Auto-detect optimal offload
zse serve model.zse --offload auto
```

CPU offloading reduces throughput by 2-5x. Use it only when GPU memory is insufficient.

Disk-Based KV Cache

zse.yaml

```yaml
kv_cache:
  tiers:
    - type: gpu
      size: auto       # Use available GPU memory
    - type: cpu
      size: 32GB       # Overflow to system RAM
    - type: disk
      path: /tmp/zse_cache
      size: 100GB      # For very long contexts
```

Optimization Tips

  • Use NF4 quantization for best quality/size ratio
  • Enable 4-bit KV cache for long contexts
  • Set max_context to your actual needs
  • Use prompt caching for repeated prefixes
  • Monitor with nvidia-smi during development

Handling OOM Errors

```bash
# If you get CUDA OOM errors:

# 1. Reduce context length
zse serve model.zse --max-context 2048

# 2. Enable KV cache compression
zse serve model.zse --kv-quant int4

# 3. Reduce batch size
zse serve model.zse --max-batch 8

# 4. Try a smaller model
zse convert Qwen/Qwen2.5-3B-Instruct -o qwen-3b.zse
```
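The same ladder can be expressed as a rough planning sketch: apply mitigations in order until the memory estimate fits. The shrink factors below are illustrative assumptions, not measured numbers (and the `--kv-quant` factor really applies only to the KV-cache share of memory):

```python
# Illustrative shrink factors for each mitigation step above
MITIGATIONS = [
    ("--max-context 2048", 0.5),   # assume halving context roughly halves KV cache
    ("--kv-quant int4", 0.25),     # 4-bit KV vs FP16
    ("--max-batch 8", 0.5),        # smaller batch, fewer concurrent caches
]

def plan(required_gb, available_gb):
    """Apply mitigations in order until the estimate fits; None if it never does."""
    applied = []
    for flag, factor in MITIGATIONS:
        if required_gb <= available_gb:
            break
        required_gb *= factor
        applied.append(flag)
    return applied if required_gb <= available_gb else None

print(plan(8, 6))  # -> ['--max-context 2048']: one step is enough here
```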

Memory Benchmarking

```bash
# Benchmark memory usage
zse benchmark model.zse --metric memory

# Profile memory over time
zse serve model.zse --profile-memory --profile-output memory.json
```