Memory Management
Understanding and optimizing memory usage for model inference.
Overview
Memory management is critical when running large language models. ZSE provides several strategies for fitting the largest possible model into the memory you have.
- 4-bit Weights: 4x reduction in model memory
- 4-bit KV Cache: 4x longer context windows
- Memory Tiers: GPU → CPU → Disk overflow
GPU Memory
GPU memory is used for three main components:
```text
Total GPU Memory
├── Model Weights (~60-80%)
│   └── Quantized weights + embedding layers
├── KV Cache (~15-30%)
│   └── Attention key-value pairs for context
├── Activations (~5-10%)
│   └── Intermediate computations during inference
└── CUDA Overhead (~2-5%)
    └── CUDA context, kernels, buffers
```

Memory Estimation
Estimate memory requirements:
```text
Model Memory = (Parameters × Bits per Param) / 8

Examples (4-bit quantized):
  7B params  × 4 bits / 8 = 3.5 GB
  14B params × 4 bits / 8 = 7.0 GB
  70B params × 4 bits / 8 = 35 GB

KV Cache Memory = 2 × Layers × Hidden × Context × Bytes per KV

Example (7B model, 8K context, FP16 KV):
  2 × 32 × 4096 × 8192 × 2 bytes = 4.3 GB
```

Checking Memory
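As a quick sanity check, the estimation formulas above can be scripted. This is a plain-Python sketch; the function names are illustrative and not part of the ZSE API:

```python
def model_memory_gb(params: float, bits_per_param: int) -> float:
    """Weight memory in GB: (parameters × bits per param) / 8 bytes."""
    return params * bits_per_param / 8 / 1e9

def kv_cache_gb(layers: int, hidden: int, context: int, bytes_per_kv: int) -> float:
    """KV cache in GB: 2 (K and V) × layers × hidden × context × bytes per value."""
    return 2 * layers * hidden * context * bytes_per_kv / 1e9

print(f"7B @ 4-bit:  {model_memory_gb(7e9, 4):.1f} GB")            # 3.5 GB
print(f"70B @ 4-bit: {model_memory_gb(70e9, 4):.1f} GB")           # 35.0 GB
print(f"KV (7B, 8K, FP16): {kv_cache_gb(32, 4096, 8192, 2):.1f} GB")  # 4.3 GB
```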
```bash
# Check available GPU memory
zse hardware

# Check model memory requirements
zse info model.zse

# Monitor during inference
watch -n1 nvidia-smi
```

```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# Get memory info
mem = model.memory_info()
print(f"Model: {mem['model_size'] / 1e9:.1f} GB")
print(f"KV Cache: {mem['kv_cache_size'] / 1e9:.1f} GB")
print(f"Free GPU: {mem['gpu_free'] / 1e9:.1f} GB")
```

Model Sizing
Choose the right model size for your hardware:
| GPU VRAM | Max Model (NF4) | Context (FP16 KV) |
|---|---|---|
| 8 GB | 7B | 4K |
| 12 GB | 14B | 8K |
| 16 GB | 14B | 16K |
| 24 GB | 34B | 32K |
| 48 GB | 70B | 32K |
| 80 GB | 70B | 128K |
With 4-bit KV cache, you can double the context length in the table above.
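The table can also be turned into a small lookup helper. This sketch simply encodes the rows above (NF4 weights, FP16 KV cache) and is not a ZSE API:

```python
# (VRAM in GB, max model at NF4, max context with FP16 KV)
SIZING = [
    (8, "7B", "4K"),
    (12, "14B", "8K"),
    (16, "14B", "16K"),
    (24, "34B", "32K"),
    (48, "70B", "32K"),
    (80, "70B", "128K"),
]

def recommend(vram_gb: float) -> tuple[str, str]:
    """Return the largest (model, context) row that fits in vram_gb."""
    rows = [r for r in SIZING if r[0] <= vram_gb]
    if not rows:
        raise ValueError("Below 8 GB VRAM: no row in the table applies")
    _, model, context = rows[-1]
    return model, context

print(recommend(24))  # ('34B', '32K')
print(recommend(10))  # ('7B', '4K')
```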
Memory Tiers
ZSE supports tiered memory for running models larger than GPU memory:
```text
┌────────────────────────────────────────┐
│ Tier 1: GPU Memory (fastest)           │
│ - Model weights (always)               │
│ - Active KV cache                      │
│ - Current activations                  │
├────────────────────────────────────────┤
│ Tier 2: CPU Memory (slower)            │
│ - Overflow KV cache                    │
│ - Inactive model layers (offloading)   │
├────────────────────────────────────────┤
│ Tier 3: Disk (slowest)                 │
│ - Cold KV cache                        │
│ - Very long context overflow           │
└────────────────────────────────────────┘
```

CPU Offloading
```bash
# Offload some layers to CPU
zse serve model.zse --offload-layers 8

# Auto-detect optimal offload
zse serve model.zse --offload auto
```

CPU offloading reduces throughput by 2-5x. Use it only when GPU memory is insufficient.
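As a rough way to choose a value for `--offload-layers`, you can work from the memory deficit, assuming the model's layers are about equal in size. This is a back-of-envelope sketch, not how ZSE's `--offload auto` mode actually decides:

```python
import math

def layers_to_offload(model_gb: float, free_gpu_gb: float, n_layers: int) -> int:
    """Layers to move to CPU so the remaining weights fit in free GPU memory."""
    deficit = model_gb - free_gpu_gb
    if deficit <= 0:
        return 0  # everything fits on the GPU
    per_layer = model_gb / n_layers
    return min(n_layers, math.ceil(deficit / per_layer))

# 7B NF4 model (~3.5 GB of weights) with only 2.5 GB free, 32 layers:
print(layers_to_offload(3.5, 2.5, 32))  # 10
```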
Disk-Based KV Cache
```yaml
# zse.yaml
kv_cache:
  tiers:
    - type: gpu
      size: auto           # Use available GPU memory
    - type: cpu
      size: 32GB           # Overflow to system RAM
    - type: disk
      path: /tmp/zse_cache
      size: 100GB          # For very long contexts
```

Optimization Tips
- Use NF4 quantization for the best quality/size ratio
- Enable 4-bit KV cache for long contexts
- Set `max_context` to what you actually need
- Use prompt caching for repeated prefixes
- Monitor with `nvidia-smi` during development
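For the last tip, nvidia-smi's query interface is convenient for scripting. The `--query-gpu`/`--format=csv` flags are standard nvidia-smi; the helper functions below are just a sketch:

```python
import subprocess

def parse_mem_csv(csv_text: str) -> list[tuple[int, int]]:
    """Parse '(used, total)' MiB pairs from nvidia-smi CSV (noheader,nounits)."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(x) for x in line.split(","))
        rows.append((used, total))
    return rows

def gpu_memory() -> list[tuple[int, int]]:
    """One (used, total) pair per GPU, in MiB."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_mem_csv(out)

print(parse_mem_csv("3541, 24576\n"))  # [(3541, 24576)]
```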
Handling OOM Errors
```bash
# If you get CUDA OOM errors:

# 1. Reduce context length
zse serve model.zse --max-context 2048

# 2. Enable KV cache compression
zse serve model.zse --kv-quant int4

# 3. Reduce batch size
zse serve model.zse --max-batch 8

# 4. Try a smaller model
zse convert Qwen/Qwen2.5-3B-Instruct -o qwen-3b.zse
```

Memory Benchmarking
```bash
# Benchmark memory usage
zse benchmark model.zse --metric memory

# Profile memory over time
zse serve model.zse --profile-memory --profile-output memory.json
```