Multi-GPU Support

Run larger models by distributing weights across multiple GPUs.

Overview

ZSE supports multi-GPU inference, allowing you to run models that don't fit on a single GPU by automatically sharding layers across available devices.

Auto Detection

Automatically detects available GPUs and VRAM

Model Sharding

Distributes layers evenly across GPUs

VRAM Balancing

Configurable memory limits per device

  • Automatic GPU count and memory detection
  • HuggingFace accelerate integration
  • Even layer distribution across GPUs
  • Configurable max_memory per GPU
  • Support for mixed GPU configurations

How It Works

Multi-GPU inference works by distributing model layers across available GPUs. ZSE uses HuggingFace's device_map="auto" under the hood for automatic layer distribution.

text
Model: Qwen 32B (64 layers)
Available: 2x A10G (24GB each)

Layer Distribution:
┌──────────────┐  ┌──────────────┐
│    GPU 0     │  │    GPU 1     │
│              │  │              │
│ Layers 0-31  │  │ Layers 32-63 │
│              │  │              │
│  ~12GB VRAM  │  │  ~12GB VRAM  │
└──────────────┘  └──────────────┘

Forward Pass:
Input → GPU 0 → Transfer → GPU 1 → Output

Inter-GPU communication adds some latency, but enables running models that wouldn't fit on a single GPU. The overhead is typically 5-15%.
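The even split shown in the diagram can be sketched as a simple partition. This is illustrative only, not ZSE's actual implementation; the `distribute_layers` helper is a hypothetical name, and it front-loads any remainder when layers don't divide evenly.

```python
def distribute_layers(num_layers, num_gpus):
    """Assign contiguous layer ranges to GPUs, front-loading any remainder."""
    base, extra = divmod(num_layers, num_gpus)
    assignment = {}
    start = 0
    for gpu in range(num_gpus):
        count = base + (1 if gpu < extra else 0)  # early GPUs absorb the remainder
        assignment[gpu] = list(range(start, start + count))
        start += count
    return assignment

# 64 layers across 2 GPUs: GPU 0 gets layers 0-31, GPU 1 gets layers 32-63
split = distribute_layers(64, 2)
```

Contiguous ranges matter here: the forward pass crosses the GPU boundary only once per token, which keeps the transfer overhead in the 5-15% range.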

Usage

Python API

python
from zse.engine.orchestrator import IntelligenceOrchestrator

# Auto-detect and use all available GPUs
orch = IntelligenceOrchestrator.multi_gpu("Qwen/Qwen2.5-Coder-7B-Instruct")
orch.load()  # Model automatically sharded across GPUs

# Or specify which GPUs to use
orch = IntelligenceOrchestrator.multi_gpu("model_name", gpu_ids=[0, 1])

# Check GPU info
info = IntelligenceOrchestrator.get_gpu_info()
print(f"GPUs: {info['count']}, Total VRAM: {info['total_memory'] / 1e9:.1f} GB")

CLI

bash
# Auto-detect GPUs and distribute model
zse serve model.zse --multi-gpu

# Specify GPU IDs
zse serve model.zse --gpus 0,1,2

# Set memory limits per GPU
zse serve model.zse --multi-gpu --max-memory "0:20GB,1:20GB"
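A `--max-memory` value in the form above could be parsed into a per-GPU dict along these lines. This is a sketch with a hypothetical `parse_max_memory` helper, not ZSE's actual CLI code.

```python
def parse_max_memory(spec):
    """Parse a spec like '0:20GB,1:20GB' into {0: '20GB', 1: '20GB'}."""
    limits = {}
    for entry in spec.split(","):
        gpu_id, limit = entry.split(":")
        limits[int(gpu_id)] = limit.strip()
    return limits

# "0:20GB,1:20GB" -> {0: "20GB", 1: "20GB"}
limits = parse_max_memory("0:20GB,1:20GB")
```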

Checking GPU Info

python
from zse.engine.orchestrator import IntelligenceOrchestrator

info = IntelligenceOrchestrator.get_gpu_info()
# Returns:
# {
#     "count": 2,
#     "devices": [
#         {"id": 0, "name": "NVIDIA A10G", "memory": 24576000000},
#         {"id": 1, "name": "NVIDIA A10G", "memory": 24576000000}
#     ],
#     "total_memory": 49152000000
# }
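The `count` and `total_memory` fields follow directly from the `devices` list. A minimal sketch of that derivation, using a hypothetical `summarize_gpus` helper (not part of the ZSE API):

```python
def summarize_gpus(devices):
    """Build the info dict shape shown above from a per-device list."""
    return {
        "count": len(devices),
        "devices": devices,
        "total_memory": sum(d["memory"] for d in devices),  # bytes
    }

devices = [
    {"id": 0, "name": "NVIDIA A10G", "memory": 24576000000},
    {"id": 1, "name": "NVIDIA A10G", "memory": 24576000000},
]
info = summarize_gpus(devices)  # count: 2, total_memory: 49152000000
```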

VRAM Distribution

ZSE automatically balances VRAM usage across GPUs. You can also configure maximum memory per GPU for fine-grained control.

python
# Set max memory per GPU (useful for reserving memory for KV cache)
orch = IntelligenceOrchestrator.multi_gpu(
    "model_name",
    max_memory={
        0: "20GB",  # Leave 4GB for KV cache on GPU 0
        1: "22GB",  # Leave 2GB for KV cache on GPU 1
    }
)

Reserve some VRAM for the KV cache and CUDA kernels. Using 100% of available memory will cause out-of-memory errors during inference.
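A small helper can build such a `max_memory` dict from per-GPU VRAM totals. Illustrative only; `reserve_headroom` is a hypothetical name, not part of the ZSE API.

```python
def reserve_headroom(vram_gb_per_gpu, reserve_gb=4):
    """Build a max_memory dict leaving reserve_gb on each GPU for KV cache/kernels."""
    return {gpu: f"{total - reserve_gb}GB" for gpu, total in vram_gb_per_gpu.items()}

# 2x 24GB A10Gs with 4GB headroom each -> {0: "20GB", 1: "20GB"}
limits = reserve_headroom({0: 24, 1: 24})
```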

Benchmarks

Verified on Modal with 2x A10G GPUs:

Test                        Result
GPU Detection               2 GPUs detected ✓
Model Load (Qwen 7B FP16)   80s ✓
VRAM Distribution           GPU 0: 6.22 GB, GPU 1: 7.96 GB ✓
Generation Speed            100 tokens @ 15.0 tok/s ✓

Multi-GPU is most beneficial for models larger than 30B parameters, where the memory savings outweigh the inter-GPU communication overhead.
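As a back-of-envelope check for when multi-GPU pays off, weight memory alone gives a rough minimum GPU count. This is a sketch, not a sizing tool: real deployments also need room for the KV cache and activations, and quantization shrinks the weight footprint well below the FP16 figures used here.

```python
import math

def min_gpus_needed(params_billion, bytes_per_param, vram_per_gpu_gb, headroom_gb=4):
    """Rough minimum GPU count from weight memory alone."""
    weight_gb = params_billion * bytes_per_param
    usable_gb = vram_per_gpu_gb - headroom_gb  # leave headroom for KV cache/kernels
    return math.ceil(weight_gb / usable_gb)

# 7B in FP16 (2 bytes/param) on 24GB A10Gs: 14GB of weights -> 1 GPU
# 32B in FP16 on 24GB A10Gs: 64GB of weights -> 4 GPUs
small = min_gpus_needed(7, 2, 24)
large = min_gpus_needed(32, 2, 24)
```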