Multi-GPU Support

Run larger models by distributing weights across multiple GPUs.

Overview

ZSE supports multi-GPU inference, allowing you to run models that don't fit on a single GPU by automatically sharding layers across available devices.

Auto Detection

Automatically detects available GPUs and VRAM

Model Sharding

Distributes layers evenly across GPUs

VRAM Balancing

Configurable memory limits per device

  • Automatic GPU count and memory detection
  • HuggingFace accelerate integration
  • Even layer distribution across GPUs
  • Configurable max_memory per GPU
  • Support for mixed GPU configurations

How It Works

Multi-GPU inference works by distributing model layers across available GPUs. ZSE uses HuggingFace's device_map="auto" under the hood for automatic layer distribution.

text
Model: Qwen 32B (64 layers)
Available: 2x A10G (24GB each)

Layer Distribution:
┌──────────────┐  ┌──────────────┐
│    GPU 0     │  │    GPU 1     │
│              │  │              │
│ Layers 0-31  │  │ Layers 32-63 │
│              │  │              │
│  ~12GB VRAM  │  │  ~12GB VRAM  │
└──────────────┘  └──────────────┘

Forward Pass:
Input → GPU 0 → Transfer → GPU 1 → Output

Inter-GPU communication adds some latency, but enables running models that wouldn't fit on a single GPU. The overhead is typically 5-15%.
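The even split shown in the diagram can be sketched as a simple partition. This is illustrative only, not ZSE's actual implementation; the `distribute_layers` helper is a hypothetical name, and it front-loads any remainder when layers don't divide evenly.

```python
def distribute_layers(num_layers, num_gpus):
    """Assign contiguous layer ranges to GPUs, front-loading any remainder."""
    base, extra = divmod(num_layers, num_gpus)
    assignment = {}
    start = 0
    for gpu in range(num_gpus):
        count = base + (1 if gpu < extra else 0)  # early GPUs absorb the remainder
        assignment[gpu] = list(range(start, start + count))
        start += count
    return assignment

# 64 layers across 2 GPUs: GPU 0 gets layers 0-31, GPU 1 gets layers 32-63
split = distribute_layers(64, 2)
```

Contiguous ranges matter here: the forward pass crosses the GPU boundary only once per token, which keeps the transfer overhead in the 5-15% range.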

Usage

Python API

python
from zse.engine.orchestrator import IntelligenceOrchestrator

# Auto-detect and use all available GPUs
orch = IntelligenceOrchestrator.multi_gpu("Qwen/Qwen2.5-Coder-7B-Instruct")
orch.load()  # Model automatically sharded across GPUs

# Or specify which GPUs to use
orch = IntelligenceOrchestrator.multi_gpu("model_name", gpu_ids=[0, 1])

# Check GPU info
info = IntelligenceOrchestrator.get_gpu_info()
print(f"GPUs: {info['count']}, Total VRAM: {info['total_memory'] / 1e9:.1f} GB")

CLI

bash
# Auto-detect GPUs and distribute model
zse serve model.zse --multi-gpu

# Specify GPU IDs
zse serve model.zse --gpus 0,1,2

# Set memory limits per GPU
zse serve model.zse --multi-gpu --max-memory "0:20GB,1:20GB"
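A `--max-memory` value in the form above could be parsed into a per-GPU dict along these lines. This is a sketch with a hypothetical `parse_max_memory` helper, not ZSE's actual CLI code.

```python
def parse_max_memory(spec):
    """Parse a spec like '0:20GB,1:20GB' into {0: '20GB', 1: '20GB'}."""
    limits = {}
    for entry in spec.split(","):
        gpu_id, limit = entry.split(":")
        limits[int(gpu_id)] = limit.strip()
    return limits

# "0:20GB,1:20GB" -> {0: "20GB", 1: "20GB"}
limits = parse_max_memory("0:20GB,1:20GB")
```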

Checking GPU Info

python
from zse.engine.orchestrator import IntelligenceOrchestrator

info = IntelligenceOrchestrator.get_gpu_info()
# Returns:
# {
#     "count": 2,
#     "devices": [
#         {"id": 0, "name": "NVIDIA A10G", "memory": 24576000000},
#         {"id": 1, "name": "NVIDIA A10G", "memory": 24576000000}
#     ],
#     "total_memory": 49152000000
# }
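The `count` and `total_memory` fields follow directly from the `devices` list. A minimal sketch of that derivation, using a hypothetical `summarize_gpus` helper (not part of the ZSE API):

```python
def summarize_gpus(devices):
    """Build the info dict shape shown above from a per-device list."""
    return {
        "count": len(devices),
        "devices": devices,
        "total_memory": sum(d["memory"] for d in devices),  # bytes
    }

devices = [
    {"id": 0, "name": "NVIDIA A10G", "memory": 24576000000},
    {"id": 1, "name": "NVIDIA A10G", "memory": 24576000000},
]
info = summarize_gpus(devices)  # count: 2, total_memory: 49152000000
```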

VRAM Distribution

ZSE automatically balances VRAM usage across GPUs. You can also configure maximum memory per GPU for fine-grained control.

python
# Set max memory per GPU (useful for reserving memory for KV cache)
orch = IntelligenceOrchestrator.multi_gpu(
    "model_name",
    max_memory={
        0: "20GB",  # Leave 4GB for KV cache on GPU 0
        1: "22GB",  # Leave 2GB for KV cache on GPU 1
    }
)

Reserve some VRAM for the KV cache and CUDA kernels. Using 100% of available memory will cause out-of-memory errors during inference.
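A small helper can build such a `max_memory` dict from per-GPU VRAM totals. Illustrative only; `reserve_headroom` is a hypothetical name, not part of the ZSE API.

```python
def reserve_headroom(vram_gb_per_gpu, reserve_gb=4):
    """Build a max_memory dict leaving reserve_gb on each GPU for KV cache/kernels."""
    return {gpu: f"{total - reserve_gb}GB" for gpu, total in vram_gb_per_gpu.items()}

# 2x 24GB A10Gs with 4GB headroom each -> {0: "20GB", 1: "20GB"}
limits = reserve_headroom({0: 24, 1: 24})
```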

Benchmarks

Verified on Modal with 2x A10G GPUs:

Test                        Result
GPU Detection               2 GPUs detected ✓
Model Load (Qwen 7B FP16)   80s ✓
VRAM Distribution           GPU 0: 6.22 GB, GPU 1: 7.96 GB ✓
Generation Speed            100 tokens @ 15.0 tok/s ✓

Multi-GPU is most beneficial for models larger than 30B parameters, where the memory savings outweigh the inter-GPU communication overhead.
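As a back-of-envelope check for when multi-GPU pays off, weight memory alone gives a rough minimum GPU count. This is a sketch, not a sizing tool: real deployments also need room for the KV cache and activations, and quantization shrinks the weight footprint well below the FP16 figures used here.

```python
import math

def min_gpus_needed(params_billion, bytes_per_param, vram_per_gpu_gb, headroom_gb=4):
    """Rough minimum GPU count from weight memory alone."""
    weight_gb = params_billion * bytes_per_param
    usable_gb = vram_per_gpu_gb - headroom_gb  # leave headroom for KV cache/kernels
    return math.ceil(weight_gb / usable_gb)

# 7B in FP16 (2 bytes/param) on 24GB A10Gs: 14GB of weights -> 1 GPU
# 32B in FP16 on 24GB A10Gs: 64GB of weights -> 4 GPUs
small = min_gpus_needed(7, 2, 24)
large = min_gpus_needed(32, 2, 24)
```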