Multi-GPU Support
Run larger models by distributing weights across multiple GPUs.
Overview
ZSE supports multi-GPU inference, allowing you to run models that don't fit on a single GPU by automatically sharding layers across available devices.
- Auto Detection: automatic GPU count and VRAM detection
- Model Sharding: even layer distribution across GPUs (HuggingFace accelerate integration)
- VRAM Balancing: configurable max_memory per GPU
- Support for mixed GPU configurations
How It Works
Multi-GPU inference works by distributing model layers across available GPUs. ZSE uses HuggingFace's `device_map="auto"` under the hood for automatic layer distribution.
```text
Model: Qwen 32B (64 layers)
Available: 2x A10G (24GB each)

Layer Distribution:
┌─────────────┐   ┌─────────────┐
│    GPU 0    │   │    GPU 1    │
│             │   │             │
│ Layers 0-31 │   │ Layers 32-63│
│             │   │             │
│ ~12GB VRAM  │   │ ~12GB VRAM  │
└─────────────┘   └─────────────┘

Forward Pass:
Input → GPU 0 → Transfer → GPU 1 → Output
```

Inter-GPU communication adds some latency, but it enables running models that wouldn't fit on a single GPU. The overhead is typically 5-15%.
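For reference, the sketch below shows the same mechanism with plain transformers and accelerate, which is roughly what ZSE builds on. This is an illustrative sketch, not ZSE's loader code; the model name and per-GPU memory caps are placeholder assumptions.

```python
# Illustrative sketch only: sharding a model with plain transformers/accelerate.
# ZSE wraps this mechanism; the model name and memory caps here are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",                    # accelerate splits layers across visible GPUs
    max_memory={0: "20GiB", 1: "20GiB"},  # optional per-GPU caps
    torch_dtype="auto",
)

# hf_device_map records which device each module was placed on
print(model.hf_device_map)
```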
Usage
Python API
```python
from zse.engine.orchestrator import IntelligenceOrchestrator

# Auto-detect and use all available GPUs
orch = IntelligenceOrchestrator.multi_gpu("Qwen/Qwen2.5-Coder-7B-Instruct")
orch.load()  # Model automatically sharded across GPUs

# Or specify which GPUs to use
orch = IntelligenceOrchestrator.multi_gpu("model_name", gpu_ids=[0, 1])

# Check GPU info
info = IntelligenceOrchestrator.get_gpu_info()
print(f"GPUs: {info['count']}, Total VRAM: {info['total_memory'] / 1e9:.1f} GB")
```

CLI
```bash
# Auto-detect GPUs and distribute model
zse serve model.zse --multi-gpu

# Specify GPU IDs
zse serve model.zse --gpus 0,1,2

# Set memory limits per GPU
zse serve model.zse --multi-gpu --max-memory "0:20GB,1:20GB"
```

Checking GPU Info
```python
from zse.engine.orchestrator import IntelligenceOrchestrator

info = IntelligenceOrchestrator.get_gpu_info()
# Returns:
# {
#     "count": 2,
#     "devices": [
#         {"id": 0, "name": "NVIDIA A10G", "memory": 24576000000},
#         {"id": 1, "name": "NVIDIA A10G", "memory": 24576000000}
#     ],
#     "total_memory": 49152000000
# }
```
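If you want to see what this kind of detection looks like without ZSE, plain PyTorch exposes the same information. The helper below is an illustrative equivalent, not ZSE's implementation:

```python
import torch

def detect_gpus():
    """Illustrative GPU detection with plain PyTorch (not ZSE's internal code)."""
    devices = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        devices.append({"id": i, "name": props.name, "memory": props.total_memory})
    return {
        "count": len(devices),
        "devices": devices,
        "total_memory": sum(d["memory"] for d in devices),
    }
```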
VRAM Distribution
ZSE automatically balances VRAM usage across GPUs. You can also configure a maximum memory limit per GPU for fine-grained control.
```python
# Set max memory per GPU (useful for reserving memory for KV cache)
orch = IntelligenceOrchestrator.multi_gpu(
    "model_name",
    max_memory={
        0: "20GB",  # Leave 4GB for KV cache on GPU 0
        1: "22GB",  # Leave 2GB for KV cache on GPU 1
    },
)
```

Reserve some VRAM for the KV cache and CUDA kernels. Using 100% of available memory will cause out-of-memory errors during inference.
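To choose how much headroom to leave, you can estimate the KV cache footprint from the model's attention configuration. The numbers below are illustrative assumptions for a 7B-class model with grouped-query attention, not values taken from ZSE:

```python
# Rough KV cache estimate: 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes.
# All figures are assumptions for a 7B-class model; substitute your model's config values.
num_layers = 28
num_kv_heads = 4
head_dim = 128
seq_len = 8192        # longest context you plan to serve
batch_size = 4        # concurrent sequences
bytes_per_elem = 2    # FP16/BF16

kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
print(f"Estimated KV cache: ~{kv_cache_bytes / 1e9:.1f} GB")  # ~1.9 GB for these settings
```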
Benchmarks
Verified on Modal with 2x A10G GPUs:
| Test | Result |
|---|---|
| GPU Detection | 2 GPUs detected ✓ |
| Model Load (Qwen 7B FP16) | 80s ✓ |
| VRAM Distribution | GPU 0: 6.22 GB, GPU 1: 7.96 GB ✓ |
| Generation Speed | 100 tokens @ 15.0 tok/s ✓ |
Multi-GPU is most beneficial for models larger than 30B parameters, where the memory savings outweigh the inter-GPU communication overhead.