# Running 70B Models on a 24GB GPU with ZSE
Yes, you can run 70B-parameter models on a single RTX 4090. Here's how.
## The Challenge
A 70B model in FP16 needs ~140GB of VRAM (2 bytes per parameter). Even with 4-bit quantization, the weights alone take ~35GB, still well beyond a 24GB card.
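The arithmetic behind those numbers is simple: FP16 stores 2 bytes per parameter, while 4-bit formats store about half a byte.

```python
# Back-of-the-envelope VRAM math for a 70B-parameter model.
params = 70e9

fp16_gb = params * 2 / 1e9    # 2 bytes per parameter
nf4_gb = params * 0.5 / 1e9   # ~0.5 bytes per parameter at 4-bit

print(f"FP16: {fp16_gb:.0f} GB, NF4: {nf4_gb:.0f} GB")  # FP16: 140 GB, NF4: 35 GB
```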
## ZSE's Solution

No single trick gets you under 24GB, so ZSE combines several:
### 1. NF4 Quantization

```bash
zse convert meta-llama/Llama-3.1-70B-Instruct -o llama-70b.zse --quant nf4
```
NF4 (4-bit NormalFloat) quantization brings the weights down to ~35GB.
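The idea behind NF4: weights in trained networks are roughly normally distributed, so instead of 16 evenly spaced 4-bit levels, it uses levels placed at quantiles of a normal distribution, and quantizes small blocks of weights against each block's absolute maximum. A pure-Python sketch of that idea (the levels here are derived from normal quantiles for illustration, not the exact NF4 codebook):

```python
from statistics import NormalDist

def make_levels(bits=4):
    """Levels at evenly spaced quantiles of N(0,1), rescaled into [-1, 1].
    A simplified stand-in for the exact NF4 codebook."""
    n = 2 ** bits
    zs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    m = max(abs(z) for z in zs)
    return [z / m for z in zs]

LEVELS = make_levels()

def quantize_block(weights):
    """Blockwise absmax quantization: store one FP scale + 4-bit indices."""
    scale = max(abs(w) for w in weights) or 1.0
    idx = [min(range(16), key=lambda i: abs(w / scale - LEVELS[i])) for w in weights]
    return scale, idx

def dequantize_block(scale, idx):
    return [scale * LEVELS[i] for i in idx]

block = [0.31, -0.07, 0.0, 0.52, -0.44, 0.12, -0.29, 0.05]
scale, idx = quantize_block(block)
approx = dequantize_block(scale, idx)
```

Each weight costs 4 bits plus a small amortized share of the per-block scale, which is where the ~0.5 bytes per parameter comes from.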
### 2. CPU Offloading

```bash
zse serve llama-70b.zse --offload-layers 20
```
Keep attention-heavy layers on GPU, offload FFN layers to CPU RAM.
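A rough way to reason about how many layers to offload, assuming Llama-3.1-70B's 80 transformer layers and weights spread roughly evenly across them (a simplification: embeddings and per-layer sizes are not actually uniform, and KV cache plus activations need headroom too, so treat `--offload-layers 20` as a starting point to tune):

```python
# Rough estimate of GPU-resident weight memory under CPU offloading.
TOTAL_NF4_GB = 35.0   # NF4-quantized 70B weights
NUM_LAYERS = 80       # transformer layers in Llama-3.1-70B

def gpu_weight_gb(offloaded_layers):
    """Weight memory left on the GPU after offloading N layers to CPU RAM."""
    per_layer = TOTAL_NF4_GB / NUM_LAYERS
    return (NUM_LAYERS - offloaded_layers) * per_layer

for n in (0, 20, 40):
    print(f"--offload-layers {n}: ~{gpu_weight_gb(n):.1f} GB of weights on GPU")
```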
### 3. 4-bit KV Cache

```bash
zse serve llama-70b.zse --kv-quant int4 --max-context 4096
```
Quantizing the KV cache to 4 bits cuts its memory footprint by 4× relative to FP16.
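Why the KV cache matters: every generated token stores a key and a value vector per layer. Assuming Llama-3.1-70B's published attention shape (80 layers, 8 grouped-query KV heads, head dimension 128), the per-token cost works out as follows:

```python
# KV-cache memory for Llama-3.1-70B (grouped-query attention).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

def kv_bytes_per_token(bytes_per_value):
    # Factor of 2 = one key vector plus one value vector per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value

fp16 = kv_bytes_per_token(2.0) * 4096 / 2**30   # full 4096-token context, FP16
int4 = kv_bytes_per_token(0.5) * 4096 / 2**30   # same context, 4-bit cache
print(f"FP16 KV cache: {fp16:.2f} GiB, INT4: {int4:.2f} GiB")
# FP16 KV cache: 1.25 GiB, INT4: 0.31 GiB
```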
## Full Command

```bash
zse serve llama-70b.zse \
  --offload-layers 20 \
  --kv-quant int4 \
  --max-context 4096 \
  --max-batch 4
```
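Once the server is up, you can send it requests. The sketch below assumes ZSE exposes an OpenAI-style completions endpoint on port 8000; that endpoint path and port are assumptions, so check `zse serve --help` for the actual API shape:

```python
# Build a completion request against an ASSUMED OpenAI-compatible endpoint.
import json
import urllib.request

payload = {
    "prompt": "Explain KV-cache quantization in one sentence.",
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",   # assumed host/port/path
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```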
## Performance Expectations

With this much CPU offloading, expect throughput well below a fully GPU-resident model: transferring offloaded layers over PCIe typically dominates each forward pass, so plan for interactive single-user use rather than high-throughput serving.
## Tips for Best Results
1. **Use SSD storage** - NVMe makes offloading faster
2. **Allocate enough RAM** - 64GB system RAM recommended
3. **Reduce batch size** - Trade throughput for memory
4. **Limit context length** - Shorter contexts use less KV cache
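Tips 3 and 4 in numbers: KV-cache memory grows linearly with both batch size and context length, so halving either halves the cache. Figures below assume Llama-3.1-70B's attention shape (80 layers, 8 KV heads, head dimension 128) with a 4-bit (0.5-byte) cache:

```python
# KV-cache size as a function of batch size and context length.
def kv_cache_gib(batch, context, bytes_per_value=0.5):
    per_token = 2 * 80 * 8 * 128 * bytes_per_value  # key + value, all layers
    return batch * context * per_token / 2**30

for batch, context in [(4, 4096), (2, 4096), (4, 2048)]:
    print(f"batch={batch} context={context}: {kv_cache_gib(batch, context):.2f} GiB")
```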
Now you can run frontier models locally!