
Running 70B Models on a 24GB GPU with ZSE

ZSE Team · February 23, 2026 · 7 min read


Yes, you can run 70B parameter models on a single RTX 4090. Here's how.

The Challenge

A 70B-parameter model in FP16 needs ~140GB of VRAM. Even with 4-bit quantization, the weights alone are ~35GB, still well beyond a single 24GB card.
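The arithmetic behind those numbers, as a quick sanity check:

```python
# Back-of-envelope VRAM math for a 70B-parameter model.
params = 70e9

fp16_gb = params * 2 / 1e9    # FP16: 2 bytes per weight
nf4_gb = params * 0.5 / 1e9   # 4-bit: half a byte per weight

print(f"FP16 weights: ~{fp16_gb:.0f} GB")   # ~140 GB
print(f"4-bit weights: ~{nf4_gb:.0f} GB")   # ~35 GB
```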

ZSE's Solution

Combine multiple techniques:

1. NF4 Quantization

zse convert meta-llama/Llama-3.1-70B-Instruct -o llama-70b.zse --quant nf4

Brings the weights down to ~35GB — still too large on its own, which is why the next two steps matter.
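Under the hood, NF4 is blockwise 4-bit quantization: each block of weights stores one full-precision scale plus 4-bit indices into a fixed 16-value codebook. Here's a minimal sketch of the scheme — note it uses evenly spaced levels for simplicity (real NF4 uses standard-normal quantiles), and it is an illustration, not ZSE's implementation:

```python
import numpy as np

CODEBOOK = np.linspace(-1.0, 1.0, 16)  # illustrative; NF4 uses normal quantiles
BLOCK = 64                             # weights are quantized in blocks

def quantize(w):
    blocks = w.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # one FP scale per block
    normed = blocks / scales
    idx = np.abs(normed[..., None] - CODEBOOK).argmin(axis=-1)  # 4-bit indices
    return idx.astype(np.uint8), scales

def dequantize(idx, scales):
    return (CODEBOOK[idx] * scales).ravel()

w = np.random.default_rng(0).uniform(-1, 1, 256).astype(np.float32)
idx, scales = quantize(w)
err = np.abs(dequantize(idx, scales) - w).max()
print(f"max reconstruction error: {err:.3f}")
```

Each weight costs 4 bits plus a small amortized share of the per-block scale, which is how 140GB of FP16 weights become ~35GB.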

2. CPU Offloading

zse serve llama-70b.zse --offload-layers 20

Keep attention-heavy layers on GPU, offload FFN layers to CPU RAM.
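To see what that buys, here's the rough arithmetic, assuming the ~35GB of 4-bit weights are spread evenly across 80 transformer layers (both assumptions about the model, not ZSE internals):

```python
total_layers = 80    # assumed Llama-3.1-70B transformer layer count
weights_gb = 35.0    # 4-bit weight footprint
offloaded = 20       # from --offload-layers 20

per_layer_gb = weights_gb / total_layers
cpu_gb = offloaded * per_layer_gb
print(f"~{per_layer_gb:.2f} GB/layer, ~{cpu_gb:.2f} GB moved to CPU RAM")
```

Offloaded weights have to cross the PCIe bus during the forward pass, which is the main reason throughput drops versus a fully GPU-resident model.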

3. 4-bit KV Cache

zse serve llama-70b.zse --kv-quant int4 --max-context 4096

Quantizing the KV cache from FP16 to INT4 reduces its memory by 4×.
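The savings are easy to estimate: per token, the cache stores one key and one value vector per layer per KV head. For Llama-3.1-70B we assume 80 layers, 8 KV heads (grouped-query attention), and a head dimension of 128 — model-config assumptions, not ZSE internals:

```python
layers, kv_heads, head_dim = 80, 8, 128   # assumed Llama-3.1-70B shape
ctx = 4096

def kv_cache_gb(bytes_per_value):
    # 2 = one key vector + one value vector per layer per KV head
    return 2 * layers * kv_heads * head_dim * bytes_per_value * ctx / 1e9

fp16_gb = kv_cache_gb(2)    # FP16 cache
int4_gb = kv_cache_gb(0.5)  # INT4 cache
print(f"FP16: ~{fp16_gb:.2f} GB, INT4: ~{int4_gb:.2f} GB per 4096-token sequence")
```

Multiply by the batch size for the total; at `--max-batch 4` the INT4 cache stays under 1.5GB.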

Full Command

zse serve llama-70b.zse \
  --offload-layers 20 \
  --kv-quant int4 \
  --max-context 4096 \
  --max-batch 4

Performance Expectations

| GPU | Throughput | Latency |
| --- | --- | --- |
| RTX 4090 24GB | ~15 tok/s | ~200ms TTFT |
| RTX 3090 24GB | ~10 tok/s | ~300ms TTFT |
| A100 80GB | ~45 tok/s | ~80ms TTFT |

Tips for Best Results

1. **Use SSD storage** - NVMe makes offloading faster

2. **Allocate enough RAM** - 64GB system RAM recommended

3. **Reduce batch size** - Trade throughput for memory

4. **Limit context length** - Shorter contexts use less KV cache
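Tips 3 and 4 in numbers: KV-cache memory scales linearly with both batch size and context length. Assuming Llama-3.1-70B's attention shape of 80 layers, 8 KV heads, and head dimension 128 (model-config assumptions), with the INT4 cache:

```python
def kv_gb(batch, ctx, layers=80, kv_heads=8, head_dim=128, bytes_per_value=0.5):
    # 2 = key + value; bytes_per_value=0.5 is the INT4 cache
    return 2 * layers * kv_heads * head_dim * bytes_per_value * ctx * batch / 1e9

print(f"batch 4, ctx 4096: ~{kv_gb(4, 4096):.2f} GB")  # ~1.34 GB
print(f"batch 1, ctx 2048: ~{kv_gb(1, 2048):.2f} GB")  # ~0.17 GB
```

Halving either the batch size or the context length halves the cache, so these two knobs are the cheapest way to claw back VRAM when you're close to the limit.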

Now you can run frontier models locally!
