# Running 70B Models on a 24GB GPU with ZSE
Yes, you can run 70B-parameter models on a single RTX 4090. Here's how.
## The Challenge
A 70B model in FP16 needs ~140GB of VRAM (2 bytes per parameter). Even with 4-bit quantization, the weights alone take ~35GB, still well beyond a 24GB card.
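The arithmetic behind those numbers is simple: FP16 stores 2 bytes per parameter, while 4-bit formats store about half a byte.

```python
# Back-of-the-envelope VRAM math for a 70B-parameter model.
params = 70e9

fp16_gb = params * 2 / 1e9    # 2 bytes per parameter
nf4_gb = params * 0.5 / 1e9   # ~0.5 bytes per parameter at 4-bit

print(f"FP16: {fp16_gb:.0f} GB, NF4: {nf4_gb:.0f} GB")  # FP16: 140 GB, NF4: 35 GB
```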
## ZSE's Solution

No single trick gets you under 24GB, so ZSE combines several:
### 1. NF4 Quantization

```bash
zse convert meta-llama/Llama-3.1-70B-Instruct -o llama-70b.zse --quant nf4
```
NF4 (4-bit NormalFloat) quantization brings the weights down to ~35GB.
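The idea behind NF4: weights in trained networks are roughly normally distributed, so instead of 16 evenly spaced 4-bit levels, it uses levels placed at quantiles of a normal distribution, and quantizes small blocks of weights against each block's absolute maximum. A pure-Python sketch of that idea (the levels here are derived from normal quantiles for illustration, not the exact NF4 codebook):

```python
from statistics import NormalDist

def make_levels(bits=4):
    """Levels at evenly spaced quantiles of N(0,1), rescaled into [-1, 1].
    A simplified stand-in for the exact NF4 codebook."""
    n = 2 ** bits
    zs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    m = max(abs(z) for z in zs)
    return [z / m for z in zs]

LEVELS = make_levels()

def quantize_block(weights):
    """Blockwise absmax quantization: store one FP scale + 4-bit indices."""
    scale = max(abs(w) for w in weights) or 1.0
    idx = [min(range(16), key=lambda i: abs(w / scale - LEVELS[i])) for w in weights]
    return scale, idx

def dequantize_block(scale, idx):
    return [scale * LEVELS[i] for i in idx]

block = [0.31, -0.07, 0.0, 0.52, -0.44, 0.12, -0.29, 0.05]
scale, idx = quantize_block(block)
approx = dequantize_block(scale, idx)
```

Each weight costs 4 bits plus a small amortized share of the per-block scale, which is where the ~0.5 bytes per parameter comes from.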
### 2. CPU Offloading

```bash
zse serve llama-70b.zse --offload-layers 20
```
Keep attention-heavy layers on GPU, offload FFN layers to CPU RAM.
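A rough way to reason about how many layers to offload, assuming Llama-3.1-70B's 80 transformer layers and weights spread roughly evenly across them (a simplification: embeddings and per-layer sizes are not actually uniform, and KV cache plus activations need headroom too, so treat `--offload-layers 20` as a starting point to tune):

```python
# Rough estimate of GPU-resident weight memory under CPU offloading.
TOTAL_NF4_GB = 35.0   # NF4-quantized 70B weights
NUM_LAYERS = 80       # transformer layers in Llama-3.1-70B

def gpu_weight_gb(offloaded_layers):
    """Weight memory left on the GPU after offloading N layers to CPU RAM."""
    per_layer = TOTAL_NF4_GB / NUM_LAYERS
    return (NUM_LAYERS - offloaded_layers) * per_layer

for n in (0, 20, 40):
    print(f"--offload-layers {n}: ~{gpu_weight_gb(n):.1f} GB of weights on GPU")
```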
### 3. 4-bit KV Cache

```bash
zse serve llama-70b.zse --kv-quant int4 --max-context 4096
```
Quantizing the KV cache to 4 bits cuts its memory footprint by 4× relative to FP16.
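Why the KV cache matters: every generated token stores a key and a value vector per layer. Assuming Llama-3.1-70B's published attention shape (80 layers, 8 grouped-query KV heads, head dimension 128), the per-token cost works out as follows:

```python
# KV-cache memory for Llama-3.1-70B (grouped-query attention).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

def kv_bytes_per_token(bytes_per_value):
    # Factor of 2 = one key vector plus one value vector per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value

fp16 = kv_bytes_per_token(2.0) * 4096 / 2**30   # full 4096-token context, FP16
int4 = kv_bytes_per_token(0.5) * 4096 / 2**30   # same context, 4-bit cache
print(f"FP16 KV cache: {fp16:.2f} GiB, INT4: {int4:.2f} GiB")
# FP16 KV cache: 1.25 GiB, INT4: 0.31 GiB
```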
## Full Command

```bash
zse serve llama-70b.zse \
  --offload-layers 20 \
  --kv-quant int4 \
  --max-context 4096 \
  --max-batch 4
```
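Once the server is up, you can send it requests. The sketch below assumes ZSE exposes an OpenAI-style completions endpoint on port 8000; that endpoint path and port are assumptions, so check `zse serve --help` for the actual API shape:

```python
# Build a completion request against an ASSUMED OpenAI-compatible endpoint.
import json
import urllib.request

payload = {
    "prompt": "Explain KV-cache quantization in one sentence.",
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",   # assumed host/port/path
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```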
## Performance Expectations

With this much CPU offloading, expect throughput well below a fully GPU-resident model: transferring offloaded layers over PCIe typically dominates each forward pass, so plan for interactive single-user use rather than high-throughput serving.
## Tips for Best Results
1. **Use SSD storage** - NVMe makes offloading faster
2. **Allocate enough RAM** - 64GB system RAM recommended
3. **Reduce batch size** - Trade throughput for memory
4. **Limit context length** - Shorter contexts use less KV cache
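Tips 3 and 4 in numbers: KV-cache memory grows linearly with both batch size and context length, so halving either halves the cache. Figures below assume Llama-3.1-70B's attention shape (80 layers, 8 KV heads, head dimension 128) with a 4-bit (0.5-byte) cache:

```python
# KV-cache size as a function of batch size and context length.
def kv_cache_gib(batch, context, bytes_per_value=0.5):
    per_token = 2 * 80 * 8 * 128 * bytes_per_value  # key + value, all layers
    return batch * context * per_token / 2**30

for batch, context in [(4, 4096), (2, 4096), (4, 2048)]:
    print(f"batch={batch} context={context}: {kv_cache_gib(batch, context):.2f} GiB")
```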
Now you can run frontier models locally!