zKV

Intelligent KV cache management for optimal memory usage and prompt caching.

Overview

zKV manages the key-value cache used during inference, enabling prompt caching, memory optimization, and long-context conversations.

  • Prompt Caching: reuse computations for repeated prompts
  • 4-bit KV: compressed cache for long contexts
  • Persistence: save/load KV state to disk

  • Automatic prompt prefix caching
  • 4-bit KV cache compression
  • Paged attention for dynamic memory
  • Disk-backed KV for very long contexts
  • Multi-user cache isolation

How It Works

The KV (key-value) cache stores intermediate computations from the attention mechanism. Without caching, these must be recomputed for every token generated.

text
Without KV Cache:
┌──────────────────────────────────────────────────────┐
│ Prompt: "The quick brown fox"                        │
│ → Compute attention for all tokens (4 forward passes)│
│ → Generate "jumps" (recompute all + new token)       │
│ → Generate "over" (recompute all + new tokens)       │
│ → Total: O(n²) computations                          │
└──────────────────────────────────────────────────────┘

With KV Cache:
┌──────────────────────────────────────────────────────┐
│ Prompt: "The quick brown fox"                        │
│ → Compute attention, STORE in KV cache               │
│ → Generate "jumps" (reuse cache + 1 new computation) │
│ → Generate "over" (reuse cache + 1 new computation)  │
│ → Total: O(n) computations                           │
└──────────────────────────────────────────────────────┘

KV caching typically provides 10-100x speedup for generation after the initial prompt processing.
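
The contrast can be sketched with a toy counter of key/value computations per generation step (illustrative only, not ZSE internals):

```python
def kv_computations_without_cache(n_prompt, n_gen):
    # Every step recomputes K/V for the entire sequence so far.
    total = n_prompt          # initial prompt pass
    seq_len = n_prompt
    for _ in range(n_gen):
        seq_len += 1
        total += seq_len      # recompute everything plus the new token
    return total

def kv_computations_with_cache(n_prompt, n_gen):
    # Prompt K/V computed once, then one new cache entry per generated token.
    return n_prompt + n_gen

print(kv_computations_without_cache(4, 2))  # → 15, as in the diagram
print(kv_computations_with_cache(4, 2))     # → 6
```

For a prompt and generation both of length n, the uncached count grows like n²/2 while the cached count grows like 2n, which is the O(n²) versus O(n) contrast above.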

Configuration

Cache Size

bash
# Set maximum context length (determines cache size)
zse serve model.zse --max-context 8192
# Set maximum cache memory
zse serve model.zse --kv-cache-memory 4GB
# Dynamic cache sizing
zse serve model.zse --kv-cache dynamic

Memory requirements per context length (7B model):

| Context | FP16 KV | 4-bit KV |
|---------|---------|----------|
| 4,096   | 1.0 GB  | 0.3 GB   |
| 8,192   | 2.0 GB  | 0.5 GB   |
| 32,768  | 8.0 GB  | 2.0 GB   |
| 131,072 | 32 GB   | 8.0 GB   |
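
These figures follow from the per-token KV size: 2 (keys and values) × layers × KV heads × head dim × bytes per element. A back-of-the-envelope sketch, assuming 7B-class dimensions (32 layers, 16 KV heads of dim 128, i.e. grouped-query attention; check the model card for the real values):

```python
def kv_cache_bytes(context, n_layers=32, n_kv_heads=16, head_dim=128, bytes_per_elem=2):
    # 2x for keys and values, one vector per token per layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context

print(kv_cache_bytes(4096) / 2**30)                        # → 1.0 (GiB, FP16)
print(kv_cache_bytes(131072, bytes_per_elem=0.5) / 2**30)  # → 8.0 (GiB, int4)
```

The results match the table to rounding; 4-bit entries in practice carry a little extra for quantization scales.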

Prompt Caching

ZSE automatically caches common prompt prefixes:

python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# First request - full computation
response1 = model.chat([
    {"role": "system", "content": "You are a helpful assistant..."},  # Cached
    {"role": "user", "content": "Hello!"},
])  # ~500ms

# Second request - reuses system prompt cache
response2 = model.chat([
    {"role": "system", "content": "You are a helpful assistant..."},  # From cache!
    {"role": "user", "content": "How are you?"},
])  # ~50ms (10x faster)

Configure prompt caching:

zse.yaml
kv_cache:
  prompt_cache: true
  prompt_cache_size: 1GB
  prompt_cache_ttl: 3600  # seconds
  # Cache specific prefixes
  prefixes:
    - name: "default_system"
      content: "You are a helpful assistant..."
      preload: true
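
Conceptually, prefix caching amounts to looking up the longest already-computed token prefix and only computing the remainder. A minimal sketch (the PrefixCache class is hypothetical, not the ZSE API):

```python
class PrefixCache:
    """Toy prompt-prefix cache keyed by token prefixes."""

    def __init__(self):
        self._store = {}  # prefix tuple -> cached KV state

    def put(self, tokens, state):
        self._store[tuple(tokens)] = state

    def longest_prefix(self, tokens):
        # Walk back from the full sequence to the longest cached prefix.
        for end in range(len(tokens), 0, -1):
            state = self._store.get(tuple(tokens[:end]))
            if state is not None:
                return end, state
        return 0, None

cache = PrefixCache()
cache.put(["<sys>", "You", "are", "helpful"], state="kv-for-system-prompt")
hit_len, state = cache.longest_prefix(["<sys>", "You", "are", "helpful", "Hello"])
# hit_len == 4: only the new user turn needs fresh computation
```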

KV Cache Compression

Enable 4-bit KV cache for longer contexts:

bash
# Enable 4-bit KV cache
zse serve model.zse --kv-quant int4
# 8-bit KV cache (better quality)
zse serve model.zse --kv-quant int8

python
from zllm_zse import ZSE

# Enable quantized KV cache
model = ZSE("qwen-7b.zse", kv_quant="int4")

# Now supports 4x longer contexts
response = model.chat(
    messages=[...],
    max_context=131072  # 128K context
)
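
To see where the 4x memory saving comes from, here is a sketch of symmetric int4 quantization with a per-block scale (illustrative; the scheme ZSE actually uses is not specified here and may differ):

```python
def quantize_int4(values):
    # One scale per block; int4 holds values in [-8, 7].
    scale = max(abs(v) for v in values) / 7 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize_int4(q, scale):
    return [x * scale for x in q]

q, scale = quantize_int4([0.1, -0.4, 0.7, -0.2])
approx = dequantize_int4(q, scale)
# 4 bits per value instead of 16: 4x smaller cache, small rounding error
```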

Persistent Cache

Save and restore KV cache for long-running conversations:

python
from zllm_zse import ZSE
model = ZSE("qwen-7b.zse")
# Start conversation
session = model.create_session()
session.chat([{"role": "user", "content": "My name is Alice"}])
session.chat([{"role": "user", "content": "I live in New York"}])
# Save session state (includes KV cache)
session.save("alice_session.zse")
# Later: restore session
session = model.load_session("alice_session.zse")
response = session.chat([{"role": "user", "content": "What's my name?"}])
# → "Your name is Alice"
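
Conceptually, a saved session bundles the conversation history with the serialized cache state. A rough sketch using pickle (the on-disk .zse session format is ZSE's own; the fields below are illustrative):

```python
import os
import pickle
import tempfile

session_state = {
    "messages": [
        {"role": "user", "content": "My name is Alice"},
        {"role": "user", "content": "I live in New York"},
    ],
    # Placeholder for the serialized KV tensors a real session would hold
    "kv_cache": b"<kv tensors>",
}

path = os.path.join(tempfile.gettempdir(), "alice_session.pkl")
with open(path, "wb") as f:
    pickle.dump(session_state, f)

# Later: restore and continue where the conversation left off
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Because the KV cache rides along with the messages, the restored session answers "What's my name?" without reprocessing the whole history.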

Session Management API

python
from zllm_zse import ZSE, Session
model = ZSE("qwen-7b.zse")
# Create named session
session = model.create_session(name="user_123")
# List sessions
sessions = model.list_sessions()
print(sessions) # ['user_123', 'user_456', ...]
# Get session info
info = session.info()
print(info)
# {
# 'name': 'user_123',
# 'context_length': 1024,
# 'cache_size_bytes': 52428800,
# 'created_at': '2024-01-15T10:30:00Z'
# }
# Clear session
session.clear()
# Delete session
model.delete_session("user_123")

Memory Optimization

Paged Attention

Paged attention allocates KV cache in fixed-size blocks, reducing memory fragmentation:

bash
# Enable paged attention
zse serve model.zse --paged-attention
# Configure block size
zse serve model.zse --paged-attention --block-size 16

  • Reduces memory fragmentation
  • Enables dynamic batch sizes
  • More concurrent requests with same memory
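
Block-based allocation can be sketched with a simple free list (bookkeeping only; real paged attention indexes GPU memory through a per-sequence block table):

```python
class BlockAllocator:
    """Toy fixed-size block allocator in the spirit of paged attention."""

    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(n_blocks))

    def alloc_for_tokens(self, n_tokens):
        # Round up to whole blocks; a sequence holds a list of block ids.
        needed = -(-n_tokens // self.block_size)
        if needed > len(self.free):
            raise MemoryError("KV cache exhausted")
        return [self.free.pop() for _ in range(needed)]

    def release(self, blocks):
        self.free.extend(blocks)

alloc = BlockAllocator(n_blocks=64)
seq_blocks = alloc.alloc_for_tokens(100)  # 100 tokens -> 7 blocks of 16
alloc.release(seq_blocks)                 # freed blocks serve new requests
```

Because sequences grow one block at a time instead of reserving a full max-context slab up front, the same memory admits more concurrent requests.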

Memory Tiers

ZSE can use CPU memory and disk as overflow for KV cache:

zse.yaml
kv_cache:
  tiers:
    - type: gpu
      size: 8GB
      priority: 1
    - type: cpu
      size: 32GB
      priority: 2
    - type: disk
      path: /tmp/zse_kv_cache
      size: 100GB
      priority: 3
  eviction: lru  # evict least-recently-used entries first

CPU and disk tiers add latency. Use GPU memory for latency-sensitive workloads.
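
The lru eviction policy can be sketched in a few lines (in a real tier, evicted entries would spill to the next tier rather than being dropped):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU eviction sketch, not the ZSE implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop least-recently-used

tier = LRUCache(capacity=2)
tier.put("sess_a", "kv_a")
tier.put("sess_b", "kv_b")
tier.get("sess_a")          # touch a, so b becomes least recently used
tier.put("sess_c", "kv_c")  # evicts sess_b
```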

Cache Statistics

python
from zllm_zse import ZSE
model = ZSE("qwen-7b.zse")
# Get cache statistics
stats = model.kv_cache_stats()
print(stats)
# {
# 'total_size': 8589934592, # 8 GB
# 'used_size': 2147483648, # 2 GB
# 'prompt_cache_hits': 1542,
# 'prompt_cache_misses': 89,
# 'hit_rate': 0.945,
# 'evictions': 23
# }
# Clear cache
model.clear_kv_cache()
# Clear specific session
model.clear_kv_cache(session="user_123")
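
The hit_rate field is simply hits over total lookups, which you can check against the numbers above:

```python
hits, misses = 1542, 89
hit_rate = hits / (hits + misses)
print(round(hit_rate, 3))  # → 0.945
```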