zKV
Intelligent KV cache management for optimal memory usage and prompt caching.
Overview
zKV manages the key-value cache used during inference, enabling prompt caching, memory optimization, and long-context conversations.
Prompt Caching
Reuse computations for repeated prompts
4-bit KV
Compressed cache for long contexts
Persistence
Save/load KV state to disk
- Automatic prompt prefix caching
- 4-bit KV cache compression
- Paged attention for dynamic memory
- Disk-backed KV for very long contexts
- Multi-user cache isolation
How It Works
The KV (key-value) cache stores intermediate computations from the attention mechanism. Without caching, these must be recomputed for every token generated.
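This bookkeeping can be sketched with a toy operation count. The counting is real; the "model" is not — no attention math is performed, only the number of fresh K/V computations per step is tracked:

```python
# Toy illustration of why a KV cache turns quadratic total work into linear.
# Each "op" stands for computing K/V for one sequence position.

def generate_without_cache(prompt_len, new_tokens):
    """Every step re-encodes the whole sequence so far."""
    ops = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        ops += seq_len  # recompute K/V for every position
        seq_len += 1
    return ops

def generate_with_cache(prompt_len, new_tokens):
    """K/V for past positions are stored once and reused."""
    ops = prompt_len  # one prefill pass fills the cache
    for _ in range(new_tokens):
        ops += 1  # only the newest token needs fresh K/V
    return ops

print(generate_without_cache(4, 16))  # → 184
print(generate_with_cache(4, 16))     # → 20
```

The gap widens with sequence length: for a 4,000-token prompt the cached version does roughly n times less K/V work over a full generation.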
Without KV Cache:

```
┌──────────────────────────────────────────────────────┐
│ Prompt: "The quick brown fox"                        │
│ → Compute attention for all tokens (4 forward passes)│
│ → Generate "jumps" (recompute all + new token)       │
│ → Generate "over" (recompute all + new tokens)       │
│ → Total: O(n²) computations                          │
└──────────────────────────────────────────────────────┘
```

With KV Cache:

```
┌──────────────────────────────────────────────────────┐
│ Prompt: "The quick brown fox"                        │
│ → Compute attention, STORE in KV cache               │
│ → Generate "jumps" (reuse cache + 1 new computation) │
│ → Generate "over" (reuse cache + 1 new computation)  │
│ → Total: O(n) computations                           │
└──────────────────────────────────────────────────────┘
```

Configuration
Cache Size
```bash
# Set maximum context length (determines cache size)
zse serve model.zse --max-context 8192

# Set maximum cache memory
zse serve model.zse --kv-cache-memory 4GB

# Dynamic cache sizing
zse serve model.zse --kv-cache dynamic
```

Memory requirements per context length (7B model):
| Context | FP16 KV | 4-bit KV |
|---|---|---|
| 4,096 | 1.0 GB | 0.3 GB |
| 8,192 | 2.0 GB | 0.5 GB |
| 32,768 | 8.0 GB | 2.0 GB |
| 131,072 | 32 GB | 8.0 GB |
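As a sanity check on the table, KV memory is 2 tensors (keys and values) per layer, per KV head, per position. The dimensions below (32 layers, 16 KV heads, head dimension 128) are an assumption chosen because they reproduce the FP16 column; they are not a documented spec for any particular 7B model:

```python
# Back-of-the-envelope KV cache sizing.
# 2x for keys and values, stored for every layer, KV head, and position.

def kv_cache_bytes(context, n_layers, n_kv_heads, head_dim, bits_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * context * bits_per_elem // 8

GB = 1024 ** 3
for context in (4096, 8192, 32768, 131072):
    fp16 = kv_cache_bytes(context, 32, 16, 128, 16) / GB
    int4 = kv_cache_bytes(context, 32, 16, 128, 4) / GB
    print(f"{context:>6}: {fp16:.1f} GB FP16, {int4:.2f} GB 4-bit")
```

Note that the cost is linear in context length, which is why the 4-bit column matters: quartering bytes per element quadruples the context that fits in the same memory. (The table's 0.3 GB at 4,096 appears to be a rounded 0.25 GB.)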
Prompt Caching
ZSE automatically caches common prompt prefixes:
```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# First request - full computation
response1 = model.chat([
    {"role": "system", "content": "You are a helpful assistant..."},  # Cached
    {"role": "user", "content": "Hello!"},
])  # ~500ms

# Second request - reuses system prompt cache
response2 = model.chat([
    {"role": "system", "content": "You are a helpful assistant..."},  # From cache!
    {"role": "user", "content": "How are you?"},
])  # ~50ms (10x faster)
```

Configure prompt caching:
```yaml
kv_cache:
  prompt_cache: true
  prompt_cache_size: 1GB
  prompt_cache_ttl: 3600  # seconds

  # Cache specific prefixes
  prefixes:
    - name: "default_system"
      content: "You are a helpful assistant..."
      preload: true
```

KV Cache Compression
Enable 4-bit KV cache for longer contexts:
```bash
# Enable 4-bit KV cache
zse serve model.zse --kv-quant int4

# 8-bit KV cache (better quality)
zse serve model.zse --kv-quant int8
```

```python
from zllm_zse import ZSE

# Enable quantized KV cache
model = ZSE("qwen-7b.zse", kv_quant="int4")

# Now supports 4x longer contexts!
response = model.chat(
    messages=[...],
    max_context=131072,  # 128K context
)
```

Persistent Cache
Save and restore KV cache for long-running conversations:
```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# Start conversation
session = model.create_session()
session.chat([{"role": "user", "content": "My name is Alice"}])
session.chat([{"role": "user", "content": "I live in New York"}])

# Save session state (includes KV cache)
session.save("alice_session.zse")

# Later: restore session
session = model.load_session("alice_session.zse")
response = session.chat([{"role": "user", "content": "What's my name?"}])
# → "Your name is Alice"
```

Session Management API
```python
from zllm_zse import ZSE, Session

model = ZSE("qwen-7b.zse")

# Create named session
session = model.create_session(name="user_123")

# List sessions
sessions = model.list_sessions()
print(sessions)  # ['user_123', 'user_456', ...]

# Get session info
info = session.info()
print(info)
# {
#   'name': 'user_123',
#   'context_length': 1024,
#   'cache_size_bytes': 52428800,
#   'created_at': '2024-01-15T10:30:00Z'
# }

# Clear session
session.clear()

# Delete session
model.delete_session("user_123")
```

Memory Optimization
Paged Attention
Paged attention allocates KV cache in fixed-size blocks, reducing memory fragmentation:
```bash
# Enable paged attention
zse serve model.zse --paged-attention

# Configure block size
zse serve model.zse --paged-attention --block-size 16
```

- Reduces memory fragmentation
- Enables dynamic batch sizes
- More concurrent requests with same memory
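The block-table idea behind paged attention can be sketched as a free-list of fixed-size physical blocks plus a per-sequence table of block indices. The names below are illustrative, not ZSE internals, and the blocks here hold nothing — only the allocation logic is shown:

```python
# Minimal sketch of paged KV allocation: sequences grow one token at a
# time and grab a whole new block only when the current one fills up.

class PagedKVPool:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free-list of physical blocks
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full: allocate
            if not self.free:
                raise MemoryError("KV pool exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # Finished sequences hand back whole blocks: no fragmentation
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

pool = PagedKVPool(num_blocks=4, block_size=16)
for _ in range(20):
    pool.append_token("user_123")
print(len(pool.tables["user_123"]))  # → 2 (two 16-token blocks cover 20 tokens)
pool.release("user_123")
print(len(pool.free))                # → 4 (all blocks reclaimed)
```

Because every allocation is a whole block, the worst-case waste per sequence is one partially filled block, instead of a contiguous over-allocation for the maximum possible context.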
Memory Tiers
ZSE can use CPU memory and disk as overflow for KV cache:
```yaml
kv_cache:
  tiers:
    - type: gpu
      size: 8GB
      priority: 1
    - type: cpu
      size: 32GB
      priority: 2
    - type: disk
      path: /tmp/zse_kv_cache
      size: 100GB
      priority: 3

  eviction: lru  # Evict least-recently-used
```

Cache Statistics
```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# Get cache statistics
stats = model.kv_cache_stats()
print(stats)
# {
#   'total_size': 8589934592,  # 8 GB
#   'used_size': 2147483648,   # 2 GB
#   'prompt_cache_hits': 1542,
#   'prompt_cache_misses': 89,
#   'hit_rate': 0.945,
#   'evictions': 23
# }

# Clear cache
model.clear_kv_cache()

# Clear specific session
model.clear_kv_cache(session="user_123")
```
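The `eviction: lru` policy from the tier configuration can be sketched with Python's `collections.OrderedDict`. This toy version counts entries rather than bytes; a real tiered cache would evict by size and demote entries to the next tier instead of dropping them:

```python
# Sketch of LRU eviction for cached KV entries (toy: capacity in entries).
from collections import OrderedDict

class LRUKVCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> cached KV blob

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)          # mark as recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # drop least-recently-used

cache = LRUKVCache(capacity=2)
cache.put("sys_prompt_a", b"...")
cache.put("sys_prompt_b", b"...")
cache.get("sys_prompt_a")        # touch A, so B becomes the eviction victim
cache.put("sys_prompt_c", b"...")
print(sorted(cache.entries))     # → ['sys_prompt_a', 'sys_prompt_c']
```

This is also the mechanism that makes the `hit_rate` statistic above meaningful: a high hit rate with few evictions suggests the cache budget fits the working set of prompts.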