zKV

Intelligent KV cache management for optimal memory usage and prompt caching.

Overview

zKV manages the key-value cache used during inference, enabling prompt caching, memory optimization, and long-context conversations.

  • Prompt Caching: reuse computations for repeated prompts
  • 4-bit KV: compressed cache for long contexts
  • Persistence: save/load KV state to disk

  • Automatic prompt prefix caching
  • 4-bit KV cache compression
  • Paged attention for dynamic memory
  • Disk-backed KV for very long contexts
  • Multi-user cache isolation

How It Works

The KV (key-value) cache stores intermediate computations from the attention mechanism. Without caching, these must be recomputed for every token generated.

text
Without KV Cache:
┌──────────────────────────────────────────────────────┐
│ Prompt: "The quick brown fox"                        │
│ → Compute attention for all tokens (4 forward passes)│
│ → Generate "jumps" (recompute all + new token)       │
│ → Generate "over" (recompute all + new tokens)       │
│ → Total: O(n²) computations                          │
└──────────────────────────────────────────────────────┘

With KV Cache:
┌──────────────────────────────────────────────────────┐
│ Prompt: "The quick brown fox"                        │
│ → Compute attention, STORE in KV cache               │
│ → Generate "jumps" (reuse cache + 1 new computation) │
│ → Generate "over" (reuse cache + 1 new computation)  │
│ → Total: O(n) computations                           │
└──────────────────────────────────────────────────────┘

KV caching typically provides 10-100x speedup for generation after the initial prompt processing.
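
The contrast can be sketched with a toy counter of key/value computations per generation step (illustrative only, not ZSE internals):

```python
def kv_computations_without_cache(n_prompt, n_gen):
    # Every step recomputes K/V for the entire sequence so far.
    total = n_prompt          # initial prompt pass
    seq_len = n_prompt
    for _ in range(n_gen):
        seq_len += 1
        total += seq_len      # recompute everything plus the new token
    return total

def kv_computations_with_cache(n_prompt, n_gen):
    # Prompt K/V computed once, then one new cache entry per generated token.
    return n_prompt + n_gen

print(kv_computations_without_cache(4, 2))  # → 15, as in the diagram
print(kv_computations_with_cache(4, 2))     # → 6
```

For a prompt and generation both of length n, the uncached count grows like n²/2 while the cached count grows like 2n, which is the O(n²) versus O(n) contrast above.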

Configuration

Cache Size

bash
# Set maximum context length (determines cache size)
zse serve model.zse --max-context 8192
# Set maximum cache memory
zse serve model.zse --kv-cache-memory 4GB
# Dynamic cache sizing
zse serve model.zse --kv-cache dynamic

Memory requirements per context length (7B model):

| Context | FP16 KV | 4-bit KV |
|---------|---------|----------|
| 4,096   | 1.0 GB  | 0.3 GB   |
| 8,192   | 2.0 GB  | 0.5 GB   |
| 32,768  | 8.0 GB  | 2.0 GB   |
| 131,072 | 32 GB   | 8.0 GB   |
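
These figures follow from the per-token KV size: 2 (keys and values) × layers × KV heads × head dim × bytes per element. A back-of-the-envelope sketch, assuming 7B-class dimensions (32 layers, 16 KV heads of dim 128, i.e. grouped-query attention; check the model card for the real values):

```python
def kv_cache_bytes(context, n_layers=32, n_kv_heads=16, head_dim=128, bytes_per_elem=2):
    # 2x for keys and values, one vector per token per layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context

print(kv_cache_bytes(4096) / 2**30)                        # → 1.0 (GiB, FP16)
print(kv_cache_bytes(131072, bytes_per_elem=0.5) / 2**30)  # → 8.0 (GiB, int4)
```

The results match the table to rounding; 4-bit entries in practice carry a little extra for quantization scales.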

Prompt Caching

ZSE automatically caches common prompt prefixes:

python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# First request - full computation
response1 = model.chat([
    {"role": "system", "content": "You are a helpful assistant..."},  # Cached
    {"role": "user", "content": "Hello!"},
])  # ~500ms

# Second request - reuses system prompt cache
response2 = model.chat([
    {"role": "system", "content": "You are a helpful assistant..."},  # From cache!
    {"role": "user", "content": "How are you?"},
])  # ~50ms (10x faster)

Configure prompt caching:

zse.yaml
kv_cache:
  prompt_cache: true
  prompt_cache_size: 1GB
  prompt_cache_ttl: 3600  # seconds
  # Cache specific prefixes
  prefixes:
    - name: "default_system"
      content: "You are a helpful assistant..."
      preload: true
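
Conceptually, prefix caching amounts to looking up the longest already-computed token prefix and only computing the remainder. A minimal sketch (the PrefixCache class is hypothetical, not the ZSE API):

```python
class PrefixCache:
    """Toy prompt-prefix cache keyed by token prefixes."""

    def __init__(self):
        self._store = {}  # prefix tuple -> cached KV state

    def put(self, tokens, state):
        self._store[tuple(tokens)] = state

    def longest_prefix(self, tokens):
        # Walk back from the full sequence to the longest cached prefix.
        for end in range(len(tokens), 0, -1):
            state = self._store.get(tuple(tokens[:end]))
            if state is not None:
                return end, state
        return 0, None

cache = PrefixCache()
cache.put(["<sys>", "You", "are", "helpful"], state="kv-for-system-prompt")
hit_len, state = cache.longest_prefix(["<sys>", "You", "are", "helpful", "Hello"])
# hit_len == 4: only the new user turn needs fresh computation
```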

KV Cache Compression

Enable 4-bit KV cache for longer contexts:

bash
# Enable 4-bit KV cache
zse serve model.zse --kv-quant int4
# 8-bit KV cache (better quality)
zse serve model.zse --kv-quant int8

python
from zllm_zse import ZSE

# Enable quantized KV cache
model = ZSE("qwen-7b.zse", kv_quant="int4")

# Now supports 4x longer contexts
response = model.chat(
    messages=[...],
    max_context=131072  # 128K context
)
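
To see where the 4x memory saving comes from, here is a sketch of symmetric int4 quantization with a per-block scale (illustrative; the scheme ZSE actually uses is not specified here and may differ):

```python
def quantize_int4(values):
    # One scale per block; int4 holds values in [-8, 7].
    scale = max(abs(v) for v in values) / 7 or 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize_int4(q, scale):
    return [x * scale for x in q]

q, scale = quantize_int4([0.1, -0.4, 0.7, -0.2])
approx = dequantize_int4(q, scale)
# 4 bits per value instead of 16: 4x smaller cache, small rounding error
```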

Persistent Cache

Save and restore KV cache for long-running conversations:

python
from zllm_zse import ZSE
model = ZSE("qwen-7b.zse")
# Start conversation
session = model.create_session()
session.chat([{"role": "user", "content": "My name is Alice"}])
session.chat([{"role": "user", "content": "I live in New York"}])
# Save session state (includes KV cache)
session.save("alice_session.zse")
# Later: restore session
session = model.load_session("alice_session.zse")
response = session.chat([{"role": "user", "content": "What's my name?"}])
# → "Your name is Alice"
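
Conceptually, a saved session bundles the conversation history with the serialized cache state. A rough sketch using pickle (the on-disk .zse session format is ZSE's own; the fields below are illustrative):

```python
import os
import pickle
import tempfile

session_state = {
    "messages": [
        {"role": "user", "content": "My name is Alice"},
        {"role": "user", "content": "I live in New York"},
    ],
    # Placeholder for the serialized KV tensors a real session would hold
    "kv_cache": b"<kv tensors>",
}

path = os.path.join(tempfile.gettempdir(), "alice_session.pkl")
with open(path, "wb") as f:
    pickle.dump(session_state, f)

# Later: restore and continue where the conversation left off
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Because the KV cache rides along with the messages, the restored session answers "What's my name?" without reprocessing the whole history.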

Session Management API

python
from zllm_zse import ZSE, Session
model = ZSE("qwen-7b.zse")
# Create named session
session = model.create_session(name="user_123")
# List sessions
sessions = model.list_sessions()
print(sessions) # ['user_123', 'user_456', ...]
# Get session info
info = session.info()
print(info)
# {
# 'name': 'user_123',
# 'context_length': 1024,
# 'cache_size_bytes': 52428800,
# 'created_at': '2024-01-15T10:30:00Z'
# }
# Clear session
session.clear()
# Delete session
model.delete_session("user_123")

Memory Optimization

Paged Attention

Paged attention allocates KV cache in fixed-size blocks, reducing memory fragmentation:

bash
# Enable paged attention
zse serve model.zse --paged-attention
# Configure block size
zse serve model.zse --paged-attention --block-size 16

  • Reduces memory fragmentation
  • Enables dynamic batch sizes
  • More concurrent requests with same memory
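
Block-based allocation can be sketched with a simple free list (bookkeeping only; real paged attention indexes GPU memory through a per-sequence block table):

```python
class BlockAllocator:
    """Toy fixed-size block allocator in the spirit of paged attention."""

    def __init__(self, n_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(n_blocks))

    def alloc_for_tokens(self, n_tokens):
        # Round up to whole blocks; a sequence holds a list of block ids.
        needed = -(-n_tokens // self.block_size)
        if needed > len(self.free):
            raise MemoryError("KV cache exhausted")
        return [self.free.pop() for _ in range(needed)]

    def release(self, blocks):
        self.free.extend(blocks)

alloc = BlockAllocator(n_blocks=64)
seq_blocks = alloc.alloc_for_tokens(100)  # 100 tokens -> 7 blocks of 16
alloc.release(seq_blocks)                 # freed blocks serve new requests
```

Because sequences grow one block at a time instead of reserving a full max-context slab up front, the same memory admits more concurrent requests.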

Memory Tiers

ZSE can use CPU memory and disk as overflow for KV cache:

zse.yaml
kv_cache:
  tiers:
    - type: gpu
      size: 8GB
      priority: 1
    - type: cpu
      size: 32GB
      priority: 2
    - type: disk
      path: /tmp/zse_kv_cache
      size: 100GB
      priority: 3
  eviction: lru  # evict least-recently-used entries first

CPU and disk tiers add latency. Use GPU memory for latency-sensitive workloads.
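
The lru eviction policy can be sketched in a few lines (in a real tier, evicted entries would spill to the next tier rather than being dropped):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU eviction sketch, not the ZSE implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as recently used
            return self.entries[key]
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop least-recently-used

tier = LRUCache(capacity=2)
tier.put("sess_a", "kv_a")
tier.put("sess_b", "kv_b")
tier.get("sess_a")          # touch a, so b becomes least recently used
tier.put("sess_c", "kv_c")  # evicts sess_b
```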

Cache Statistics

python
from zllm_zse import ZSE
model = ZSE("qwen-7b.zse")
# Get cache statistics
stats = model.kv_cache_stats()
print(stats)
# {
# 'total_size': 8589934592, # 8 GB
# 'used_size': 2147483648, # 2 GB
# 'prompt_cache_hits': 1542,
# 'prompt_cache_misses': 89,
# 'hit_rate': 0.945,
# 'evictions': 23
# }
# Clear cache
model.clear_kv_cache()
# Clear specific session
model.clear_kv_cache(session="user_123")
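
The hit_rate field is simply hits over total lookups, which you can check against the numbers above:

```python
hits, misses = 1542, 89
hit_rate = hits / (hits + misses)
print(round(hit_rate, 3))  # → 0.945
```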