# zInfer
High-performance local inference for transformer models with optimized sampling.
## Overview
zInfer provides direct inference capabilities for both interactive chat and programmatic text generation.
- **~100 tok/s**: high throughput on consumer GPUs
- **Flash Attention**: memory-efficient attention
- **Speculative decoding**: 2-3x faster with draft models
- Optimized CUDA kernels for inference
- Flash Attention 2 support
- Speculative decoding with draft models
- Continuous batching for throughput
- Custom sampling strategies
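Speculative decoding, the source of the 2-3x figure above, works by letting a small draft model propose several tokens cheaply and having the large target model verify them in a single pass. The sketch below is a toy illustration with greedy verification and hypothetical `target`/`draft` callables (one token id per call); it is not zInfer's actual implementation, which verifies against full probability distributions.

```python
def speculative_decode(target, draft, prompt, n_draft=4, max_tokens=8):
    """Toy sketch of speculative decoding with greedy verification.

    `target` and `draft` are stand-in callables mapping a token-id
    context to the next token id (illustrative only).
    """
    seq = list(prompt)
    generated = 0
    while generated < max_tokens:
        # Draft model proposes a short continuation.
        proposed = []
        ctx = list(seq)
        for _ in range(n_draft):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        # Target model checks the proposals; in a real engine one
        # forward pass scores all positions, so accepted tokens are
        # nearly free.
        accepted = 0
        ctx = list(seq)
        for t in proposed:
            if target(ctx) == t:
                ctx.append(t)
                accepted += 1
            else:
                break
        seq.extend(proposed[:accepted])
        generated += accepted
        # On a mismatch, fall back to one token from the target model.
        if accepted < n_draft and generated < max_tokens:
            seq.append(target(seq))
            generated += 1
    return seq[len(prompt):len(prompt) + max_tokens]
```

When the draft model agrees often, most steps accept all `n_draft` tokens at the cost of one target pass, which is where the speedup comes from.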
## CLI Usage

### Interactive Chat
```bash
# Start interactive chat
zse chat qwen-7b.zse

# With system prompt
zse chat qwen-7b.zse --system "You are a helpful coding assistant"

# With initial prompt
zse chat qwen-7b.zse -p "Explain quantum computing"
```

Chat commands:
| Command | Description |
|---|---|
| `/clear` | Clear conversation history |
| `/system <prompt>` | Set system prompt |
| `/temp <value>` | Set temperature (0.0-2.0) |
| `/save <file>` | Save conversation to file |
| `/quit` | Exit chat |
### Text Completion
```bash
# Single completion
zse complete qwen-7b.zse -p "The quick brown fox"

# With parameters
zse complete qwen-7b.zse \
  -p "Write a poem about AI" \
  --max-tokens 200 \
  --temperature 0.8
```

## Python API
### Quick Inference
```python
from zllm_zse import ZSE

# Load model
model = ZSE("qwen-7b.zse")

# Chat completion
response = model.chat([
    {"role": "user", "content": "Hello!"}
])
print(response)

# Text completion
text = model.complete("The meaning of life is")
print(text)
```

### Streaming
```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# Stream chat response
for chunk in model.chat_stream([
    {"role": "user", "content": "Tell me a story"}
]):
    print(chunk, end="", flush=True)

# Stream completion
for token in model.complete_stream("Once upon a time"):
    print(token, end="", flush=True)
```

### Async API
```python
import asyncio

from zllm_zse import AsyncZSE

async def main():
    model = AsyncZSE("qwen-7b.zse")

    # Async chat
    response = await model.chat([
        {"role": "user", "content": "Hello!"}
    ])

    # Async streaming
    async for chunk in model.chat_stream([
        {"role": "user", "content": "Tell me a story"}
    ]):
        print(chunk, end="")

asyncio.run(main())
```

## Parameters
| Parameter | Default | Description |
|---|---|---|
| `temperature` | 0.7 | Sampling randomness (0.0-2.0) |
| `top_p` | 0.9 | Nucleus sampling threshold |
| `top_k` | 50 | Top-k sampling (0 = disabled) |
| `max_tokens` | 2048 | Maximum tokens to generate |
| `repetition_penalty` | 1.0 | Penalty for repeated tokens |
| `stop` | `[]` | Stop sequences |
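To make the sampling parameters concrete, here is a minimal pure-Python sketch of one common filtering order (temperature, then top-k, then top-p); the function name is hypothetical and zInfer's actual CUDA sampling kernel may differ in order and details.

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=50, top_p=0.9, seed=None):
    """Illustrative sampler: temperature scaling, then top-k,
    then top-p (nucleus) filtering, then a random draw."""
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [l / temperature for l in logits]

    # Numerically stable softmax over the scaled logits.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]

    # Top-k: keep only the k most probable tokens (0 disables).
    probs.sort(key=lambda p: p[1], reverse=True)
    if top_k > 0:
        probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass >= top_p.
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break

    # Renormalize the survivors and sample one token id.
    total = sum(p for _, p in kept)
    rng = random.Random(seed)
    r = rng.random() * total
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With a low `temperature` or a tight `top_p`, the draw collapses toward the single most likely token, which is why low temperature is recommended for code generation below.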
````python
# Custom parameters
response = model.chat(
    messages=[{"role": "user", "content": "Write code"}],
    temperature=0.2,  # Lower for code
    top_p=0.95,
    max_tokens=1000,
    stop=["```"]  # Stop at code block end
)
````

## Chat Templates
ZSE automatically detects and applies the correct chat template for each model. Override if needed:
```python
from zllm_zse import ZSE

# Use built-in template
model = ZSE("qwen-7b.zse")  # Auto-detects Qwen template

# Override template
model = ZSE("custom-model.zse", chat_template="chatml")

# Custom template string
model = ZSE("model.zse", chat_template="""
{%- for message in messages %}
{%- if message['role'] == 'user' %}
User: {{ message['content'] }}
{%- elif message['role'] == 'assistant' %}
Assistant: {{ message['content'] }}
{%- endif %}
{%- endfor %}
Assistant:""")
```

Built-in templates:
- `chatml`
- `llama`
- `mistral`
- `vicuna`
- `alpaca`
- `zephyr`
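As an illustration of what a built-in template produces, the sketch below renders messages in the ChatML format that the `chatml` template name refers to. The function name is hypothetical, and a real renderer handles more roles and edge cases.

```python
def apply_chatml(messages, add_generation_prompt=True):
    """Illustrative ChatML renderer: each message is wrapped in
    <|im_start|>role ... <|im_end|> markers, optionally followed by
    an open assistant turn for the model to complete."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    if add_generation_prompt:
        # Open an assistant turn so generation continues from here.
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)
```

Rendering `[{"role": "user", "content": "Hi"}]` yields the user turn followed by an open `<|im_start|>assistant` turn, which is the string the model actually sees as its prompt.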
## Batch Inference
Process multiple prompts efficiently:
```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# Batch completion
prompts = [
    "Translate to French: Hello",
    "Translate to French: Goodbye",
    "Translate to French: Thank you",
]

results = model.complete_batch(prompts)
for prompt, result in zip(prompts, results):
    print(f"{prompt} -> {result}")

# Batch chat
conversations = [
    [{"role": "user", "content": "What is 2+2?"}],
    [{"role": "user", "content": "What is 3+3?"}],
    [{"role": "user", "content": "What is 4+4?"}],
]

responses = model.chat_batch(conversations)
for response in responses:
    print(response)
```

Batch inference can be 2-4x faster than sequential inference due to GPU parallelism.
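The continuous batching mentioned in the feature list compounds this: finished sequences free their batch slot immediately and waiting requests join mid-flight, instead of the whole batch stalling on its longest member. The toy scheduler below counts decode steps under that policy; it is a sketch only (a hypothetical function, ignoring KV-cache management and arrival times).

```python
from collections import deque

def continuous_batching_steps(requests, batch_size):
    """Illustrative scheduler: each step decodes one token for every
    active request; finished requests are replaced immediately.
    `requests` is a list of (request_id, tokens_needed) pairs."""
    queue = deque(requests)
    active = {}   # request_id -> tokens still to generate
    steps = 0
    while queue or active:
        # Refill free slots from the waiting queue.
        while queue and len(active) < batch_size:
            rid, need = queue.popleft()
            active[rid] = need
        # One decode step produces one token per active request.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot freed mid-batch
    return steps
```

For requests needing 2, 4, and 2 tokens with a batch size of 2, this scheduler finishes in 4 steps, whereas static batching (wait for the whole batch, then start the next) would take 6.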