
# zInfer

High-performance local inference for transformer models with optimized sampling.

## Overview

zInfer provides direct inference capabilities for both interactive chat and programmatic text generation.

- **~100 tok/s**: high throughput on consumer GPUs
- **Flash Attention**: memory-efficient attention
- **Speculative decoding**: 2-3x faster with draft models

- Optimized CUDA kernels for inference
- Flash Attention 2 support
- Speculative decoding with draft models
- Continuous batching for throughput
- Custom sampling strategies
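Speculative decoding, for instance, lets a small draft model propose several tokens that the large target model then verifies in a single pass, so multiple tokens can be accepted per expensive forward pass. A toy greedy illustration of the propose/verify loop (the models here are stand-in functions, not zInfer internals; real implementations use probabilistic acceptance):

```python
def draft_next(ctx):
    # Stand-in draft model: cheaply guesses the next token id.
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Stand-in target model: the "correct" next token id.
    return (ctx[-1] + 1) % 10 if ctx[-1] != 5 else 0

def speculative_step(ctx, k=4):
    # 1. The draft model proposes k tokens autoregressively.
    proposed, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        proposed.append(t)
        tmp.append(t)
    # 2. The target model checks every proposed position (this is one
    #    batched forward pass in a real engine).
    accepted, tmp = [], list(ctx)
    for t in proposed:
        correct = target_next(tmp)
        if t != correct:
            accepted.append(correct)  # replace the first mismatch and stop
            break
        accepted.append(t)
        tmp.append(t)
    return ctx + accepted  # up to k tokens per target pass instead of 1

seq = [1]
for _ in range(3):
    seq = speculative_step(seq)
print(seq)  # [1, 2, 3, 4, 5, 0, 1, 2, 3, 4]
```

When the draft model agrees with the target, each step accepts up to `k` tokens; when it diverges, progress falls back to one corrected token, which is why draft/target agreement drives the 2-3x speedup figure.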

## CLI Usage

### Interactive Chat

```bash
# Start interactive chat
zse chat qwen-7b.zse

# With system prompt
zse chat qwen-7b.zse --system "You are a helpful coding assistant"

# With initial prompt
zse chat qwen-7b.zse -p "Explain quantum computing"
```

Chat commands:

| Command | Description |
| --- | --- |
| `/clear` | Clear conversation history |
| `/system <prompt>` | Set system prompt |
| `/temp <value>` | Set temperature (0.0-2.0) |
| `/save <file>` | Save conversation to file |
| `/quit` | Exit chat |

### Text Completion

```bash
# Single completion
zse complete qwen-7b.zse -p "The quick brown fox"

# With parameters
zse complete qwen-7b.zse \
  -p "Write a poem about AI" \
  --max-tokens 200 \
  --temperature 0.8
```

## Python API

### Quick Inference

```python
from zllm_zse import ZSE

# Load model
model = ZSE("qwen-7b.zse")

# Chat completion
response = model.chat([
    {"role": "user", "content": "Hello!"}
])
print(response)

# Text completion
text = model.complete("The meaning of life is")
print(text)
```

### Streaming

```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# Stream chat response
for chunk in model.chat_stream([
    {"role": "user", "content": "Tell me a story"}
]):
    print(chunk, end="", flush=True)

# Stream completion
for token in model.complete_stream("Once upon a time"):
    print(token, end="", flush=True)
```

### Async API

```python
import asyncio

from zllm_zse import AsyncZSE

async def main():
    model = AsyncZSE("qwen-7b.zse")

    # Async chat
    response = await model.chat([
        {"role": "user", "content": "Hello!"}
    ])

    # Async streaming
    async for chunk in model.chat_stream([
        {"role": "user", "content": "Tell me a story"}
    ]):
        print(chunk, end="")

asyncio.run(main())
```

## Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `temperature` | 0.7 | Sampling randomness (0.0-2.0) |
| `top_p` | 0.9 | Nucleus sampling threshold |
| `top_k` | 50 | Top-k sampling (0 = disabled) |
| `max_tokens` | 2048 | Maximum tokens to generate |
| `repetition_penalty` | 1.0 | Penalty for repeated tokens |
| `stop` | `[]` | Stop sequences |
```python
# Custom parameters
response = model.chat(
    messages=[{"role": "user", "content": "Write code"}],
    temperature=0.2,  # Lower for code
    top_p=0.95,
    max_tokens=1000,
    stop=["```"],  # Stop at code block end
)
```
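For intuition, `temperature`, `top_k`, and `top_p` compose in that order in the usual sampling pipeline. A minimal pure-Python sketch of that standard pipeline (an illustration of the general technique, not zInfer's actual kernel):

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    # 1. Temperature: scale logits before softmax; lower = more deterministic.
    scaled = [l / max(temperature, 1e-6) for l in logits]
    # 2. Softmax, sorted most-likely first as (prob, token_id) pairs.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = sorted(((p / total, i) for i, p in enumerate(exps)), reverse=True)
    # 3. Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        probs = probs[:top_k]
    # 4. Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        mass += p
        if mass >= top_p:
            break
    # 5. Renormalize over the survivors and draw one token id.
    mass = sum(p for p, _ in kept)
    r = rng.random() * mass
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]

# With temperature near 0 the argmax dominates:
print(sample_token([1.0, 5.0, 0.5], temperature=0.01))  # 1
```

This also shows why `temperature=0.2` is a good fit for code generation: it sharpens the distribution so the filtered nucleus collapses toward the most likely continuation.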

## Chat Templates

ZSE automatically detects and applies the correct chat template for each model. Override if needed:

```python
from zllm_zse import ZSE

# Use built-in template
model = ZSE("qwen-7b.zse")  # Auto-detects Qwen template

# Override template
model = ZSE("custom-model.zse", chat_template="chatml")

# Custom template string
model = ZSE("model.zse", chat_template="""
{%- for message in messages %}
{%- if message['role'] == 'user' %}
User: {{ message['content'] }}
{%- elif message['role'] == 'assistant' %}
Assistant: {{ message['content'] }}
{%- endif %}
{%- endfor %}
Assistant:""")

Built-in templates:

`chatml`, `llama`, `mistral`, `vicuna`, `alpaca`, `zephyr`
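To see what a template actually produces, the `chatml` format wraps each turn in `<|im_start|>`/`<|im_end|>` markers and leaves an open assistant turn for the model to complete. A minimal sketch of that rendering (normally handled internally by the engine; minor whitespace conventions vary between implementations):

```python
def render_chatml(messages):
    # Each message becomes <|im_start|>role\ncontent<|im_end|>,
    # followed by an open assistant turn for the model to fill in.
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(render_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
]))
```

Using the wrong template for a model degrades output badly, which is why the auto-detection above matters; only override it when you know the model's training format.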

## Batch Inference

Process multiple prompts efficiently:

```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# Batch completion
prompts = [
    "Translate to French: Hello",
    "Translate to French: Goodbye",
    "Translate to French: Thank you",
]
results = model.complete_batch(prompts)
for prompt, result in zip(prompts, results):
    print(f"{prompt} -> {result}")

# Batch chat
conversations = [
    [{"role": "user", "content": "What is 2+2?"}],
    [{"role": "user", "content": "What is 3+3?"}],
    [{"role": "user", "content": "What is 4+4?"}],
]
responses = model.chat_batch(conversations)
for response in responses:
    print(response)
```

Batch inference can be 2-4x faster than sequential inference due to GPU parallelism.