
# zInfer

High-performance local inference for transformer models with optimized sampling.

## Overview

zInfer provides direct inference capabilities for both interactive chat and programmatic text generation.

- **~100 tok/s**: high throughput on consumer GPUs
- **Flash Attention**: memory-efficient attention
- **Speculative decoding**: 2-3x faster with draft models

- Optimized CUDA kernels for inference
- Flash Attention 2 support
- Speculative decoding with draft models
- Continuous batching for throughput
- Custom sampling strategies
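Speculative decoding, for instance, lets a small draft model propose several tokens that the large target model then verifies in a single pass, so multiple tokens can be accepted per expensive forward pass. A toy greedy illustration of the propose/verify loop (the models here are stand-in functions, not zInfer internals; real implementations use probabilistic acceptance):

```python
def draft_next(ctx):
    # Stand-in draft model: cheaply guesses the next token id.
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Stand-in target model: the "correct" next token id.
    return (ctx[-1] + 1) % 10 if ctx[-1] != 5 else 0

def speculative_step(ctx, k=4):
    # 1. The draft model proposes k tokens autoregressively.
    proposed, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        proposed.append(t)
        tmp.append(t)
    # 2. The target model checks every proposed position (this is one
    #    batched forward pass in a real engine).
    accepted, tmp = [], list(ctx)
    for t in proposed:
        correct = target_next(tmp)
        if t != correct:
            accepted.append(correct)  # replace the first mismatch and stop
            break
        accepted.append(t)
        tmp.append(t)
    return ctx + accepted  # up to k tokens per target pass instead of 1

seq = [1]
for _ in range(3):
    seq = speculative_step(seq)
print(seq)  # [1, 2, 3, 4, 5, 0, 1, 2, 3, 4]
```

When the draft model agrees with the target, each step accepts up to `k` tokens; when it diverges, progress falls back to one corrected token, which is why draft/target agreement drives the 2-3x speedup figure.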

## CLI Usage

### Interactive Chat

```bash
# Start interactive chat
zse chat qwen-7b.zse

# With system prompt
zse chat qwen-7b.zse --system "You are a helpful coding assistant"

# With initial prompt
zse chat qwen-7b.zse -p "Explain quantum computing"
```

Chat commands:

| Command | Description |
| --- | --- |
| `/clear` | Clear conversation history |
| `/system <prompt>` | Set system prompt |
| `/temp <value>` | Set temperature (0.0-2.0) |
| `/save <file>` | Save conversation to file |
| `/quit` | Exit chat |

### Text Completion

```bash
# Single completion
zse complete qwen-7b.zse -p "The quick brown fox"

# With parameters
zse complete qwen-7b.zse \
  -p "Write a poem about AI" \
  --max-tokens 200 \
  --temperature 0.8
```

## Python API

### Quick Inference

```python
from zllm_zse import ZSE

# Load model
model = ZSE("qwen-7b.zse")

# Chat completion
response = model.chat([
    {"role": "user", "content": "Hello!"}
])
print(response)

# Text completion
text = model.complete("The meaning of life is")
print(text)
```

### Streaming

```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# Stream chat response
for chunk in model.chat_stream([
    {"role": "user", "content": "Tell me a story"}
]):
    print(chunk, end="", flush=True)

# Stream completion
for token in model.complete_stream("Once upon a time"):
    print(token, end="", flush=True)
```

### Async API

```python
import asyncio

from zllm_zse import AsyncZSE

async def main():
    model = AsyncZSE("qwen-7b.zse")

    # Async chat
    response = await model.chat([
        {"role": "user", "content": "Hello!"}
    ])

    # Async streaming
    async for chunk in model.chat_stream([
        {"role": "user", "content": "Tell me a story"}
    ]):
        print(chunk, end="")

asyncio.run(main())
```

## Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `temperature` | 0.7 | Sampling randomness (0.0-2.0) |
| `top_p` | 0.9 | Nucleus sampling threshold |
| `top_k` | 50 | Top-k sampling (0 = disabled) |
| `max_tokens` | 2048 | Maximum tokens to generate |
| `repetition_penalty` | 1.0 | Penalty for repeated tokens |
| `stop` | `[]` | Stop sequences |
```python
# Custom parameters
response = model.chat(
    messages=[{"role": "user", "content": "Write code"}],
    temperature=0.2,  # Lower for code
    top_p=0.95,
    max_tokens=1000,
    stop=["```"],  # Stop at code block end
)
```
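For intuition, `temperature`, `top_k`, and `top_p` compose in that order in the usual sampling pipeline. A minimal pure-Python sketch of that standard pipeline (an illustration of the general technique, not zInfer's actual kernel):

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    # 1. Temperature: scale logits before softmax; lower = more deterministic.
    scaled = [l / max(temperature, 1e-6) for l in logits]
    # 2. Softmax, sorted most-likely first as (prob, token_id) pairs.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = sorted(((p / total, i) for i, p in enumerate(exps)), reverse=True)
    # 3. Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        probs = probs[:top_k]
    # 4. Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        mass += p
        if mass >= top_p:
            break
    # 5. Renormalize over the survivors and draw one token id.
    mass = sum(p for p, _ in kept)
    r = rng.random() * mass
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]

# With temperature near 0 the argmax dominates:
print(sample_token([1.0, 5.0, 0.5], temperature=0.01))  # 1
```

This also shows why `temperature=0.2` is a good fit for code generation: it sharpens the distribution so the filtered nucleus collapses toward the most likely continuation.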

## Chat Templates

ZSE automatically detects and applies the correct chat template for each model. Override if needed:

```python
from zllm_zse import ZSE

# Use built-in template
model = ZSE("qwen-7b.zse")  # Auto-detects Qwen template

# Override template
model = ZSE("custom-model.zse", chat_template="chatml")

# Custom template string
model = ZSE("model.zse", chat_template="""
{%- for message in messages %}
{%- if message['role'] == 'user' %}
User: {{ message['content'] }}
{%- elif message['role'] == 'assistant' %}
Assistant: {{ message['content'] }}
{%- endif %}
{%- endfor %}
Assistant:""")

Built-in templates:

`chatml`, `llama`, `mistral`, `vicuna`, `alpaca`, `zephyr`
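To see what a template actually produces, the `chatml` format wraps each turn in `<|im_start|>`/`<|im_end|>` markers and leaves an open assistant turn for the model to complete. A minimal sketch of that rendering (normally handled internally by the engine; minor whitespace conventions vary between implementations):

```python
def render_chatml(messages):
    # Each message becomes <|im_start|>role\ncontent<|im_end|>,
    # followed by an open assistant turn for the model to fill in.
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(render_chatml([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi!"},
]))
```

Using the wrong template for a model degrades output badly, which is why the auto-detection above matters; only override it when you know the model's training format.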

## Batch Inference

Process multiple prompts efficiently:

```python
from zllm_zse import ZSE

model = ZSE("qwen-7b.zse")

# Batch completion
prompts = [
    "Translate to French: Hello",
    "Translate to French: Goodbye",
    "Translate to French: Thank you",
]
results = model.complete_batch(prompts)
for prompt, result in zip(prompts, results):
    print(f"{prompt} -> {result}")

# Batch chat
conversations = [
    [{"role": "user", "content": "What is 2+2?"}],
    [{"role": "user", "content": "What is 3+3?"}],
    [{"role": "user", "content": "What is 4+4?"}],
]
responses = model.chat_batch(conversations)
for response in responses:
    print(response)
```

Batch inference can be 2-4x faster than sequential inference due to GPU parallelism.