RAG Module

Retrieval-Augmented Generation for grounding LLM responses in your documents.

Overview

The RAG (Retrieval-Augmented Generation) module allows you to upload documents and automatically inject relevant context into your LLM conversations. This grounds the model's responses in your data, reducing hallucinations and enabling domain-specific knowledge.


  • Support for PDF, TXT, and Markdown files
  • Smart text chunking with configurable overlap
  • TF-IDF or sentence-transformers embeddings
  • SQLite + NumPy vector storage
  • Semantic search with top-k retrieval
  • Source citations in results
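The last three bullets (NumPy vector storage, semantic search, top-k retrieval) boil down to a cosine-similarity search over an embedding matrix. Here is a minimal sketch, assuming embeddings are already computed; `top_k_search` and the toy vectors are illustrative, not the module's actual API:

```python
import numpy as np

def top_k_search(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows most similar to the query."""
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    idx = np.argsort(scores)[::-1][:k]  # highest similarity first
    return idx, scores[idx]

# Toy 4-dimensional "embeddings" for three chunks.
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.5, 0.5, 0.0, 0.0]])
idx, scores = top_k_search(np.array([1.0, 0.1, 0.0, 0.0]), docs, k=2)
```

In a setup like the SQLite + NumPy storage listed above, the chunk vectors would be loaded from the database into a matrix like `docs` before scoring.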

How It Works

```text
RAG Pipeline:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Document   │ ──▶ │   Chunker   │ ──▶ │  Embedder   │
│   Upload    │     │   (split)   │     │ (vectorize) │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
┌─────────────┐     ┌─────────────┐     ┌──────▼──────┐
│  Response   │ ◀── │     LLM     │ ◀── │   Vector    │
│  + Source   │     │ (generate)  │     │    Store    │
└─────────────┘     └─────────────┘     └─────────────┘
```

At query time:

1. Query is embedded
2. Similar chunks retrieved from vector store
3. Context injected into LLM prompt
4. Response includes source citations
The chunker uses smart text splitting that respects paragraph boundaries and includes overlap to maintain context across chunks.
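A minimal sketch of that splitting strategy, using character-based sizes; the function name `chunk_text` and its default parameters are illustrative, not the chunker's real API:

```python
def chunk_text(text, max_chars=500, overlap=50):
    """Split text on paragraph boundaries, carrying `overlap` characters
    from the end of each chunk into the start of the next."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            # Overlap: the tail of the previous chunk seeds the next one.
            current = current[-overlap:] + "\n\n" + para
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```

Because each chunk begins with the tail of its predecessor, context that straddles a chunk boundary is still retrievable from at least one chunk.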

Document Upload

Upload File

```bash
# Upload a PDF document
curl -X POST http://localhost:8000/api/rag/documents/upload \
  -F "file=@knowledge_base.pdf"

# Upload a text file
curl -X POST http://localhost:8000/api/rag/documents/upload \
  -F "file=@documentation.txt"

# Upload markdown
curl -X POST http://localhost:8000/api/rag/documents/upload \
  -F "file=@readme.md"
```

Upload Raw Content

```bash
# Add a document by content
curl -X POST http://localhost:8000/api/rag/documents \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Your document content here...",
    "metadata": {
      "title": "Company Policies",
      "source": "HR Department"
    }
  }'
```

Python API

```python
import requests

# Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/rag/documents/upload",
        files={"file": f},
    )
doc_id = response.json()["id"]
print(f"Document uploaded: {doc_id}")

# Add content directly
response = requests.post(
    "http://localhost:8000/api/rag/documents",
    json={
        "content": "ZSE is an ultra memory-efficient LLM inference engine...",
        "metadata": {"title": "ZSE Overview"},
    },
)
```

List Documents

```bash
# List all documents
curl http://localhost:8000/api/rag/documents

# Response:
# {
#   "documents": [
#     {"id": "abc123", "title": "Company Policies", "chunks": 15},
#     {"id": "def456", "title": "Product Manual", "chunks": 42}
#   ]
# }
```

Delete Document

```bash
# Delete a document
curl -X DELETE http://localhost:8000/api/rag/documents/abc123
```

Searching Documents

```bash
# Search for relevant content
curl -X POST http://localhost:8000/api/rag/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the return policy?",
    "top_k": 5
  }'

# Response:
# {
#   "results": [
#     {
#       "content": "Returns are accepted within 30 days...",
#       "score": 0.89,
#       "document_id": "abc123",
#       "metadata": {"title": "Company Policies", "chunk": 3}
#     },
#     ...
#   ]
# }
```

Get Context for Chat

```bash
# Get context formatted for chat injection
curl -X POST http://localhost:8000/api/rag/context \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I configure the server?",
    "top_k": 3,
    "include_sources": true
  }'

# Response:
# {
#   "context": "Based on the documentation:\n\n1. Edit config.yaml...",
#   "sources": [
#     {"title": "Configuration Guide", "relevance": 0.92},
#     {"title": "Quick Start", "relevance": 0.78}
#   ]
# }
```
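Conceptually, the context endpoint stitches the top-scoring chunks into a single prompt-ready block with citations. A hypothetical sketch of that assembly; `build_context`, the numbering, and the citation format are assumptions, not the server's exact output:

```python
def build_context(results, include_sources=True):
    """Join retrieved chunks into one context block, optionally citing sources.

    `results` uses the same shape as the /api/rag/search response:
    dicts with "content", "score", and "metadata" keys.
    """
    lines = [f"[{i + 1}] {r['content']}" for i, r in enumerate(results)]
    context = "\n\n".join(lines)
    if include_sources:
        sources = ", ".join(
            f"{r['metadata']['title']} ({r['score']:.2f})" for r in results
        )
        context += f"\n\nSources: {sources}"
    return context
```

The resulting string can be prepended to the user's question, or placed in a system message, before calling the chat endpoint.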

Chat Integration

The RAG module integrates seamlessly with the chat API to automatically inject relevant context:

RAG-Enhanced Chat

```python
import requests

# Chat with RAG context
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen-7b",
        "messages": [
            {"role": "user", "content": "What's our refund policy?"}
        ],
        "rag": {
            "enabled": True,
            "top_k": 3
        }
    },
)
result = response.json()
print(result["choices"][0]["message"]["content"])
# The response will be grounded in your uploaded documents
```

Playground Integration

The ZSE playground at /chat includes RAG controls in the sidebar. Upload documents and toggle RAG to see context-aware responses.

For best results, upload documents that are specific to your use case. The more relevant your documents, the better the RAG context.

API Reference

| Endpoint                  | Method | Description             |
|---------------------------|--------|-------------------------|
| /api/rag/documents        | POST   | Add document by content |
| /api/rag/documents/upload | POST   | Upload document file    |
| /api/rag/documents        | GET    | List all documents      |
| /api/rag/documents/{id}   | DELETE | Delete document         |
| /api/rag/search           | POST   | Search documents        |
| /api/rag/context          | POST   | Get context for chat    |
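These endpoints are easy to wrap in a thin client. A hypothetical sketch using `requests`; `RAGClient` and its method names are illustrative, not a shipped SDK:

```python
import requests

class RAGClient:
    """Minimal wrapper around the RAG endpoints listed above."""

    def __init__(self, base_url="http://localhost:8000"):
        self.base = base_url.rstrip("/")

    def _url(self, path):
        return f"{self.base}/api/rag{path}"

    def add_document(self, content, metadata=None):
        return requests.post(
            self._url("/documents"),
            json={"content": content, "metadata": metadata or {}},
        ).json()

    def search(self, query, top_k=5):
        return requests.post(
            self._url("/search"), json={"query": query, "top_k": top_k}
        ).json()

    def delete_document(self, doc_id):
        return requests.delete(self._url(f"/documents/{doc_id}"))
```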