RAG Module

Retrieval-Augmented Generation for grounding LLM responses in your documents.

Overview

The RAG (Retrieval-Augmented Generation) module allows you to upload documents and automatically inject relevant context into your LLM conversations. This grounds the model's responses in your data, reducing hallucinations and enabling domain-specific knowledge.


  • Support for PDF, TXT, and Markdown files
  • Smart text chunking with configurable overlap
  • TF-IDF or sentence-transformers embeddings
  • SQLite + NumPy vector storage
  • Semantic search with top-k retrieval
  • Source citations in results
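The last three bullets (NumPy vector storage, semantic search, top-k retrieval) boil down to a cosine-similarity search over an embedding matrix. Here is a minimal sketch, assuming embeddings are already computed; `top_k_search` and the toy vectors are illustrative, not the module's actual API:

```python
import numpy as np

def top_k_search(query_vec, doc_matrix, k=5):
    """Return (indices, scores) of the k rows most similar to the query."""
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    idx = np.argsort(scores)[::-1][:k]  # highest similarity first
    return idx, scores[idx]

# Toy 4-dimensional "embeddings" for three chunks.
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.5, 0.5, 0.0, 0.0]])
idx, scores = top_k_search(np.array([1.0, 0.1, 0.0, 0.0]), docs, k=2)
```

In a setup like the SQLite + NumPy storage listed above, the chunk vectors would be loaded from the database into a matrix like `docs` before scoring.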

How It Works

```text
RAG Pipeline:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Document   │ ──▶ │   Chunker   │ ──▶ │  Embedder   │
│   Upload    │     │   (split)   │     │ (vectorize) │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
┌─────────────┐     ┌─────────────┐     ┌──────▼──────┐
│  Response   │ ◀── │     LLM     │ ◀── │   Vector    │
│  + Source   │     │ (generate)  │     │    Store    │
└─────────────┘     └─────────────┘     └─────────────┘
```

At query time:

1. Query is embedded
2. Similar chunks retrieved from vector store
3. Context injected into LLM prompt
4. Response includes source citations
The chunker uses smart text splitting that respects paragraph boundaries and includes overlap to maintain context across chunks.
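A minimal sketch of that splitting strategy, using character-based sizes; the function name `chunk_text` and its default parameters are illustrative, not the chunker's real API:

```python
def chunk_text(text, max_chars=500, overlap=50):
    """Split text on paragraph boundaries, carrying `overlap` characters
    from the end of each chunk into the start of the next."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            # Overlap: the tail of the previous chunk seeds the next one.
            current = current[-overlap:] + "\n\n" + para
        else:
            current = (current + "\n\n" + para) if current else para
    if current:
        chunks.append(current)
    return chunks
```

Because each chunk begins with the tail of its predecessor, context that straddles a chunk boundary is still retrievable from at least one chunk.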

Document Upload

Upload File

```bash
# Upload a PDF document
curl -X POST http://localhost:8000/api/rag/documents/upload \
  -F "file=@knowledge_base.pdf"

# Upload a text file
curl -X POST http://localhost:8000/api/rag/documents/upload \
  -F "file=@documentation.txt"

# Upload markdown
curl -X POST http://localhost:8000/api/rag/documents/upload \
  -F "file=@readme.md"
```

Upload Raw Content

```bash
# Add a document by content
curl -X POST http://localhost:8000/api/rag/documents \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Your document content here...",
    "metadata": {
      "title": "Company Policies",
      "source": "HR Department"
    }
  }'
```

Python API

```python
import requests

# Upload a file
with open("document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/rag/documents/upload",
        files={"file": f},
    )
doc_id = response.json()["id"]
print(f"Document uploaded: {doc_id}")

# Add content directly
response = requests.post(
    "http://localhost:8000/api/rag/documents",
    json={
        "content": "ZSE is an ultra memory-efficient LLM inference engine...",
        "metadata": {"title": "ZSE Overview"},
    },
)
```

List Documents

```bash
# List all documents
curl http://localhost:8000/api/rag/documents

# Response:
# {
#   "documents": [
#     {"id": "abc123", "title": "Company Policies", "chunks": 15},
#     {"id": "def456", "title": "Product Manual", "chunks": 42}
#   ]
# }
```

Delete Document

```bash
# Delete a document
curl -X DELETE http://localhost:8000/api/rag/documents/abc123
```

Searching Documents

```bash
# Search for relevant content
curl -X POST http://localhost:8000/api/rag/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the return policy?",
    "top_k": 5
  }'

# Response:
# {
#   "results": [
#     {
#       "content": "Returns are accepted within 30 days...",
#       "score": 0.89,
#       "document_id": "abc123",
#       "metadata": {"title": "Company Policies", "chunk": 3}
#     },
#     ...
#   ]
# }
```

Get Context for Chat

```bash
# Get context formatted for chat injection
curl -X POST http://localhost:8000/api/rag/context \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How do I configure the server?",
    "top_k": 3,
    "include_sources": true
  }'

# Response:
# {
#   "context": "Based on the documentation:\n\n1. Edit config.yaml...",
#   "sources": [
#     {"title": "Configuration Guide", "relevance": 0.92},
#     {"title": "Quick Start", "relevance": 0.78}
#   ]
# }
```
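Conceptually, the context endpoint stitches the top-scoring chunks into a single prompt-ready block with citations. A hypothetical sketch of that assembly; `build_context`, the numbering, and the citation format are assumptions, not the server's exact output:

```python
def build_context(results, include_sources=True):
    """Join retrieved chunks into one context block, optionally citing sources.

    `results` uses the same shape as the /api/rag/search response:
    dicts with "content", "score", and "metadata" keys.
    """
    lines = [f"[{i + 1}] {r['content']}" for i, r in enumerate(results)]
    context = "\n\n".join(lines)
    if include_sources:
        sources = ", ".join(
            f"{r['metadata']['title']} ({r['score']:.2f})" for r in results
        )
        context += f"\n\nSources: {sources}"
    return context
```

The resulting string can be prepended to the user's question, or placed in a system message, before calling the chat endpoint.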

Chat Integration

The RAG module integrates seamlessly with the chat API to automatically inject relevant context:

RAG-Enhanced Chat

```python
import requests

# Chat with RAG context
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen-7b",
        "messages": [
            {"role": "user", "content": "What's our refund policy?"}
        ],
        "rag": {
            "enabled": True,
            "top_k": 3
        }
    },
)
result = response.json()
print(result["choices"][0]["message"]["content"])
# The response will be grounded in your uploaded documents
```

Playground Integration

The ZSE playground at /chat includes RAG controls in the sidebar. Upload documents and toggle RAG to see context-aware responses.

For best results, upload documents that are specific to your use case. The more relevant your documents, the better the RAG context.

API Reference

| Endpoint                  | Method | Description             |
|---------------------------|--------|-------------------------|
| /api/rag/documents        | POST   | Add document by content |
| /api/rag/documents/upload | POST   | Upload document file    |
| /api/rag/documents        | GET    | List all documents      |
| /api/rag/documents/{id}   | DELETE | Delete document         |
| /api/rag/search           | POST   | Search documents        |
| /api/rag/context          | POST   | Get context for chat    |
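These endpoints are easy to wrap in a thin client. A hypothetical sketch using `requests`; `RAGClient` and its method names are illustrative, not a shipped SDK:

```python
import requests

class RAGClient:
    """Minimal wrapper around the RAG endpoints listed above."""

    def __init__(self, base_url="http://localhost:8000"):
        self.base = base_url.rstrip("/")

    def _url(self, path):
        return f"{self.base}/api/rag{path}"

    def add_document(self, content, metadata=None):
        return requests.post(
            self._url("/documents"),
            json={"content": content, "metadata": metadata or {}},
        ).json()

    def search(self, query, top_k=5):
        return requests.post(
            self._url("/search"), json={"query": query, "top_k": top_k}
        ).json()

    def delete_document(self, doc_id):
        return requests.delete(self._url(f"/documents/{doc_id}"))
```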