voice-optimal-RAG · davidbmar.com

What it is

A lightweight, single-container Retrieval-Augmented Generation (RAG) backend designed to feed context into voice assistants. It ingests PDF, Markdown, TXT, DOCX, and HTML files, splits them into token-aware chunks, generates 768-dimensional vectors using Nomic Embed Text v1.5, and stores them in an embedded LanceDB instance. It exposes a REST API and a simple web UI for uploading documents and performing semantic similarity searches.

Features

Single-container deployment with pre-loaded embedding models
Supports PDF, Markdown, TXT, DOCX, and HTML file ingestion
Semantic search using Nomic Embed Text v1.5 (768-dim vectors)
Embedded LanceDB vector store with no external database server
Recursive text splitting with tiktoken-aware chunking
Built-in Web UI for drag-and-drop upload and query testing
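
The recursive, token-aware splitting listed above can be sketched roughly as follows. This is an illustrative sketch, not the project's chunker.py: the names `split_recursive`, `count_tokens`, and `SEPARATORS` are hypothetical, and the whitespace word counter is a stand-in so the example runs without tiktoken installed.

```python
from typing import Callable, List, Sequence

# Separators tried in order, from coarse (paragraphs) to fine (words).
# Hypothetical choice; the real chunker may use a different list.
SEPARATORS = ["\n\n", "\n", " "]


def count_tokens(text: str) -> int:
    """Stand-in token counter: number of whitespace-delimited words.

    With tiktoken installed, this could be replaced by
    len(tiktoken.get_encoding("cl100k_base").encode(text)).
    """
    return len(text.split())


def split_recursive(
    text: str,
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
    separators: Sequence[str] = SEPARATORS,
) -> List[str]:
    """Split text into chunks, preferring the coarsest boundary that fits."""
    text = text.strip()
    if not text:
        return []
    # Small enough already, or nothing finer left to split on.
    # In the second case the chunk may exceed max_tokens.
    if counter(text) <= max_tokens or not separators:
        return [text]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if counter(candidate) <= max_tokens:
            current = candidate  # greedily merge neighbours that still fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if counter(piece) <= max_tokens:
            current = piece
        else:
            # Piece is too large on its own: retry with finer separators.
            chunks.extend(split_recursive(piece, max_tokens, counter, finer))
    if current:
        chunks.append(current)
    return chunks
```

Passing a tiktoken-based `counter` keeps the same splitting logic while measuring chunk size in model tokens rather than words.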

Quickstart

docker build -t rag-service .
docker run -d --name rag-service -p 8100:8100 -v rag-data:/data rag-service
open http://localhost:8100
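
Once the container is running, the service can also be queried from code. The sketch below uses only the Python standard library and assumes that `/query` accepts a JSON body with `query` and `top_k` fields, as the query flow suggests. The response schema is not documented here, so the parsed JSON is returned as-is. The helper names `build_query_request` and `run_query` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8100"  # port published by the docker run command


def build_query_request(
    query: str, top_k: int = 5, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build a POST /query request (JSON body is an assumption)."""
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: str, top_k: int = 5, base_url: str = BASE_URL, timeout: float = 10.0):
    """Send the request and return the parsed JSON response unchanged."""
    request = build_query_request(query, top_k, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `run_query("How do I reset the device?", top_k=3)` would return whatever JSON the service sends back for the three closest chunks.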

Architecture

flowchart TD
    Client[Client / Voice Assistant] -->|HTTP POST /upload| API[FastAPI App]
    Client -->|HTTP POST /query| API
    API -->|Orchestrate| Pipeline[Document Pipeline]
    Pipeline -->|Parse| Parsers[Parsers: PyMuPDF/Text]
    Parsers -->|Raw Text| Chunker[Chunker: tiktoken]
    Chunker -->|Text Chunks| Embedder[Embedder: SentenceTransformers]
    Embedder -->|Vectors| Store[LanceDB Vector Store]
    Store -->|Similarity Results| API
    API -->|JSON Response| Client
    subgraph Data Persistence
    Store -->|/data/lancedb| Volume[Docker Volume]
    end

How it's built

Built with Python: FastAPI for the web server, sentence-transformers for embedding generation, and LanceDB for vector storage. PyMuPDF extracts text from PDFs, and tiktoken counts tokens so the recursive splitter can keep each chunk within its token budget. The entire stack ships as a single Docker image with the embedding model pre-downloaded, so the container starts quickly and needs no model download or external database server at runtime.
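
Nomic Embed Text v1.5 expects a task prefix on every input, with different prefixes for stored documents and for search queries. Whether and where this project applies them is an assumption. The sketch below shows one way it could be done, with hypothetical helper names (`prepare_documents`, `prepare_query`, `embed`). The `embed` function imports sentence-transformers lazily and is only needed when real vectors are wanted.

```python
from typing import List

# Task prefixes documented for Nomic Embed Text v1.5.
DOCUMENT_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "


def prepare_documents(chunks: List[str]) -> List[str]:
    """Prefix chunks that will be stored in the vector index."""
    return [DOCUMENT_PREFIX + chunk for chunk in chunks]


def prepare_query(query: str) -> str:
    """Prefix a user query before embedding it for search."""
    return QUERY_PREFIX + query


def embed(texts: List[str]):
    """Embed already-prefixed texts into 768-dimensional vectors.

    Requires sentence-transformers and the model weights. The model name
    and trust_remote_code flag follow the public model card; how the
    project itself loads the model is not shown in this document.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
    return model.encode(texts)
```

Using separate prefixes for documents and queries is what makes the search asymmetric: short questions and longer passages are embedded with different instructions but compared in the same vector space.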

How it runs

sequenceDiagram
    participant C as Client
    participant A as FastAPI App
    participant P as Document Pipeline
    participant E as Embedder
    participant V as LanceDB

    Note over C, V: Ingestion Flow
    C->>A: POST /upload (files)
    A->>P: ingest_file(filepath)
    P->>P: Parse file (parsers.py)
    P->>P: Chunk text (chunker.py)
    P->>E: embed_batch(chunks)
    E->>E: Model.encode
    E-->>P: Vectors
    P->>V: insert_chunks(vectors)
    V-->>P: Confirm Storage
    P-->>A: Document Info
    A-->>C: JSON Response

    Note over C, V: Query Flow
    C->>A: POST /query {query, top_k}
    A->>E: embed_text(query)
    E->>E: Model.encode
    E-->>A: Query Vector
    A->>V: search(vector, top_k)
    V-->>A: Ranked Results
    A-->>C: JSON Results with Scores
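
The ranking step at the end of the query flow can be illustrated with a small in-memory version. In the real service LanceDB performs this search internally and its distance metric is configurable, so the cosine similarity used here is an assumption. The function names and the `(text, vector)` row layout are also hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query_vector: Sequence[float],
    rows: List[Tuple[str, Sequence[float]]],
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the top_k (score, text) pairs, highest similarity first."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

A brute-force scan like this is fine for a handful of chunks; a vector store is what keeps the same operation fast as the collection grows.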

How to apply & reuse

Deploy as a sidecar or standalone service for applications requiring local, private semantic search over documentation. Ideal for voice assistant backends where low-latency context retrieval is needed without relying on external cloud vector databases. Use the provided GitHub indexing scripts to automatically keep technical documentation up-to-date.

At a glance

Capabilities: File Ingestion, Semantic Search, Vector Embedding, Document Management, Health Monitoring

Components: app.py, document_pipeline.py, parsers.py, chunker.py, embedder.py, vector_store.py, static/index.html

Tech: Python, FastAPI, LanceDB, SentenceTransformers, PyMuPDF, Tiktoken, Docker

Depends on: sentence-transformers, lancedb, fastapi, uvicorn, pymupdf, tiktoken, python-multipart

Integrates with: Voice Assistants, GitHub Repositories, LLM Context Windows

Patterns: Retrieval-Augmented Generation, Embedded Database, Single Container Service, Asymmetric Semantic Search

Reuse tags: rag, vector-search, lancedb, fastapi, docker, nomic-embed, self-hosted

⚠ Needs attention

Unmerged branch: dependabot/pip/pip-14c377a4fb is 1 commit ahead of the default branch
Open PR #1: Bump python-multipart from 0.0.20 to 0.0.27 in the pip group across 1 directory