rag-document-chat · davidbmar.com

What it is

This project implements a complete retrieval-augmented generation (RAG) pipeline: it ingests PDF documents, processes them with hierarchical semantic grouping and compression, stores embeddings in ChromaDB, and serves a chat interface through Streamlit backed by a FastAPI server. S3 storage for documents is optional, and OpenAI provides both the embeddings and the LLM responses.

Features

Hierarchical document processing with logical sentence grouping and 10:1 compression summaries
PDF text extraction with page number and section title metadata tracking
Vector storage and retrieval using ChromaDB with enhanced chunk metadata
FastAPI backend for document upload and query handling with CORS support
Streamlit frontend for interactive document chatting
Optional AWS S3 integration for persistent document storage
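
To make the metadata tracking concrete, here is a minimal sketch of how chunk records might be shaped before they are stored. The `Chunk` class, its field names, and `to_collection_payload` are hypothetical and not taken from the project's code; only the keyword names `ids`, `documents`, and `metadatas` follow the arguments that ChromaDB's `Collection.add` accepts.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record; the project's real fields may differ."""
    text: str
    page: int      # page number the text was extracted from
    section: str   # section title detected for that page
    level: str     # "group" for a sentence group, "summary" for its summary


def to_collection_payload(doc_id: str, chunks: list[Chunk]) -> dict:
    """Build the parallel lists a vector store expects: ids, documents, metadatas."""
    return {
        "ids": [f"{doc_id}-{i}" for i in range(len(chunks))],
        "documents": [c.text for c in chunks],
        "metadatas": [
            {"source": doc_id, "page": c.page, "section": c.section, "level": c.level}
            for c in chunks
        ],
    }
```

With a ChromaDB collection in hand, a payload like this could be passed as `collection.add(**payload)`, optionally with an `embeddings` list added. Keeping page and section in the metadata is what lets an answer cite where a chunk came from.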

Quickstart

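# Install dependencies and the NLTK data used for tokenization
pip install -r requirements.txt
python -m nltk.downloader punkt_tab punkt stopwords
# Terminal 1: start the FastAPI backend (runs in the foreground)
uvicorn app:app --reload
# Terminal 2: start the Streamlit frontend
streamlit run streamlit_app.py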

Architecture

flowchart TD
    User[User] -->|Uploads PDF| Streamlit[Streamlit Frontend]
    Streamlit -->|POST /upload| FastAPI[FastAPI Backend]
    FastAPI -->|Extract Text| PyPDF2[PyPDF2]
    PyPDF2 -->|Raw Text| HierarchicalProc[HierarchicalProcessor]
    HierarchicalProc -->|Logical Groups & Summaries| Embedder[OpenAI Embeddings]
    Embedder -->|Vectors| ChromaDB[(ChromaDB)]
    User -->|Asks Question| Streamlit
    Streamlit -->|POST /query| FastAPI
    FastAPI -->|Search| ChromaDB
    ChromaDB -->|Context Chunks| FastAPI
    FastAPI -->|Prompt + Context| OpenAI[OpenAI LLM]
    OpenAI -->|Answer| FastAPI
    FastAPI -->|Response| Streamlit

How it's built

The backend is built with FastAPI to handle document uploads and query processing. It uses PyPDF2 for text extraction, NLTK for tokenization, and a custom HierarchicalProcessor to group sentences into logical units and generate compressed summaries (10:1 ratio). Embeddings are stored in ChromaDB (local or HTTP mode). The frontend is a Streamlit application that interacts with the FastAPI backend. AWS Boto3 is used for optional S3 integration.
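
The grouping and compression step can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's HierarchicalProcessor: the function names are hypothetical, a regex splitter stands in for NLTK's sentence tokenizer so the snippet runs without downloaded data, and grouping is done by a simple word budget rather than by any semantic criterion. The 10:1 ratio is expressed as a target word count that a summarization prompt could be asked to respect.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter on ., !, ? followed by whitespace (stand-in for NLTK)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def group_sentences(sentences: list[str], max_words: int = 40) -> list[str]:
    """Pack consecutive sentences into groups of at most max_words words.

    A single sentence longer than max_words becomes its own oversized group.
    """
    groups: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current group if adding this sentence would exceed the budget.
        if current and count + n > max_words:
            groups.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        groups.append(" ".join(current))
    return groups


def summary_word_budget(group: str, ratio: int = 10) -> int:
    """Target summary length for ratio:1 compression, never below one word."""
    return max(1, len(group.split()) // ratio)
```

For example, a 400-word group would get a 40-word summary budget under the default ratio. Both the group and its summary can then be embedded, which is what makes the index hierarchical.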

How it runs

sequenceDiagram
    participant U as User
    participant S as Streamlit UI
    participant F as FastAPI Server
    participant P as HierarchicalProcessor
    participant C as ChromaDB
    participant O as OpenAI API

    Note over U, O: Document Ingestion Phase
    U->>S: Upload PDF File
    S->>F: POST /upload file
    F->>F: Parse PDF with PyPDF2
    F->>P: Process Text Hierarchically
    P->>P: Group Sentences & Generate Summaries
    P->>O: Generate Embeddings for Chunks
    O-->>P: Return Vectors
    P->>C: Store Chunks with Metadata
    C-->>F: Confirm Storage
    F-->>S: Upload Success
    S-->>U: Display Success Message

    Note over U, O: Query Phase
    U->>S: Ask Question
    S->>F: POST /query {question}
    F->>O: Generate Query Embedding
    O-->>F: Return Query Vector
    F->>C: Similarity Search
    C-->>F: Return Relevant Chunks
    F->>O: Send Prompt with Context
    O-->>F: Return Generated Answer
    F-->>S: Return Answer
    S-->>U: Display Answer
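
The query phase above boils down to ranking stored chunks against the query vector and assembling a prompt from the best matches. The sketch below shows that logic with an in-memory list standing in for ChromaDB and hand-written toy vectors standing in for OpenAI embeddings. All names here are hypothetical, and cosine similarity is an assumption: the source does not say which distance metric the project configures.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunks: list[dict], k: int = 2) -> list[dict]:
    """Return the k chunks most similar to the query vector.

    Each chunk is a dict with 'text', 'vector', and 'page' keys.
    """
    ranked = sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["vector"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[dict]) -> str:
    """Assemble the retrieved chunks and the question into one LLM prompt."""
    context = "\n\n".join(f"[page {c['page']}] {c['text']}" for c in context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

In the real pipeline the ranking is delegated to the vector store and the prompt is sent to the LLM. The page tags in the context are one way to let the model point back to the source pages.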

How to apply & reuse

Use this template to build internal knowledge base chatbots, legal document reviewers, or academic paper assistants. It is suitable for developers needing a structured RAG implementation with advanced chunking strategies beyond simple character splitting.

At a glance

Capabilities: Document Ingestion, Semantic Chunking, Vector Search, LLM Generation, Metadata Tracking, Cloud Storage Integration

Components: FastAPI Application, Streamlit Interface, Hierarchical Processor, Enhanced Document Processor, ChromaDB Client, S3 Uploader

Tech: Python, FastAPI, Streamlit, ChromaDB, OpenAI API, PyPDF2, NLTK, Boto3, LangChain

Depends on: openai, chromadb, fastapi, streamlit, pypdf2, boto3, nltk, langchain, pydantic, uvicorn

Integrates with: AWS S3, OpenAI GPT Models, Local File System

Patterns: Retrieval-Augmented Generation (RAG), Microservices (API + UI), Hierarchical Data Processing, Vector Database Indexing

Reuse tags: rag, llm, document-chat, fastapi, streamlit, chromadb, pdf-processing