Browser-Native RAG Retrieval Layer

A fully local, WebGPU-accelerated RAG retrieval system that runs entirely in the browser with zero network calls at query time.

https://github.com/davidbmar/character-iris-kade-noir-cyberpunk-ai-browser-native-rag-llm-tts-stt  ·  public  ·  shipped

Browser-Native RAG Retrieval Layer screenshot

What it is

This project implements a production-shaped prototype for real-time transcription search where both embedding generation and vector search occur locally within the client's browser. It decouples retrieval from generation, allowing any downstream LLM to be paired with it. The system uses an offline Node.js CLI to index JSONL transcript chunks into HNSW artifacts, which are then loaded by a Vite-based web application. By leveraging Transformers.js and WebGPU, it achieves ~18ms query latency on modern hardware without sending data to external APIs.

Features

Quickstart

git clone https://github.com/davidbmar/browser-RAG-retrieval-realtime-night-index-transformersjs-webgpu-web-app-vite-typescript.git
cd browser-RAG-retrieval-realtime-night-index-transformersjs-webgpu-web-app-vite-typescript
cd indexer-node && npm install
cd ../web-app && npm install
cd ..
cd indexer-node
npm run build-index -- --input ../data --out ../artifacts
cd ../web-app
mkdir -p public/artifacts
cp ../artifacts/* public/artifacts/
npm run dev

Architecture

flowchart TD
    subgraph Offline_Indexing["Offline Indexing (Node.js)"]
        JSONL["JSONL Chunks"] --> Validate["Validate & Normalize"]
        Validate --> Embed["Embed via Transformers.js"]
        Embed --> HNSW["Build HNSW Index"]
        HNSW --> Artifacts["Export Artifacts\n(config, metadata, embeddings.bin, hnsw.index)"]
    end

    subgraph Browser_Runtime["Browser Runtime (Vite + TS)"]
        MainThread["Main Thread\n(UI + Transcript Panel)"]
        Worker["Web Worker\n(Search Engine)"]
        
        Artifacts -->|Load & Cache| IndexedDB[("IndexedDB")]
        IndexedDB --> Worker
        
        MainThread -->|Query Text| Worker
        Worker -->|Load Model| TransformersJS["Transformers.js\n(WebGPU/WASM)"]
        TransformersJS -->|Query Embedding| Search["Cosine Similarity Search"]
        Search -->|Results| MainThread
    end

    Offline_Indexing -->|Static Files| Browser_Runtime

How it's built

The architecture is split into two distinct phases. First, an offline indexing pipeline (Node.js) uses `@xenova/transformers` to generate 384-dimensional sentence embeddings and `hnswlib-node` to build an Approximate Nearest Neighbor index. These artifacts (binary embeddings, index, and metadata) are exported as static files. Second, the browser runtime (Vite + TypeScript) loads these artifacts into IndexedDB. A dedicated Web Worker handles all heavy computation: it initializes the Transformers.js pipeline (preferring WebGPU, falling back to WASM), computes query embeddings on the fly, and performs cosine similarity search against the cached vectors, keeping the main UI thread responsive.

How it runs

sequenceDiagram
    participant User
    participant UI as Main Thread
    participant Worker as Search Worker
    participant DB as IndexedDB
    participant Model as Transformers.js

    Note over UI, DB: Initialization Phase
    UI->>Worker: Post 'init' message with artifacts URL
    Worker->>DB: Fetch artifacts (config, metadata, embeddings)
    DB-->>Worker: Return cached artifacts
    Worker->>Model: Load embedding model (WebGPU/WASM)
    Model-->>Worker: Model ready
    Worker-->>UI: Post 'ready' status

    Note over UI, Model: Query Phase
    User->>UI: Type question & click Search
    UI->>Worker: Post 'search' message (query, topK)
    Worker->>Model: Compute query embedding
    Model-->>Worker: Return Float32 vector
    Worker->>Worker: Perform cosine similarity search
    Worker-->>UI: Post 'results' (matches, latency)
    UI->>User: Display ranked transcript chunks

How to apply & reuse

Use this as a privacy-first retrieval layer for applications handling sensitive transcripts, such as legal depositions, medical records, or internal corporate meetings. It is ideal for scenarios where network latency or data sovereignty prevents the use of cloud-based vector databases. Developers can swap the underlying ONNX model to adjust for different languages or domain-specific terminology without changing the core retrieval logic.

At a glance

CapabilitiesLocal Vector SearchWebGPU InferenceOffline IndexingReal-time Transcription SupportPrivacy-Preserving Retrieval
ComponentsNode.js CLI IndexerVite Web ApplicationSearch Web WorkerHNSW Index BuilderTransformers.js Pipeline
TechTypeScriptViteTransformers.jsWebGPUHNSWIndexedDBONNX
Depends on@xenova/transformershnswlib-nodenode:fsnode:pathnode:module
Integrates withJSONL Data SourcesDownstream LLMsBrowser Storage APIs
PatternsRetrieval-Augmented Generation (RAG)Worker IsolationOffline-First ArchitectureModel QuantizationApproximate Nearest Neighbor (ANN)
Reuse tagslocal-aibrowser-ragwebgpuprivacy-firstvector-searchtypescript

⚠ Needs attention