Browser-Native RAG Retrieval Layer

What it is

This project is a production-shaped prototype of a retrieval layer for real-time transcription apps, in which the whole retrieval pipeline (embedding generation and vector search) runs locally in the browser. It consists of an offline Node.js indexer that builds HNSW artifacts from JSONL transcripts and a Vite-based web application that loads these artifacts and caches them in IndexedDB. At query time, the browser uses Transformers.js with WebGPU acceleration to compute the query embedding and runs the cosine similarity search inside a Web Worker, so the main thread stays responsive.

Features

Zero network calls at query time; all retrieval happens locally in the browser
WebGPU-accelerated embedding inference with automatic WASM fallback
Offline indexing CLI that generates portable HNSW artifacts from JSONL data
Web Worker isolation for heavy computation to keep UI responsive
Model-agnostic design supporting any ONNX sentence-transformer model
~18ms query latency on Apple Silicon after warmup
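
The WebGPU-to-WASM fallback listed above could be decided with a small capability probe before the model is loaded. The sketch below is only an illustration: `pickEmbeddingDevice` and `GpuCapableNavigator` are hypothetical names, not taken from this repository, and the only browser API it relies on is the standard `navigator.gpu.requestAdapter()`.

```typescript
// Minimal structural type for the part of `navigator` this sketch touches,
// so the function can be exercised outside a browser as well.
interface GpuCapableNavigator {
  gpu?: { requestAdapter(): Promise<object | null> };
}

type EmbeddingDevice = "webgpu" | "wasm";

// Hypothetical helper: resolves to "webgpu" only when an adapter is actually
// available, and to "wasm" when WebGPU is missing, unavailable, or throws.
async function pickEmbeddingDevice(
  nav: GpuCapableNavigator
): Promise<EmbeddingDevice> {
  if (!nav.gpu) return "wasm";
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";
  } catch {
    return "wasm";
  }
}
```

In a worker, the result would typically be passed on as the device option when the embedding pipeline is created; how this project wires that up is not shown here.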

Quickstart

git clone https://github.com/davidbmar/browser-RAG-retrieval-realtime-night-index-transformersjs-webgpu-web-app-vite-typescript.git
cd browser-RAG-retrieval-realtime-night-index-transformersjs-webgpu-web-app-vite-typescript
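# Install dependencies for the indexer and the web app
cd indexer-node && npm install
cd ../web-app && npm install

# Build the index from the JSONL files in data/
cd ../indexer-node
npm run build-index -- --input ../data --out ../artifacts

# Copy the artifacts into the web app and start the dev server
cd ../web-app
mkdir -p public/artifacts
cp ../artifacts/* public/artifacts/
npm run dev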

Architecture

flowchart TD
    subgraph Offline_Indexing["Offline Indexing (Node.js)"]
        JSONL["JSONL Chunks"] --> Validate["Validate & Normalize"]
        Validate --> Embed["Embed (Transformers.js)"]
        Embed --> HNSW["Build HNSW Index (hnswlib-node)"]
        HNSW --> Artifacts["Export Artifacts\n(config, metadata, embeddings.bin, hnsw.index)"]
    end

    subgraph Browser_Runtime["Browser Runtime (Vite + TS)"]
        MainThread["Main Thread: UI & Transcript"]
        Worker["Web Worker"]
        
        Artifacts -->|Load & Cache| IndexedDB[("IndexedDB")]
        IndexedDB -->|Fetch| Worker
        
        MainThread -->|Query Text| Worker
        Worker -->|Load Model| TransformersJS["Transformers.js\n(WebGPU/WASM)"]
        TransformersJS -->|Query Embedding| Search["Cosine Search"]
        Search -->|Results| MainThread
    end

    Offline_Indexing -->|Artifacts Directory| Browser_Runtime
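
The "Validate & Normalize" step at the top of the indexing flow might look roughly like the sketch below. The chunk schema is an assumption: the fields `id` and `text` and the function name `parseJsonl` are hypothetical, and the real indexer may require different or additional fields.

```typescript
// Hypothetical chunk shape; the schema used by indexer-node may differ.
interface Chunk {
  id: string;
  text: string;
}

// Parses JSONL source text, collapsing whitespace in `text` and collecting
// a message for every rejected line instead of aborting on the first one.
function parseJsonl(source: string): { chunks: Chunk[]; errors: string[] } {
  const chunks: Chunk[] = [];
  const errors: string[] = [];
  const seen = new Set<string>();

  source.split(/\r?\n/).forEach((line, i) => {
    if (line.trim() === "") return; // skip blank lines
    let value: unknown;
    try {
      value = JSON.parse(line);
    } catch {
      errors.push(`line ${i + 1}: invalid JSON`);
      return;
    }
    const rec = value as Partial<Chunk> | null;
    if (!rec || typeof rec !== "object" ||
        typeof rec.id !== "string" || typeof rec.text !== "string") {
      errors.push(`line ${i + 1}: expected string fields "id" and "text"`);
      return;
    }
    const text = rec.text.replace(/\s+/g, " ").trim();
    if (text === "") {
      errors.push(`line ${i + 1}: empty text`);
      return;
    }
    if (seen.has(rec.id)) {
      errors.push(`line ${i + 1}: duplicate id "${rec.id}"`);
      return;
    }
    seen.add(rec.id);
    chunks.push({ id: rec.id, text });
  });

  return { chunks, errors };
}
```

Collecting errors per line, rather than throwing, lets a CLI report every bad record in one run before deciding whether to build the index.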

How it's built

The system is split into two distinct phases. First, a Node.js CLI (`indexer-node`) uses `hnswlib-node` and `@xenova/transformers` to validate chunks, generate 384-dimensional embeddings, and construct an HNSW index, which it exports together with the other binary artifacts. Second, a TypeScript web app (`web-app`) fetches these artifacts, caches them, and initializes a Web Worker. The worker loads the ONNX model via Transformers.js (preferring WebGPU) and answers each search request by embedding the query and running a brute-force or HNSW-based cosine similarity search over the cached vectors.
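
The brute-force variant of the cosine search described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes the embeddings are held as one flat, row-major `Float32Array` (one vector per `dim` consecutive values), and the names `topKCosine` and `Match` are hypothetical.

```typescript
interface Match {
  index: number; // row of the matching vector in the embeddings buffer
  score: number; // cosine similarity in [-1, 1]
}

// L2-normalizes a vector in place; zero vectors are left unchanged.
function normalize(v: Float32Array): Float32Array {
  let sum = 0;
  for (let i = 0; i < v.length; i++) sum += v[i] * v[i];
  const norm = Math.sqrt(sum);
  if (norm > 0) for (let i = 0; i < v.length; i++) v[i] /= norm;
  return v;
}

// Scores every stored vector against the query and returns the k best.
function topKCosine(
  embeddings: Float32Array,
  dim: number,
  query: Float32Array,
  k: number
): Match[] {
  if (query.length !== dim || embeddings.length % dim !== 0) {
    throw new Error("dimension mismatch");
  }
  const q = normalize(Float32Array.from(query)); // copy, then normalize
  const count = embeddings.length / dim;
  const matches: Match[] = [];
  for (let row = 0; row < count; row++) {
    const offset = row * dim;
    let dot = 0;
    let normSq = 0;
    for (let j = 0; j < dim; j++) {
      const x = embeddings[offset + j];
      dot += x * q[j];
      normSq += x * x;
    }
    // Dividing by the row norm makes the score a true cosine similarity
    // even if the stored vectors were not normalized at index time.
    matches.push({ index: row, score: normSq > 0 ? dot / Math.sqrt(normSq) : 0 });
  }
  matches.sort((a, b) => b.score - a.score);
  return matches.slice(0, k);
}
```

With 384-dimensional vectors the linear scan costs 384 multiply-adds per stored chunk, so it stays practical for small corpora; an HNSW index becomes worthwhile as the corpus grows.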

How it runs

sequenceDiagram
    participant User as User
    participant UI as Main Thread (UI)
    participant Worker as Search Worker
    participant DB as IndexedDB
    participant Model as Transformers.js (WebGPU)

    Note over UI, DB: Initialization Phase
    UI->>Worker: Post 'init' message with artifacts URL
    Worker->>DB: Fetch artifacts (config, metadata, embeddings, index)
    DB-->>Worker: Return cached artifacts
    Worker->>Model: Load ONNX model
    Model-->>Worker: Model ready
    Worker-->>UI: Post 'ready' message

    Note over UI, Model: Query Phase
    User->>UI: Type query & click Search
    UI->>Worker: Post 'search' message (query, topK, requestId)
    Worker->>Model: Compute embedding for query
    Model-->>Worker: Return query vector
    Worker->>Worker: Perform cosine similarity search
    Worker-->>UI: Post 'results' message (matches, latency)
    UI->>User: Display search results
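
The messages in the diagram above map naturally onto discriminated unions. The sketch below is an assumption-laden illustration: the message types (`init`, `ready`, `search`, `results`) and the `query`, `topK`, and `requestId` parameters come from the diagram, while the exact field names (`artifactsUrl`, `latencyMs`), the `SearchMatch` shape, and the `LatestRequestTracker` class are hypothetical. Dropping stale results is one plausible use of `requestId`, not necessarily what this project does with it.

```typescript
// Main thread -> worker
type SearchRequest = { type: "search"; query: string; topK: number; requestId: number };
type WorkerRequest = { type: "init"; artifactsUrl: string } | SearchRequest;

// Hypothetical shape of a single result row.
interface SearchMatch {
  id: string;
  score: number;
  text: string;
}

// Worker -> main thread
type WorkerResponse =
  | { type: "ready" }
  | { type: "results"; requestId: number; matches: SearchMatch[]; latencyMs: number };

// Remembers the newest search so that results belonging to a superseded
// query can be ignored when they arrive late.
class LatestRequestTracker {
  private nextId = 1;
  private latest = 0;

  // Builds the next 'search' message and marks it as the current request.
  next(query: string, topK: number): SearchRequest {
    this.latest = this.nextId++;
    return { type: "search", query, topK, requestId: this.latest };
  }

  // Returns the matches only if the response answers the current request.
  accept(msg: WorkerResponse): SearchMatch[] | null {
    if (msg.type !== "results" || msg.requestId !== this.latest) return null;
    return msg.matches;
  }
}
```

On the main thread, the object returned by `next` would be handed to `worker.postMessage`, and `accept` would be called from the worker's `message` handler before rendering.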

How to apply & reuse

Use this pattern when building privacy-focused or offline-capable applications that need semantic search over static datasets, such as meeting transcripts or documentation. Because queries never leave the browser, it removes the network round trips and per-request costs of hosted embedding APIs and vector databases. The architecture is model-agnostic, so you can swap in another ONNX sentence-transformer model without changing the core retrieval logic.

At a glance

Capabilities: Local Semantic Search, WebGPU Acceleration, Offline Indexing, HNSW Approximate Nearest Neighbor, Web Worker Concurrency, IndexedDB Caching

Components: indexer-node (CLI), web-app (Vite Frontend), Search Worker, Embedding Pipeline, Artifact Exporter

Tech: TypeScript, Vite, Transformers.js, hnswlib-node, WebGPU, ONNX Runtime, IndexedDB

Depends on: Node.js 20+, Chrome 120+ (for WebGPU), npm 10+

Integrates with: LLM Generation Models (downstream), Real-time Transcription Streams

Patterns: Retrieval-Augmented Generation (RAG), Offline-First Architecture, Worker-Based Computation, Model Quantization, Vector Similarity Search

Reuse tags: browser-native, rag-retrieval, webgpu, transformers-js, hnsw, offline-search, typescript, vite

⚠ Needs attention

unmerged_branch: dependabot/npm_and_yarn/indexer-node/npm_and_yarn-3fc7e1cf9b is 1 commit ahead of the default branch
unmerged_branch: dependabot/npm_and_yarn/web-app/rollup-4.60.4 is 1 commit ahead of the default branch
open_pr: PR #4: Bump rollup from 4.57.1 to 4.60.4 in /web-app
open_pr: PR #3: Bump the npm_and_yarn group across 2 directories with 5 updates