Iris Kade v2: Local Browser AI with Intent Workflow

What it is

Iris Kade is a privacy-first, browser-native AI assistant that runs entirely on client-side hardware using WebGPU. It combines local LLM inference (via web-llm) with a multi-lane RAG pipeline and a custom declarative workflow engine. The system uses vector embeddings to classify user intents and route conversations through defined state machines, enabling complex, context-aware interactions without sending data to external servers.
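
As a rough illustration of what "multi-lane" retrieval can mean, the sketch below assumes each candidate chunk already carries a similarity score and a lane label, and that each lane contributes a fixed quota to the prompt. The `ScoredChunk` shape, the `composeLanes` function, and the per-lane quota are hypothetical names chosen for this example, not the project's actual API.

```typescript
// Hypothetical types: the project's real chunk structure may differ.
type Lane = "persona" | "playbook" | "lore";

interface ScoredChunk {
  lane: Lane;
  text: string;
  score: number; // similarity to the query, higher is better
}

// Keep the best `perLane` chunks from each lane independently, so a
// large lane cannot crowd the others out of the composed context.
function composeLanes(hits: ScoredChunk[], perLane: number): ScoredChunk[] {
  const order: Lane[] = ["persona", "playbook", "lore"];
  const picked: ScoredChunk[] = [];
  for (const lane of order) {
    const best = hits
      .filter((h) => h.lane === lane) // filter() copies, so sort() is safe
      .sort((a, b) => b.score - a.score)
      .slice(0, perLane);
    for (const h of best) picked.push(h);
  }
  return picked;
}
```

Ranking per lane rather than globally is one possible design: it guarantees that persona text reaches the prompt even when lore chunks score higher overall.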

Features

100% local execution using WebGPU with no network calls or API keys
Visual workflow editor for defining stateful, multi-turn conversation graphs
Embedder-based intent classification with vector and lexical reranking
Streaming TTS and STT for hands-free, real-time voice interaction
Multi-lane RAG pipeline retrieving from persona, playbook, and lore contexts
Built-in diagnostic harness for one-click conversation testing and analysis
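
The embedder-based intent classification listed above could look roughly like the following. This is a minimal sketch under stated assumptions: embeddings are plain number arrays computed elsewhere, each intent has a single example embedding plus a keyword list, and the vector and lexical scores are blended with a fixed weight. The `IntentExample` shape, `classifyIntent`, and `lexicalWeight` are illustrative names, and the project's real reranking may combine the signals differently.

```typescript
// Hypothetical shape: one precomputed embedding and a few keywords per intent.
interface IntentExample {
  intent: string;
  embedding: number[];
  keywords: string[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Fraction of an intent's keywords that appear as words in the utterance.
function lexicalOverlap(utterance: string, keywords: string[]): number {
  if (keywords.length === 0) return 0;
  const words = utterance.toLowerCase().split(/\W+/);
  let matched = 0;
  for (const k of keywords) {
    if (words.indexOf(k.toLowerCase()) !== -1) matched++;
  }
  return matched / keywords.length;
}

// Blend vector and lexical scores; `lexicalWeight` is an assumed tuning knob.
// Returns null when there are no examples to compare against.
function classifyIntent(
  utterance: string,
  queryEmbedding: number[],
  examples: IntentExample[],
  lexicalWeight: number = 0.3
): string | null {
  let best: string | null = null;
  let bestScore = -Infinity;
  for (const ex of examples) {
    const score =
      (1 - lexicalWeight) * cosineSimilarity(queryEmbedding, ex.embedding) +
      lexicalWeight * lexicalOverlap(utterance, ex.keywords);
    if (score > bestScore) {
      bestScore = score;
      best = ex.intent;
    }
  }
  return best;
}
```

The lexical term lets an exact keyword match override a slightly closer embedding, which can help with short utterances where embeddings are less discriminative.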

Quickstart

git clone https://github.com/davidbmar/speaker-generation-version-2-intents
cd speaker-generation-version-2-intents/web-app
npm install
npm run dev

Architecture

flowchart TD
    User[User Input] -->|Voice/Text| STT[Web Speech API]
    STT --> FSM[Conversation FSM]
    FSM -->|IDLE to PROCESSING| Pipeline[RAG Pipeline]
    Pipeline -->|Retrieve & Rerank| Context[Context Composer]
    Context --> LLM[Local LLM via web-llm]
    LLM -->|Stream Tokens| Buffer[Sentence Buffer]
    Buffer --> TTS[vits-web TTS]
    TTS --> Audio[Audio Output]
    FSM -->|Interrupt| Pipeline
    FSM -->|Reset| IDLE((IDLE State))
    subgraph Local Browser
        STT
        FSM
        Pipeline
        LLM
        TTS
    end
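
The Sentence Buffer step in the diagram sits between token streaming and TTS. A minimal sketch of such a buffer follows, assuming sentences end at `.`, `!`, or `?` followed by whitespace. The `SentenceBuffer` class and its method names are illustrative, not taken from the project's source.

```typescript
// Accumulates streamed tokens and emits complete sentences for TTS.
class SentenceBuffer {
  private pending = "";

  // Append a token and return any sentences it completed.
  push(token: string): string[] {
    this.pending += token;
    const sentences: string[] = [];
    // Assumed boundary rule: terminal punctuation followed by whitespace.
    const boundary = /[.!?]\s+/;
    let match = boundary.exec(this.pending);
    while (match !== null) {
      const end = match.index + 1; // keep the punctuation mark
      const sentence = this.pending.slice(0, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      this.pending = this.pending.slice(match.index + match[0].length);
      match = boundary.exec(this.pending);
    }
    return sentences;
  }

  // Return whatever text remains when the token stream ends.
  flush(): string {
    const rest = this.pending.trim();
    this.pending = "";
    return rest;
  }
}
```

Because a sentence is only emitted once the following whitespace arrives, the last sentence of a response needs an explicit `flush()`. This simple rule will also split on abbreviations such as "e.g. ", so a production buffer would likely need a more careful boundary check.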

How it's built

The application is written in TypeScript and runs in the browser. It uses web-llm for local LLM inference, vits-web for streaming text-to-speech, and the Web Speech API for speech-to-text. A custom workflow engine parses JSONL definitions into interactive state graphs. Vector search runs in Web Workers over ONNX-generated embeddings, while the UI manages a four-state conversation FSM (IDLE, PROCESSING, SPEAKING, INTERRUPTED).
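
The four conversation states can be expressed as a small transition table. The sketch below uses the state names from the text, but the event names (`input`, `speechStart`, `speechEnd`, `interrupt`, `reset`) and the exact set of allowed transitions are assumptions made for illustration.

```typescript
type ConversationState = "IDLE" | "PROCESSING" | "SPEAKING" | "INTERRUPTED";

// Hypothetical event names; the project may use different triggers.
type FsmEvent = "input" | "speechStart" | "speechEnd" | "interrupt" | "reset";

// For each state, the events it reacts to and the resulting state.
const transitions: Record<ConversationState, Partial<Record<FsmEvent, ConversationState>>> = {
  IDLE: { input: "PROCESSING" },
  PROCESSING: { speechStart: "SPEAKING", interrupt: "INTERRUPTED" },
  SPEAKING: { speechEnd: "IDLE", interrupt: "INTERRUPTED" },
  INTERRUPTED: { reset: "IDLE" },
};

// Return the next state, or stay in place if the event is not handled.
function nextState(state: ConversationState, event: FsmEvent): ConversationState {
  const target = transitions[state][event];
  return target === undefined ? state : target;
}
```

Keeping the transitions in a table rather than in scattered conditionals makes it easy to see that an interrupt is only meaningful while processing or speaking, and that unhandled events are ignored rather than treated as errors.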

How it runs

sequenceDiagram
    participant U as User
    participant UI as UI/FSM
    participant W as Workflow/Intent
    participant RAG as RAG Pipeline
    participant LLM as WebLLM
    participant TTS as VITS Web
    
    U->>UI: Speak/Type Input
    UI->>W: Classify Intent
    W->>W: Match State/Transition
    W->>RAG: Retrieve Context
    RAG->>RAG: Vector + Lexical Rerank
    RAG->>LLM: Compose Prompt
    LLM->>LLM: Generate Tokens
    LLM->>TTS: Stream Sentences
    TTS->>U: Play Audio
    U->>UI: Interrupt (Optional)
    UI->>LLM: Abort Generation
    UI->>TTS: Stop Audio
    UI->>W: Reset to IDLE
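
The interrupt path in the sequence above relies on generation checking for cancellation between tokens. The following sketch shows that cooperative pattern in a simplified, synchronous form. The `Cancellable` interface and `streamUntilAborted` function are hypothetical, and a real token stream would be asynchronous, with the same check placed inside the streaming loop.

```typescript
// Minimal cancellation flag. A browser AbortSignal exposes the same
// `aborted` property, so an AbortController's signal fits this shape.
interface Cancellable {
  readonly aborted: boolean;
}

// Feed tokens to `onToken` until the source is exhausted or the flag is set.
// Returns true if generation finished, false if it was interrupted.
function streamUntilAborted(
  tokens: string[],
  signal: Cancellable,
  onToken: (token: string) => void
): boolean {
  for (const token of tokens) {
    if (signal.aborted) return false; // stop before emitting more output
    onToken(token);
  }
  return true;
}
```

Checking the flag before each token means an interrupt takes effect at the next token boundary, after which the caller can stop audio playback and reset the conversation state.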

How to apply & reuse

Use this project as a foundation for building private, offline-capable AI assistants or interactive narrative engines. The modular workflow system lets developers define complex conversation logic declaratively, and the RAG pipeline can be repurposed for domain-specific knowledge retrieval. It also serves as a reference implementation for browser-based AI applications built on WebGPU.
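
To give a feel for declarative conversation logic, the sketch below parses a JSONL workflow into a state graph and advances it by intent. The record shape (`from`, `intent`, `to`) is an assumed schema invented for this example; the project's actual JSONL format is not shown here and may differ.

```typescript
// Assumed record shape, one JSON object per line.
interface TransitionRecord {
  from: string;
  intent: string;
  to: string;
}

// state -> intent -> next state
type WorkflowGraph = Record<string, Record<string, string>>;

// Build the graph from JSONL text, skipping blank lines.
function parseWorkflow(jsonl: string): WorkflowGraph {
  // Object.create(null) avoids collisions with inherited keys such as "constructor".
  const graph: WorkflowGraph = Object.create(null);
  for (const raw of jsonl.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;
    const rec = JSON.parse(line) as TransitionRecord;
    if (graph[rec.from] === undefined) graph[rec.from] = Object.create(null);
    graph[rec.from][rec.intent] = rec.to;
  }
  return graph;
}

// Follow the edge for `intent`, or stay in `state` if none is defined.
function advance(graph: WorkflowGraph, state: string, intent: string): string {
  const edges = graph[state];
  if (edges === undefined || edges[intent] === undefined) return state;
  return edges[intent];
}
```

With a workflow such as `{"from":"start","intent":"greet","to":"welcome"}` on one line, `advance(graph, "start", "greet")` moves to `welcome`, while an unrecognized intent leaves the conversation where it is.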

At a glance

Capabilities: Local LLM Inference, Intent Classification, Visual Workflow Editing, Vector Search RAG, Speech Synthesis, Speech Recognition, State Machine Management

Components: web-llm, vits-web, Web Speech API, Workflow Manager, Intent Classifier, Search Worker, Code View Renderer

Tech: TypeScript, WebGPU, ONNX Runtime, Vite, Mermaid, JSONL

Depends on: Chrome 113+, Edge 113+, WebGPU Support, Node.js (for dev)

Integrates with: Web Audio API, Web Workers, File System Access (for packs)

Patterns: Finite State Machine, Retrieval-Augmented Generation, Model-View-Controller, Worker-based Offloading, Declarative Configuration

Reuse tags: privacy-first, offline-ai, webgpu, local-llm, voice-interface, workflow-engine

⚠ Needs attention

unmerged_branch: dependabot/npm_and_yarn/web-app/npm_and_yarn-6a3a1fcd0f is 1 commit ahead of the default branch
open_pr: PR #1: Bump the npm_and_yarn group across 2 directories with 6 updates