Browser Voice Agent · davidbmar.com

What it is

A real-time voice conversation agent that runs entirely in the browser. It uses the Web Speech API for transcription, a local LLM running on WebGPU for intent classification and response generation, and Text-to-Speech for audio output. An 8-stage Finite State Machine (FSM) manages the conversation loop, with an adaptive bias system and streaming responses.

Features

Local LLM inference via WebGPU with support for 15+ models
Real-time Speech-to-Text using Web Speech API with silence detection
Streaming Text-to-Speech with VITS neural voices on desktop
8-stage FSM architecture for robust conversation management
Adaptive bias system that adjusts verbosity and confidence based on user feedback
Mobile optimizations for iOS Chrome with native API fallbacks
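
The adaptive bias feature can be pictured as a small pure update function. The sketch below is an illustration only: the `Bias` shape, the `Feedback` labels, and the step size are assumptions, since this page only says that verbosity and confidence are adjusted from user feedback.

```typescript
// Hypothetical bias state; the real project may track different fields.
interface Bias {
  verbosity: number;  // 0 = terse, 1 = very detailed
  confidence: number; // 0 = hedge everything, 1 = assert plainly
}

// Hypothetical feedback labels, invented for this sketch.
type Feedback = "too_long" | "too_short" | "correct" | "wrong";

// Keep every value inside the [0, 1] range.
const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Return a new bias nudged by one step in the direction the feedback suggests.
function updateBias(bias: Bias, feedback: Feedback, step = 0.1): Bias {
  switch (feedback) {
    case "too_long":
      return { ...bias, verbosity: clamp01(bias.verbosity - step) };
    case "too_short":
      return { ...bias, verbosity: clamp01(bias.verbosity + step) };
    case "correct":
      return { ...bias, confidence: clamp01(bias.confidence + step) };
    case "wrong":
      return { ...bias, confidence: clamp01(bias.confidence - step) };
  }
}
```

Returning a new object instead of mutating the old one keeps each update easy to log and replay on a dashboard.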

Quickstart

git clone https://github.com/davidbmar/browser-voice-agent-with-TTS-STT-and-OpenSourceLLMs-demo.git
cd browser-voice-agent-with-TTS-STT-and-OpenSourceLLMs-demo
npm install
npm run dev

Architecture

flowchart TD
    User[User] -->|Speaks| Mic[Microphone]
    Mic -->|Audio Stream| STT[Web Speech API]
    STT -->|Transcribed Text| FSM[FSM Loop Controller]
    FSM -->|Intent/Context| LLM[WebLLM Local Model]
    LLM -->|Streamed Tokens| TTS[TTS Engine]
    TTS -->|Audio Output| Speaker[Speaker]
    Speaker -->|Heard by| User
    FSM -->|State Updates| Dashboard[React Dashboard]
    subgraph Browser
        STT
        FSM
        LLM
        TTS
        Dashboard
    end
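
The FSM Loop Controller in the diagram can be sketched as a transition table plus a guard. This is a minimal illustration, not the project's code: the eight stage names and the allowed transitions are guesses based on the diagrams on this page, and `LoopController` here is a hypothetical class.

```typescript
// Hypothetical stage names; the page states there are 8 stages but does not list them.
type Stage =
  | "IDLE"
  | "LISTENING"
  | "SIGNAL_DETECT"
  | "CLASSIFY"
  | "GENERATE"
  | "SPEAK"
  | "OBSERVE_FEEDBACK"
  | "UPDATE_BIAS";

// Assumed transition table: each stage lists the stages it may move to.
const NEXT: Record<Stage, readonly Stage[]> = {
  IDLE: ["LISTENING"],
  LISTENING: ["SIGNAL_DETECT", "IDLE"],
  SIGNAL_DETECT: ["CLASSIFY", "LISTENING"],
  CLASSIFY: ["GENERATE"],
  GENERATE: ["SPEAK"],
  SPEAK: ["OBSERVE_FEEDBACK"],
  OBSERVE_FEEDBACK: ["UPDATE_BIAS"],
  UPDATE_BIAS: ["LISTENING", "IDLE"],
};

class LoopController {
  private stage: Stage = "IDLE";
  private listeners: Array<(stage: Stage) => void> = [];

  get current(): Stage {
    return this.stage;
  }

  // Subscribers (for example a dashboard) are told about every state change.
  onChange(listener: (stage: Stage) => void): void {
    this.listeners.push(listener);
  }

  // Move to the next stage, rejecting transitions the table does not allow.
  advance(to: Stage): void {
    if (NEXT[this.stage].indexOf(to) === -1) {
      throw new Error(`Illegal transition ${this.stage} -> ${to}`);
    }
    this.stage = to;
    for (const listener of this.listeners) listener(to);
  }
}
```

Rejecting illegal transitions up front means a late STT or TTS callback cannot silently push the loop into an inconsistent state.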

How it's built

The app is built with React 19 and TypeScript and bundled with Vite 7. It uses WebLLM for local LLM inference on WebGPU, vits-web for neural TTS on desktop (falling back to native SpeechSynthesis on mobile), and the Web Speech API for STT. The UI is styled with Tailwind CSS v4 and shadcn/ui components.
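
The desktop/mobile TTS split described above fits an adapter: both engines share one interface and a selector picks between them. The sketch below assumes a hypothetical `TtsEngine` interface and a user-agent heuristic; how the project actually detects mobile devices is not stated on this page.

```typescript
// Hypothetical common interface for both TTS backends.
interface TtsEngine {
  readonly name: string;
  speak(text: string): Promise<void>;
}

// Assumed heuristic: treat common phone and tablet user agents as mobile.
function isMobileUserAgent(userAgent: string): boolean {
  return /Android|iPhone|iPad|iPod/i.test(userAgent);
}

// Pick the native engine on mobile and the neural engine everywhere else.
function selectTts(userAgent: string, neural: TtsEngine, native: TtsEngine): TtsEngine {
  return isMobileUserAgent(userAgent) ? native : neural;
}
```

In a browser, the native engine would wrap `speechSynthesis.speak(new SpeechSynthesisUtterance(text))` and the neural engine would wrap vits-web; callers only ever see `TtsEngine`.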

How it runs

sequenceDiagram
    participant U as User
    participant M as Microphone
    participant STT as Web Speech API
    participant FSM as Loop Controller
    participant LLM as WebLLM
    participant TTS as TTS Engine
    
    U->>M: Speak
    M->>STT: Audio Stream
    STT->>FSM: Transcribed Text
    FSM->>FSM: Detect Signal & Classify Intent
    FSM->>LLM: Generate Response (Stream)
    LLM-->>FSM: Token Stream
    FSM->>TTS: Synthesize Sentence
    TTS-->>U: Audio Output
    FSM->>FSM: Observe Feedback & Update Bias
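
The "Token Stream" to "Synthesize Sentence" hand-off implies some buffering: tokens arrive piecemeal, but TTS works best on whole sentences. The sketch below shows one way to do that. `SentenceBuffer` is a hypothetical helper, and the boundary rule (terminal punctuation followed by whitespace) is an assumption, not the project's documented behavior.

```typescript
// Accumulates streamed tokens and releases complete sentences as they form.
class SentenceBuffer {
  private pending = "";

  // Add one token; return any sentences that are now complete.
  push(token: string): string[] {
    this.pending += token;
    const sentences: string[] = [];
    // A sentence ends at ., ! or ? followed by whitespace. Waiting for the
    // whitespace avoids splitting inside numbers such as "3.14".
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Call when the stream ends to release whatever text is left.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest ? [rest] : [];
  }
}
```

Each sentence returned by `push` can be sent to the TTS engine while the model is still generating, which is what lets speech start before the full response exists.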

How to apply & reuse

Clone the repository, install dependencies, and run the development server. Open the application in Chrome or Edge 113+ (WebGPU is required). The app automatically loads a small Qwen model on startup. For deployment, use the provided shell script, which syncs the build to AWS S3 and CloudFront and applies the required COOP/COEP headers.
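
When reusing the deployment setup, it helps to verify that the COOP/COEP headers actually reach the browser. The sketch below checks a list of response headers against the values commonly used for cross-origin isolation. The exact values the project's script sets are not shown on this page, so `same-origin` and `require-corp` are assumptions, and `missingIsolationHeaders` is a hypothetical helper.

```typescript
// Assumed header values; these are the usual pair for cross-origin isolation.
const ISOLATION_HEADERS: ReadonlyArray<readonly [string, string]> = [
  ["Cross-Origin-Opener-Policy", "same-origin"],
  ["Cross-Origin-Embedder-Policy", "require-corp"],
];

// Return the names of required headers that are absent or have another value.
// Header names are compared case-insensitively, as HTTP requires.
function missingIsolationHeaders(
  actual: ReadonlyArray<readonly [string, string]>,
): string[] {
  const seen = new Map<string, string>();
  actual.forEach(([name, value]) => {
    seen.set(name.toLowerCase(), value.trim().toLowerCase());
  });
  return ISOLATION_HEADERS
    .filter(([name, value]) => seen.get(name.toLowerCase()) !== value)
    .map(([name]) => name);
}
```

In a browser, the header pairs can come from a `fetch` response's `headers.entries()`, and the global `crossOriginIsolated` flag gives a quick end-to-end confirmation.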

At a glance

Capabilities: Local LLM Inference, Speech Recognition, Text-to-Speech, Intent Classification, Adaptive Conversation, Offline Operation

Components: Loop Controller (FSM), WebLLM Runtime, VITS TTS Engine, Web Speech API Interface, React Dashboard, Bias System

Tech: React 19, TypeScript, Vite 7, Tailwind CSS v4, WebLLM, vits-web, Web Speech API, WebGPU

Depends on: Chrome 113+, Edge 113+, WebGPU Support, Microphone Access, 1GB+ VRAM

Integrates with: AWS S3, CloudFront, iOS Native SpeechSynthesis, Desktop VITS Voices

Patterns: Finite State Machine, Streaming Response, Local-First Architecture, Adapter Pattern (TTS), Observer Pattern (Feedback)

Reuse tags: voice-agent, webgpu, local-llm, typescript, react, offline-first, speech-recognition, text-to-speech