Browser Voice Agent

A fully browser-native voice agent using local WebGPU LLMs, TTS, and STT with no server requirements.

https://github.com/davidbmar/browser-voice-agent-with-TTS-STT-and-OpenSourceLLMs-demo  ·  public  ·  shipped

What it is

A real-time voice conversation agent that runs entirely in the browser. It uses the Web Speech API for transcription, a local LLM running on WebGPU for intent classification and response generation, and text-to-speech for audio output. An 8-stage finite state machine (FSM) manages the conversation loop: it detects signals in the transcript, classifies intent, streams the LLM response to the TTS engine sentence by sentence, then observes feedback and updates an adaptive bias system.
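
As a rough illustration of the loop controller idea, here is a minimal table-driven FSM sketch. The eight stage names, the event names, and the `LoopController` class shape are all hypothetical: the description above only says there are eight stages, so these are loosely derived from the steps in the sequence diagram and may not match the repository.

```typescript
// Hypothetical stage names; the project only states that there are eight.
type Stage =
  | "IDLE"
  | "LISTENING"
  | "SIGNAL_DETECT"
  | "CLASSIFY"
  | "GENERATE"
  | "SPEAK"
  | "FEEDBACK_OBSERVE"
  | "UPDATE_BIAS";

// Hypothetical events that move the loop from one stage to the next.
type LoopEvent =
  | "START"
  | "STOP"
  | "TRANSCRIPT"
  | "SIGNAL"
  | "NO_SIGNAL"
  | "INTENT"
  | "SENTENCE_READY"
  | "SPEECH_END"
  | "FEEDBACK"
  | "BIAS_UPDATED";

// Transition table: for each stage, the events it accepts and where they lead.
const transitions: Record<Stage, Partial<Record<LoopEvent, Stage>>> = {
  IDLE: { START: "LISTENING" },
  LISTENING: { TRANSCRIPT: "SIGNAL_DETECT", STOP: "IDLE" },
  SIGNAL_DETECT: { SIGNAL: "CLASSIFY", NO_SIGNAL: "LISTENING" },
  CLASSIFY: { INTENT: "GENERATE" },
  GENERATE: { SENTENCE_READY: "SPEAK" },
  SPEAK: { SPEECH_END: "FEEDBACK_OBSERVE" },
  FEEDBACK_OBSERVE: { FEEDBACK: "UPDATE_BIAS" },
  UPDATE_BIAS: { BIAS_UPDATED: "LISTENING" },
};

class LoopController {
  private stage: Stage = "IDLE";
  private listeners: Array<(stage: Stage) => void> = [];

  get current(): Stage {
    return this.stage;
  }

  // Register a callback for stage changes (for example, a dashboard update).
  onChange(listener: (stage: Stage) => void): void {
    this.listeners.push(listener);
  }

  // Apply an event; returns false and keeps the stage if the event is not valid here.
  dispatch(event: LoopEvent): boolean {
    const next = transitions[this.stage][event];
    if (next === undefined) return false;
    this.stage = next;
    for (const listener of this.listeners) listener(next);
    return true;
  }
}
```

Keeping the transitions in a single table makes invalid events easy to reject, and the `onChange` hook is one way a dashboard could subscribe to state updates.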

Features

Quickstart

git clone https://github.com/davidbmar/browser-voice-agent-with-TTS-STT-and-OpenSourceLLMs-demo.git
cd browser-voice-agent-with-TTS-STT-and-OpenSourceLLMs-demo
npm install
npm run dev

Architecture

flowchart TD
    User[User] -->|Speaks| Mic[Microphone]
    Mic -->|Audio Stream| STT[Web Speech API]
    STT -->|Transcribed Text| FSM[FSM Loop Controller]
    FSM -->|Intent/Context| LLM[WebLLM Local Model]
    LLM -->|Streamed Tokens| TTS[TTS Engine]
    TTS -->|Audio Output| Speaker[Speaker]
    Speaker -->|Heard by| User
    FSM -->|State Updates| Dashboard[React Dashboard]
    subgraph Browser
        STT
        FSM
        LLM
        TTS
        Dashboard
    end

How it's built

Built with React 19 and TypeScript, bundled with Vite 7. It uses WebLLM for local LLM inference on WebGPU, vits-web for neural TTS on desktop (falling back to native SpeechSynthesis on mobile), and the Web Speech API for STT. The UI is styled with Tailwind CSS v4 and shadcn/ui components.
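
The desktop/mobile TTS split can be pictured as an adapter with a selection step. The sketch below is illustrative only: the `TtsEngine` interface, `selectEngine`, and the user-agent heuristic are assumptions, not the project's code, and it deliberately does not call the vits-web API. In a browser, the native engine would typically wrap `speechSynthesis.speak(new SpeechSynthesisUtterance(text))`.

```typescript
// Hypothetical common interface that both TTS backends would implement.
interface TtsEngine {
  readonly name: string;
  speak(text: string): Promise<void>;
}

// Everything the selector needs, injected so it can run outside a browser.
interface EngineEnv {
  userAgent: string;
  neural?: TtsEngine; // e.g. a vits-web wrapper; absent if it failed to load
  native: TtsEngine; // e.g. a SpeechSynthesis wrapper
}

// Assumption: a simple user-agent check is enough to detect mobile devices.
function isMobile(userAgent: string): boolean {
  return /Android|iPhone|iPad|iPod/i.test(userAgent);
}

// Prefer the neural engine on desktop; otherwise fall back to the native one.
function selectEngine(env: EngineEnv): TtsEngine {
  if (!isMobile(env.userAgent) && env.neural !== undefined) {
    return env.neural;
  }
  return env.native;
}
```

Because callers only see `TtsEngine`, the rest of the loop does not need to know which backend is speaking.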

How it runs

sequenceDiagram
    participant U as User
    participant M as Microphone
    participant STT as Web Speech API
    participant FSM as Loop Controller
    participant LLM as WebLLM
    participant TTS as TTS Engine
    
    U->>M: Speak
    M->>STT: Audio Stream
    STT->>FSM: Transcribed Text
    FSM->>FSM: Detect Signal & Classify Intent
    FSM->>LLM: Generate Response (Stream)
    LLM-->>FSM: Token Stream
    FSM->>TTS: Synthesize Sentence
    TTS-->>U: Audio Output
    FSM->>FSM: Observe Feedback & Update Bias
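
The "Token Stream" and "Synthesize Sentence" steps above imply that streamed tokens are grouped into sentences before they reach the TTS engine. A minimal sketch of that buffering follows; `SentenceBuffer` is a hypothetical name, and the boundary rule (terminal punctuation followed by whitespace) is an assumption that will mis-split abbreviations such as "Dr. Smith".

```typescript
// Accumulates streamed tokens and releases complete sentences as they form.
class SentenceBuffer {
  private pending = "";

  // Add one token; returns any sentences completed by this token.
  push(token: string): string[] {
    this.pending += token;
    const sentences: string[] = [];
    // A sentence ends at '.', '!' or '?' followed by whitespace.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    // Keep the unfinished tail for the next token.
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Call when the stream ends to release whatever text is left.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

Each sentence returned by `push` could be handed to the TTS engine immediately, so speech can start while the model is still generating.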

How to apply & reuse

Clone the repository, install dependencies, and run the development server. Open the application in Chrome or Edge 113 or later (WebGPU is required). The app automatically loads a small Qwen model on startup. For deployment, use the provided shell script to sync the app to AWS S3 and CloudFront with the necessary COOP/COEP headers.
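
When reusing the app on other hosting, a small preflight check can surface the two requirements mentioned above. This is a sketch under assumptions: `checkEnvironment` and `BrowserEnv` are hypothetical names, and the header values shown are the standard ones that enable cross-origin isolation, not necessarily the exact values the deploy script sets.

```typescript
// Standard header values that enable cross-origin isolation.
// Assumption: the deployment uses these; the script itself is not shown here.
const isolationHeaders: Record<string, string> = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Minimal view of the browser globals, injected so the check is testable.
interface BrowserEnv {
  navigator: object;
  crossOriginIsolated?: boolean;
}

// Returns a list of human-readable problems; an empty list means ready.
function checkEnvironment(env: BrowserEnv): string[] {
  const problems: string[] = [];
  if (!("gpu" in env.navigator)) {
    problems.push("WebGPU is not available; use Chrome or Edge 113 or later.");
  }
  if (env.crossOriginIsolated !== true) {
    problems.push("Page is not cross-origin isolated; check the COOP/COEP headers.");
  }
  return problems;
}
```

In a browser this would be called as `checkEnvironment({ navigator, crossOriginIsolated })`. The presence of `navigator.gpu` only shows that the API exists; `navigator.gpu.requestAdapter()` can still resolve to `null` on unsupported hardware.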

At a glance

Capabilities: Local LLM Inference, Speech Recognition, Text-to-Speech, Intent Classification, Adaptive Conversation, Offline Operation
Components: Loop Controller (FSM), WebLLM Runtime, VITS TTS Engine, Web Speech API Interface, React Dashboard, Bias System
Tech: React 19, TypeScript, Vite 7, Tailwind CSS v4, WebLLM, vits-web, Web Speech API, WebGPU
Depends on: Chrome 113+, Edge 113+, WebGPU Support, Microphone Access, 1GB+ VRAM
Integrates with: AWS S3, CloudFront, iOS Native SpeechSynthesis, Desktop VITS Voices
Patterns: Finite State Machine, Streaming Response, Local-First Architecture, Adapter Pattern (TTS), Observer Pattern (Feedback)
Reuse tags: voice-agent, webgpu, local-llm, typescript, react, offline-first, speech-recognition, text-to-speech

Repo hygiene

✓ all on main — nothing unmerged.