Voice Base — Browser-to-Server Voice UI

A minimal, self-contained voice conversation system that connects a browser microphone to Claude over WebRTC, using local Whisper for STT and Piper for TTS.

https://github.com/davidbmar/voice-only-UI-STT-TTS-base  ·  public  ·  shipped


What it is

A real-time voice interface split between a browser client and a local Python server. The browser captures microphone audio and sends it over WebRTC; the server runs Voice Activity Detection (VAD), transcribes speech locally with faster-whisper, sends the transcript to Anthropic's Claude API, synthesizes the reply locally with Piper TTS, and streams the audio back to the browser. It supports barge-in (interrupting the assistant while it is speaking) and runtime configuration through an admin panel.
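
To make the VAD step concrete, here is a minimal sketch of an energy-threshold detector with a hangover period. It is an illustration only: the class name `EnergyVAD`, the threshold of 500, and the hangover length are assumptions, and the repository's actual detector may work differently.

```python
import math
from array import array
from typing import Optional


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit signed PCM frame."""
    samples = array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class EnergyVAD:
    """Hypothetical energy-based VAD (not taken from the repository).

    Speech starts on the first frame at or above the threshold and ends
    after `hangover_frames` consecutive frames below it.
    """

    def __init__(self, threshold: float = 500.0, hangover_frames: int = 10):
        self.threshold = threshold
        self.hangover_frames = hangover_frames
        self.speaking = False
        self._quiet = 0  # consecutive quiet frames seen while speaking

    def process(self, frame: bytes) -> Optional[str]:
        """Return "start", "end", or None for this frame."""
        if rms(frame) >= self.threshold:
            self._quiet = 0
            if not self.speaking:
                self.speaking = True
                return "start"
            return None
        if self.speaking:
            self._quiet += 1
            if self._quiet >= self.hangover_frames:
                self.speaking = False
                self._quiet = 0
                return "end"
        return None
```

A "start" event arriving while the assistant is still speaking is one natural trigger for barge-in, and an "end" event marks the point where the buffered utterance could be handed to the transcriber.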

Features

Quickstart

git clone https://github.com/davidbmar/voice-only-UI-STT-TTS-base.git
cd voice-only-UI-STT-TTS-base
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env to set ANTHROPIC_API_KEY
python server.py

Architecture

flowchart TD
    subgraph Client [Browser]
        Mic[Microphone]
        WebRTC_Client[WebRTC Peer]
        Speaker[Speaker]
    end

    subgraph Server [Python Server]
        Signaling[FastAPI / WebSocket]
        Peer[aiortc PeerConnection]
        VAD[Voice Activity Detector]
        STT[faster-whisper]
        LLM[Claude API]
        TTS[Piper TTS]
        AudioQueue[Audio Queue]
    end

    Mic -->|Audio Frames| WebRTC_Client
    WebRTC_Client -->|Opus Audio| Peer
    Peer -->|PCM Audio| VAD
    VAD -->|Speech Detected| STT
    STT -->|Transcript| LLM
    LLM -->|Response Text| TTS
    TTS -->|PCM Audio| AudioQueue
    AudioQueue -->|Opus Audio| Peer
    Peer -->|Audio Stream| WebRTC_Client
    WebRTC_Client --> Speaker
    
    Signaling <-->|SDP Offer/Answer| WebRTC_Client
    Signaling -->|Controls| VAD
    Signaling -->|Controls| STT
    Signaling -->|Controls| TTS
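
The TTS-to-AudioQueue-to-Peer path in the diagram is a producer-consumer hand-off. The sketch below shows one way such a queue could work using only `asyncio`. The 48 kHz rate, 20 ms frame size, and all function names are assumptions for illustration, not values read from the repository.

```python
import asyncio
from typing import List, Optional

SAMPLE_RATE = 48_000      # assumed output rate (common for Opus over WebRTC)
FRAME_MS = 20             # assumed frame duration
BYTES_PER_SAMPLE = 2      # 16-bit mono PCM
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE


def split_frames(pcm: bytes, frame_bytes: int = FRAME_BYTES) -> List[bytes]:
    """Split PCM into fixed-size frames, zero-padding the final one."""
    frames = []
    for start in range(0, len(pcm), frame_bytes):
        chunk = pcm[start:start + frame_bytes]
        frames.append(chunk.ljust(frame_bytes, b"\x00"))
    return frames


async def producer(queue: "asyncio.Queue[Optional[bytes]]", pcm: bytes) -> None:
    """TTS side: enqueue frames, then a None sentinel to mark the end."""
    for frame in split_frames(pcm):
        await queue.put(frame)   # blocks when the queue is full (backpressure)
    await queue.put(None)


async def consumer(queue: "asyncio.Queue[Optional[bytes]]") -> int:
    """Peer side: drain frames until the sentinel; returns frames handled."""
    sent = 0
    while (frame := await queue.get()) is not None:
        sent += 1                # a real server would pass the frame to the outgoing track
    return sent


async def demo() -> int:
    queue: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue(maxsize=50)
    pcm = bytes(FRAME_BYTES * 3 + 100)   # three full frames plus a partial one
    _, sent = await asyncio.gather(producer(queue, pcm), consumer(queue))
    return sent
```

The bounded queue keeps synthesis from running far ahead of playback, which also limits how much stale audio has to be discarded when the user interrupts.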

How it's built

The server is built with Python 3.11+, using FastAPI for the web server and aiortc for WebRTC handling. Speech-to-text uses faster-whisper (CPU-based), text-to-speech uses Piper ONNX models, and the LLM integration targets Anthropic's Claude. The frontend is a simple HTML/JS client that handles getUserMedia and the WebRTC peer connection.

How it runs

sequenceDiagram
    participant Browser
    participant Server
    participant Whisper as faster-whisper
    participant Claude as Claude API
    participant Piper as Piper TTS

    Browser->>Server: HTTP GET /
    Server-->>Browser: Serve HTML/JS UI
    Browser->>Server: WebSocket Connect /ws
    Browser->>Server: WebRTC Offer (SDP)
    Server-->>Browser: WebRTC Answer (SDP)
    Browser->>Server: Audio Frames (Opus)
    Server->>Server: VAD Detection
    alt Speech Detected
        Server->>Whisper: Transcribe Audio
        Whisper-->>Server: Text Transcript
        Server->>Claude: Send Prompt
        Claude-->>Server: Response Text
        Server->>Piper: Synthesize Speech
        Piper-->>Server: Audio Chunks
        loop Streaming
            Server->>Browser: Audio Frames (Opus)
            Browser->>Server: Barge-in Audio (if interrupting)
        end
    end
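
The "Speech Detected" branch above can be sketched as a single coroutine per turn, with barge-in modeled as cancelling the playback task. Every function here is a hypothetical stub standing in for the real STT, LLM, and TTS calls; none of the names come from the repository.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


async def transcribe(audio: bytes) -> str:
    """Stub for the STT step."""
    return "hello"


async def ask_llm(prompt: str) -> str:
    """Stub for the LLM step."""
    return f"You said: {prompt}"


async def synthesize(text: str) -> AsyncIterator[bytes]:
    """Stub for the TTS step: yields one fake audio chunk per word."""
    for word in text.split():
        await asyncio.sleep(0)
        yield word.encode()


async def play(text: str, sent: List[bytes]) -> None:
    """Stream chunks to the browser (here: append to a list)."""
    async for chunk in synthesize(text):
        sent.append(chunk)
        await asyncio.sleep(0.01)   # pretend each chunk takes time to play


async def run_turn(audio: bytes, barge_in: asyncio.Event, sent: List[bytes]) -> str:
    """Run one turn; stop playback early if barge_in is set."""
    reply = await ask_llm(await transcribe(audio))
    playback = asyncio.create_task(play(reply, sent))
    interrupt = asyncio.create_task(barge_in.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    if playback in done:
        interrupt.cancel()
        return "completed"
    playback.cancel()               # barge-in: drop the rest of the reply
    try:
        await playback
    except asyncio.CancelledError:
        pass
    return "interrupted"


async def demo(interrupt_now: bool) -> Tuple[str, List[bytes]]:
    barge_in = asyncio.Event()
    if interrupt_now:
        barge_in.set()
    sent: List[bytes] = []
    outcome = await run_turn(b"\x00\x00", barge_in, sent)
    return outcome, sent
```

Calling `asyncio.run(demo(False))` plays the whole stubbed reply, while `asyncio.run(demo(True))` stops playback before all chunks are sent.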

How to apply & reuse

Use it as a starting template for building voice-enabled AI assistants without complex state machines. It suits developers who need a low-latency, privacy-conscious voice interface prototype (STT and TTS run locally) that can be extended with custom logic or a different LLM provider.
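
One way to keep the LLM provider swappable is to hide it behind a small interface. The sketch below is an assumption about how that could look, not a description of the repository's own `engine/adapter.py`; `ChatProvider`, `EchoProvider`, and `respond` are hypothetical names.

```python
from typing import Protocol


class ChatProvider(Protocol):
    """Anything that turns a transcript into reply text."""

    def reply(self, transcript: str) -> str: ...


class EchoProvider:
    """Offline stand-in, useful for testing the audio path without an API key."""

    def reply(self, transcript: str) -> str:
        return f"Echo: {transcript}"


def respond(provider: ChatProvider, transcript: str) -> str:
    """Skip empty transcripts, otherwise delegate to the provider."""
    text = transcript.strip()
    if not text:
        return ""                # nothing was said; do not call the LLM
    return provider.reply(text)
```

A Claude-backed class, or one for another vendor, would only need to implement the same `reply` method to slot into the rest of the pipeline.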

At a glance

Capabilities: Real-time voice conversation · Local speech recognition · Local speech synthesis · Voice activity detection · Speech interruption (barge-in) · Runtime model switching · Admin configuration interface
Components: server.py · config.py · engine/stt.py · engine/tts.py · engine/adapter.py · engine/types.py · static/index.html
Tech: Python 3.11+ · FastAPI · aiortc · faster-whisper · Piper TTS · WebRTC · ONNX Runtime · SciPy · NumPy
Depends on: Anthropic API Key · Internet connection for initial model downloads · Modern Web Browser with WebRTC support
Integrates with: Anthropic Claude API · HuggingFace (for Piper voices) · Google STUN servers (default) · Twilio TURN servers (optional)
Patterns: Client-Server Architecture · WebRTC Media Streaming · Lazy Loading Models · Producer-Consumer (Audio Queue) · Event-Driven VAD
Reuse tags: voice-ui · webrtc · stt · tts · llm-integration · real-time-audio · python-backend

Repo hygiene

✓ all on main — nothing unmerged.