Voice Base — Browser-to-Server Voice UI

What it is

A real-time voice interface that runs entirely in the browser and a local Python server. It captures audio via WebRTC, performs Voice Activity Detection (VAD), transcribes speech locally using faster-whisper, sends text to Anthropic's Claude API, synthesizes response audio locally using Piper TTS, and streams it back to the browser. It supports barge-in (interrupting the assistant) and runtime configuration via an admin panel.

Features

Real-time WebRTC audio streaming with low latency via aiortc
Local Speech-to-Text using faster-whisper (tiny/base/small/medium models)
Local Text-to-Speech using Piper ONNX voices with auto-download
Barge-in support allowing users to interrupt assistant speech mid-sentence
Runtime Admin Panel for tuning VAD, switching models, and previewing voices
Sentence-level streaming: plays first sentence while synthesizing subsequent ones

Quickstart

git clone https://github.com/davidbmar/voice-only-UI-STT-TTS-base.git
cd voice-only-UI-STT-TTS-base
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env to set ANTHROPIC_API_KEY
python server.py

Architecture

flowchart TD
    subgraph Client [Browser]
        Mic[Microphone]
        WebRTC_Client[WebRTC Peer]
        Speaker[Speaker]
    end

    subgraph Server [Python Server]
        Signaling[FastAPI / WebSocket]
        Peer[aiortc PeerConnection]
        VAD[Voice Activity Detector]
        STT[faster-whisper]
        LLM[Claude API]
        TTS[Piper TTS]
        AudioQueue[Audio Queue]
    end

    Mic -->|Audio Frames| WebRTC_Client
    WebRTC_Client -->|Opus Audio| Peer
    Peer -->|PCM Audio| VAD
    VAD -->|Speech Detected| STT
    STT -->|Transcript| LLM
    LLM -->|Response Text| TTS
    TTS -->|PCM Audio| AudioQueue
    AudioQueue -->|Opus Audio| Peer
    Peer -->|Audio Stream| WebRTC_Client
    WebRTC_Client --> Speaker
    
    Signaling <-->|SDP Offer/Answer| WebRTC_Client
    Signaling -->|Controls| VAD
    Signaling -->|Controls| STT
    Signaling -->|Controls| TTS

How it's built

Built with Python 3.11+ using FastAPI for the web server and aiortc for WebRTC handling. Speech-to-Text uses faster-whisper (CPU-based), Text-to-Speech uses Piper ONNX models, and the LLM integration targets Anthropic's Claude. The frontend is a simple HTML/JS client handling getUserMedia and WebRTC peer connection.

How it runs

sequenceDiagram
    participant Browser
    participant Server
    participant Whisper as faster-whisper
    participant Claude as Claude API
    participant Piper as Piper TTS

    Browser->>Server: HTTP GET /
    Server-->>Browser: Serve HTML/JS UI
    Browser->>Server: WebSocket Connect /ws
    Browser->>Server: WebRTC Offer (SDP)
    Server-->>Browser: WebRTC Answer (SDP)
    Browser->>Server: Audio Frames (Opus)
    Server->>Server: VAD Detection
    alt Speech Detected
        Server->>Whisper: Transcribe Audio
        Whisper-->>Server: Text Transcript
        Server->>Claude: Send Prompt
        Claude-->>Server: Response Text
        Server->>Piper: Synthesize Speech
        Piper-->>Server: Audio Chunks
        loop Streaming
            Server->>Browser: Audio Frames (Opus)
            Browser->>Server: Barge-in Audio (if interrupting)
        end
    end

How to apply & reuse

Use as a foundational template for building voice-enabled AI assistants without complex state machines. Ideal for developers needing a low-latency, privacy-conscious (local STT/TTS) voice interface prototype that can be extended with custom logic or different LLM providers.

At a glance

CapabilitiesReal-time voice conversationLocal speech recognitionLocal speech synthesisVoice activity detectionSpeech interruption (barge-in)Runtime model switchingAdmin configuration interface

Componentsserver.pyconfig.pyengine/stt.pyengine/tts.pyengine/adapter.pyengine/types.pystatic/index.html

TechPython 3.11+FastAPIaiortcfaster-whisperPiper TTSWebRTCONNX RuntimeScipyNumpy

Depends onAnthropic API KeyInternet connection for initial model downloadsModern Web Browser with WebRTC support

Integrates withAnthropic Claude APIHuggingFace (for Piper voices)Google STUN servers (default)Twilio TURN servers (optional)

PatternsClient-Server ArchitectureWebRTC Media StreamingLazy Loading ModelsProducer-Consumer (Audio Queue)Event-Driven VAD

Reuse tagsvoice-uiwebrtcsttttsllm-integrationreal-time-audiopython-backend