voice-frontend-modules

What it is

A modular toolkit for building real-time voice applications. It splits connectivity (WebRTC/TURN), security (JWT/Cloudflare Access), and AI processing (STT/TTS/LLM) into three independent packages: transport, edge-auth, and engine-starter. Components can be mixed and matched, and the reference engine-starter can be replaced with production-grade providers without changing the interface the rest of the application sees.

Features

Independent WebRTC transport handling with TURN/STUN support
Pluggable edge authentication for HTTP and WebSockets (Cloudflare, Google JWT)
Reference STT/TTS/LLM engine with swappable ABC-based architecture
Browser-ready JS client for WebRTC signaling and audio capture
Support for barge-in (interruptible playback) and VAD tuning
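
Barge-in generally means stopping playback as soon as the user starts talking. The sketch below shows one way to do that with plain asyncio: playback runs as a task and is cancelled when a voice-activity event fires. The names `play_chunks`, `speak_interruptibly`, and `speech_detected` are hypothetical and are not taken from the toolkit; the real session API may look different.

```python
import asyncio
from typing import List


async def play_chunks(chunks: List[bytes], played: List[bytes]) -> None:
    """Hypothetical playback loop: 'sends' one audio chunk per tick."""
    for chunk in chunks:
        played.append(chunk)       # stand-in for writing to the audio track
        await asyncio.sleep(0.01)  # stand-in for real-time pacing


async def speak_interruptibly(
    chunks: List[bytes],
    speech_detected: asyncio.Event,
    played: List[bytes],
) -> bool:
    """Play chunks until they run out or the VAD event is set.

    Returns True if playback finished, False if it was interrupted.
    """
    playback = asyncio.create_task(play_chunks(chunks, played))
    barge_in = asyncio.create_task(speech_detected.wait())
    done, pending = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    # Give cancelled tasks a chance to unwind before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return playback in done


async def demo() -> tuple:
    # Case 1: nobody speaks, so playback runs to completion.
    played_full: List[bytes] = []
    finished = await speak_interruptibly(
        [b"a", b"b", b"c"], asyncio.Event(), played_full
    )

    # Case 2: the (simulated) VAD fires shortly after playback starts.
    played_cut: List[bytes] = []
    vad = asyncio.Event()
    asyncio.get_running_loop().call_later(0.025, vad.set)
    completed = await speak_interruptibly([b"x"] * 50, vad, played_cut)
    return finished, len(played_full), not completed, len(played_cut)
```

In a real deployment the event would be set by the VAD watching the inbound audio track, and VAD tuning decides how quickly and how reliably that happens.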

Architecture

flowchart TD
    Client[Browser/JS Client] -->|WebSocket| Signaling[Signaling Server]
    Client -->|WebRTC Media| Session[WebRTC Session]
    Signaling -->|Auth Check| Auth[Edge Auth Middleware]
    Auth -->|Validate| Providers[Auth Providers]
    Session -->|Audio In| STT[STT Provider]
    STT -->|Text| LLM[LLM Provider]
    LLM -->|Text Response| TTS[TTS Provider]
    TTS -->|Audio Out| Session
    Session -->|Media Stream| Client
    subgraph Infrastructure
        Signaling
        Auth
        Session
    end
    subgraph AI Engine
        STT
        LLM
        TTS
    end
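
The auth path in the flowchart (Edge Auth Middleware to Auth Providers) can be pictured as a composite that asks each provider in turn. The sketch below reuses the component names AuthProvider, CompositeProvider, and AuthResult from this page, but the method signature, the result fields, and the `StaticTokenProvider` class are assumptions made for illustration only.

```python
import hmac
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass
class AuthResult:
    # Assumed fields; the real AuthResult may carry different data.
    ok: bool
    subject: Optional[str] = None
    reason: Optional[str] = None


class AuthProvider(ABC):
    @abstractmethod
    def authenticate(self, headers: Mapping[str, str]) -> AuthResult:
        """Inspect request headers and decide whether the caller is allowed."""


class StaticTokenProvider(AuthProvider):
    """Toy provider (hypothetical): accepts one fixed bearer token."""

    def __init__(self, token: str, subject: str) -> None:
        self._expected = f"Bearer {token}".encode()
        self._subject = subject

    def authenticate(self, headers: Mapping[str, str]) -> AuthResult:
        supplied = headers.get("authorization", "").encode()
        # Constant-time comparison avoids leaking token contents via timing.
        if hmac.compare_digest(supplied, self._expected):
            return AuthResult(ok=True, subject=self._subject)
        return AuthResult(ok=False, reason="bad or missing token")


class CompositeProvider(AuthProvider):
    """Tries each provider in order; the first success wins."""

    def __init__(self, providers: Sequence[AuthProvider]) -> None:
        self._providers = list(providers)

    def authenticate(self, headers: Mapping[str, str]) -> AuthResult:
        last = AuthResult(ok=False, reason="no providers configured")
        for provider in self._providers:
            last = provider.authenticate(headers)
            if last.ok:
                return last
        return last  # the last failure explains why access was denied
```

A Cloudflare Access or Google JWT provider would slot in next to the toy provider by implementing the same `authenticate` method.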

How it's built

The backend targets Python 3.9+ and uses FastAPI for signaling and WebSocket handling. The transport layer manages WebRTC sessions, ICE gathering, and TURN credentials. The edge-auth layer provides middleware for HTTP and WebSocket authentication. The engine layer defines abstract base classes (ABCs) for speech-to-text, text-to-speech, and LLM interaction; the reference starters (Whisper, Piper, Ollama) implement them and can be replaced by custom implementations.
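
To make the ABC-based design concrete, here is a minimal sketch of what swappable engine interfaces could look like. The class names `STTProvider`, `LLMProvider`, and `TTSProvider`, the method names `transcribe` and `synthesize`, and the fake implementations are all assumptions for illustration; only `chat` appears in the diagrams on this page, and the toolkit's actual base classes may differ.

```python
import asyncio
from abc import ABC, abstractmethod


class STTProvider(ABC):
    @abstractmethod
    async def transcribe(self, audio: bytes) -> str:
        """Turn raw audio into text."""


class LLMProvider(ABC):
    @abstractmethod
    async def chat(self, text: str) -> str:
        """Produce a reply to the user's utterance."""


class TTSProvider(ABC):
    @abstractmethod
    async def synthesize(self, text: str) -> bytes:
        """Turn reply text into audio bytes."""


# Fake implementations standing in for Whisper / Ollama / Piper.
class FakeSTT(STTProvider):
    async def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class ShoutingLLM(LLMProvider):
    async def chat(self, text: str) -> str:
        return text.upper()


class FakeTTS(TTSProvider):
    async def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


async def run_pipeline(
    audio: bytes, stt: STTProvider, llm: LLMProvider, tts: TTSProvider
) -> bytes:
    """One STT -> LLM -> TTS pass that depends only on the abstract types."""
    text = await stt.transcribe(audio)
    reply = await llm.chat(text)
    return await tts.synthesize(reply)
```

Because `run_pipeline` only knows about the abstract types, replacing a reference starter with a hosted provider means writing one new subclass and passing it in.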

How it runs

sequenceDiagram
    participant Browser as Browser Client
    participant FastAPI as FastAPI App
    participant Auth as Edge Auth
    participant Signal as Signaling Server
    participant Session as WebRTC Session
    participant Engine as AI Engine (STT/LLM/TTS)

    Browser->>FastAPI: GET / (Load JS Client)
    FastAPI-->>Browser: HTML/JS
    Browser->>FastAPI: WebSocket Connect /ws
    FastAPI->>Auth: authenticate_ws
    Auth-->>FastAPI: AuthResult
    alt Authorized
        FastAPI->>Signal: handle(websocket)
        Signal->>Session: Create WebRTC Session
        Session->>Browser: SDP Offer/Answer
        Browser->>Session: Media Stream
        loop Voice Interaction
            Session->>Engine: listen(stt)
            Engine-->>Session: Utterance Text
            Session->>Engine: llm.chat(utterance)
            Engine-->>Session: Response Text
            Session->>Engine: speak(response, tts)
            Engine-->>Session: Audio Bytes
            Session->>Browser: Send Audio Track
        end
    else Unauthorized
        FastAPI->>Browser: Close Connection (4001)
    end
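
The Authorized / Unauthorized branch in the sequence above boils down to a small gate in front of the signaling handler. The following sketch uses a fake WebSocket so it runs without any server; `FakeWebSocket`, `ws_endpoint`, the demo token, and the fields of `AuthResult` are hypothetical, while `authenticate_ws` and the 4001 close code come from the diagram.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Dict, Optional

UNAUTHORIZED_CLOSE_CODE = 4001  # close code shown in the diagram


@dataclass
class AuthResult:
    # Assumed shape; the real AuthResult may differ.
    ok: bool
    subject: Optional[str] = None


@dataclass
class FakeWebSocket:
    """Test double with just enough surface for this sketch."""
    headers: Dict[str, str] = field(default_factory=dict)
    closed_with: Optional[int] = None

    async def close(self, code: int = 1000) -> None:
        self.closed_with = code


async def authenticate_ws(ws: FakeWebSocket) -> AuthResult:
    # Assumption: a fixed demo token stands in for real JWT validation.
    if ws.headers.get("authorization") == "Bearer demo":
        return AuthResult(ok=True, subject="demo-user")
    return AuthResult(ok=False)


async def ws_endpoint(
    ws: FakeWebSocket,
    handle: Callable[[FakeWebSocket], Awaitable[None]],
) -> bool:
    """Run the signaling handler only for authenticated sockets."""
    result = await authenticate_ws(ws)
    if not result.ok:
        await ws.close(code=UNAUTHORIZED_CLOSE_CODE)
        return False
    await handle(ws)  # corresponds to Signal.handle(websocket) in the diagram
    return True
```

Rejecting before the handler runs means no WebRTC session or engine resources are allocated for unauthenticated peers.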

How to apply & reuse

Install the required packages with pip, create a FastAPI app, and mount the signaling server on it. Write the conversation logic as an async handler that receives a WebRTCSession object, and call the session's speak() and listen() methods with the STT/TTS providers you choose. If the endpoint must be secured, add the edge-auth middleware.
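
A conversation handler along these lines might look as follows. The sketch runs against a `FakeSession` so it needs no WebRTC stack; the `listen(stt)` and `speak(text, tts)` call shapes follow the sequence diagram, but `FakeSession`, `EchoLLM`, the greeting, and the convention that `listen` returns `None` when the peer hangs up are all assumptions.

```python
import asyncio
from typing import List, Optional


class FakeSession:
    """Hypothetical stand-in for WebRTCSession: scripted input, recorded output."""

    def __init__(self, utterances: List[str]) -> None:
        self._utterances = list(utterances)
        self.spoken: List[str] = []

    async def listen(self, stt: object) -> Optional[str]:
        # Assumption: None signals that the peer has disconnected.
        return self._utterances.pop(0) if self._utterances else None

    async def speak(self, text: str, tts: object) -> None:
        self.spoken.append(text)


class EchoLLM:
    """Toy LLM used only to keep the example self-contained."""

    async def chat(self, utterance: str) -> str:
        return f"You said: {utterance}"


async def conversation(session, stt, tts, llm) -> None:
    """Greet the caller, then loop: listen, ask the LLM, speak the reply."""
    await session.speak("Hello, how can I help?", tts)
    while True:
        utterance = await session.listen(stt)
        if utterance is None:
            break
        reply = await llm.chat(utterance)
        await session.speak(reply, tts)
```

In the real application the signaling server would create the session and invoke the handler once media is flowing; here the same coroutine can be driven directly with `asyncio.run(conversation(FakeSession(["hi"]), None, None, EchoLLM()))`.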

At a glance

Capabilities: WebRTC Signaling, TURN Credential Management, WebSocket Authentication, Speech-to-Text Integration, Text-to-Speech Synthesis, LLM Chat Orchestration, Voice Activity Detection, Interruptible Playback

Components: transport, edge-auth, engine-starter, SignalingServer, WebRTCSession, AuthProvider, CompositeProvider, StarterSTT, StarterTTS, StarterLLM

Tech: Python, FastAPI, WebRTC, WebSocket, Whisper, Piper, Ollama, JavaScript

Depends on: fastapi, uvicorn, websockets, aiortc, twilio, kokoro-onnx, openai-whisper

Integrates with: Cloudflare Access, Google JWT, Twilio TURN, Custom STT Providers, Custom TTS Providers, Custom LLM Providers

Patterns: Dependency Injection, Adapter Pattern, Middleware Pattern, Async/Await, Abstract Base Classes

Reuse tags: voice-ai, webrtc, auth-middleware, stt-tts, fastapi, real-time