Browser-Text-to-Speech-TTS-Realtime

A client-side, real-time neural text-to-speech engine using WebAssembly and ONNX.

https://github.com/davidbmar/Browser-Text-to-Speech-TTS-Realtime  ·  public  ·  shipped


What it is

This project is a browser-based text-to-speech system that runs entirely on the client. It runs VITS neural TTS models through ONNX Runtime Web's WebAssembly backend, so no server-side API calls are needed. Playback is streamed: audio begins as soon as the first sentence is synthesized. Inference can use multiple cores via WebAssembly threads, and downloaded models are cached in the browser for offline use.
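
WebAssembly threads rely on `SharedArrayBuffer`, which current browsers expose only on cross-origin isolated pages. A minimal sketch of how a thread count could be chosen under that constraint follows. The `RuntimeInfo` shape and the `pickThreadCount` helper are hypothetical names for illustration and are not taken from this repository.

```typescript
// Hypothetical description of the runtime. In a browser these values would
// come from `globalThis.crossOriginIsolated`, `typeof SharedArrayBuffer`,
// and `navigator.hardwareConcurrency`.
interface RuntimeInfo {
  crossOriginIsolated: boolean;
  hasSharedArrayBuffer: boolean;
  hardwareConcurrency: number | undefined;
}

// Returns how many WASM threads to request. Falls back to 1 when threads
// cannot work, and caps the count at `maxThreads` (an assumed default of 4).
function pickThreadCount(info: RuntimeInfo, maxThreads = 4): number {
  const threadsUsable = info.crossOriginIsolated && info.hasSharedArrayBuffer;
  if (!threadsUsable) return 1;
  const cores = info.hardwareConcurrency ?? 1;
  return Math.max(1, Math.min(cores, maxThreads));
}
```

With ONNX Runtime Web, a value like this would typically be assigned to `env.wasm.numThreads` before the first inference session is created. Whether this project sets it explicitly is not stated here.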

Quickstart

git clone https://github.com/davidbmar/Browser-Text-to-Speech-TTS-Realtime.git
cd Browser-Text-to-Speech-TTS-Realtime
npm install
npm run dev

Architecture

flowchart TD
    User[User Interface] -->|Input Text| Hook[useTTS Hook]
    Hook -->|Manage State| Engine[TTSEngine Class]
    Engine -->|Split Text| Splitter[Sentence Splitter]
    Splitter -->|Queue Sentences| Generator[Audio Generator]
    Generator -->|Load Model| WASM[VITS WebAssembly Module]
    WASM -->|Inference| ONNX[ONNX Runtime Web]
    Generator -->|Audio Blob| Player[HTML5 Audio Player]
    Player -->|Playback Events| Hook
    WASM -->|Cache| IDB[(IndexedDB)]
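
The Sentence Splitter step in the flowchart could be sketched as below. This is an assumed, deliberately simple rule (split after `.`, `!`, or `?` when followed by whitespace), not the repository's actual splitter, and the function name is borrowed from the sequence diagram for illustration only.

```typescript
// Hypothetical splitter: marks a boundary after terminal punctuation that is
// followed by whitespace. Existing line breaks also end a sentence.
// Abbreviations such as "Dr. Smith" are not handled and will be split.
function splitIntoSentences(text: string): string[] {
  return text
    .replace(/([.!?]+)\s+/g, "$1\n") // "Hi. There" becomes "Hi.\nThere"
    .split("\n")
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}
```

Because the rule requires whitespace after the punctuation, decimals such as `3.14` stay inside their sentence.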

How it's built

The application is built with React 18 and TypeScript, bundled with Vite. The core inference engine relies on @diffusionstudio/vits-web, which wraps ONNX Runtime Web. Audio playback is managed via standard HTML5 Audio APIs, while state management for chunk generation and playback status is handled by a custom TTSEngine class and the useTTS React hook. UI components are styled with Tailwind CSS and shadcn/ui.
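
The callback-driven chunk state described above might look roughly like the following. `ChunkTracker`, the status names, and the method names are illustrative assumptions, not the repository's actual TTSEngine API.

```typescript
type ChunkStatus = "pending" | "generating" | "ready" | "playing" | "done";

interface Chunk {
  index: number;
  text: string;
  status: ChunkStatus;
}

type ChunkListener = (chunks: readonly Chunk[]) => void;

// Hypothetical tracker: owns per-sentence state and notifies subscribers
// (for example a React hook calling setState) whenever a status changes.
class ChunkTracker {
  private chunks: Chunk[];
  private listeners = new Set<ChunkListener>();

  constructor(sentences: string[]) {
    this.chunks = sentences.map(
      (text, index): Chunk => ({ index, text, status: "pending" }),
    );
  }

  // Registers a listener and returns an unsubscribe function.
  subscribe(listener: ChunkListener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Builds a new array on every change so consumers that compare by
  // reference can detect the update.
  setStatus(index: number, status: ChunkStatus): void {
    this.chunks = this.chunks.map((chunk) =>
      chunk.index === index ? { ...chunk, status } : chunk,
    );
    this.listeners.forEach((listener) => listener(this.chunks));
  }

  snapshot(): readonly Chunk[] {
    return this.chunks;
  }
}
```

A hook could call `subscribe` inside an effect and return the unsubscribe function as the effect's cleanup.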

How it runs

sequenceDiagram
    participant U as User
    participant H as useTTS Hook
    participant E as TTSEngine
    participant W as VITS WASM
    participant A as Audio Element

    U->>H: speak(text)
    H->>E: processText(text)
    E->>E: splitIntoSentences()
    loop For each sentence
        E->>W: generateAudio(sentence)
        W->>W: ONNX Inference
        W-->>E: return Audio Blob
        E->>E: updateChunkStatus(ready)
    end
    E->>A: play(firstChunk)
    A-->>H: onplay/onended events
    H-->>U: update UI state
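
A minimal sketch of this generate-and-play flow follows, with playback of the first chunk overlapping synthesis of later sentences. The `speakStreaming` function and its injected `synthesize` and `play` parameters are hypothetical: they stand in for the VITS model call and the audio element so the example stays independent of any library.

```typescript
// Both dependencies are injected (assumption for illustration). In a browser,
// `synthesize` would wrap the model call and `play` would resolve when the
// audio element fires its `ended` event.
type Synthesize<T> = (sentence: string) => Promise<T>;
type Play<T> = (audio: T) => Promise<void>;

async function speakStreaming<T>(
  sentences: string[],
  synthesize: Synthesize<T>,
  play: Play<T>,
): Promise<void> {
  // Chain synthesis so sentences are generated one at a time, in order,
  // without waiting for playback to finish.
  const pending: Promise<T>[] = [];
  let previous: Promise<unknown> = Promise.resolve();
  for (const sentence of sentences) {
    const current = previous.then(() => synthesize(sentence));
    pending.push(current);
    // Swallow errors on the chain only; `current` still rejects for the
    // playback loop below, and later sentences keep being synthesized.
    previous = current.catch(() => undefined);
  }

  // Play in order. The first chunk starts as soon as it is ready, while
  // later chunks continue to be synthesized in the background.
  for (const chunk of pending) {
    await play(await chunk);
  }
}
```

In a browser, `play` could set `audio.src = URL.createObjectURL(blob)`, call `audio.play()`, and resolve on the `ended` event.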

How to apply & reuse

Integrate the useTTS hook into any React component to enable local voice synthesis. It is suitable for privacy-sensitive applications requiring offline capability, accessibility tools needing natural-sounding voices without cloud dependencies, or educational platforms wanting to reduce API costs. Developers can customize voice IDs, speed, and concurrency limits via the hook's options.
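
As an illustration of what a concurrency limit means in practice, the sketch below runs an async worker over a list of sentences with at most `limit` calls in flight. `mapWithConcurrency` is a hypothetical helper written for this example. How the hook actually enforces its limit is not shown in this document.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight and
// returns the results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, before any await
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(Math.floor(limit), items.length));
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) {
    lanes.push(runLane());
  }
  await Promise.all(lanes);
  return results;
}
```

A higher limit can shorten total synthesis time on multi-core machines at the cost of more memory and CPU contention, so the right value depends on the device.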

At a glance

Capabilities: Neural TTS Inference, WebAssembly Execution, Streaming Audio Playback, Offline Model Caching, Multi-threaded Processing
Components: useTTS Hook, TTSEngine Class, VITS WebAssembly Module, Sentence Splitter, Audio Player Manager
Tech: TypeScript, React 18, Vite, ONNX Runtime Web, WebAssembly, Tailwind CSS, shadcn/ui
Depends on: @diffusionstudio/vits-web, onnxruntime-web, react, typescript, vite
Integrates with: Browser IndexedDB, HTML5 Audio API, Web Workers
Patterns: Custom React Hook, Class-based Engine, Streaming Data Processing, Lazy Loading Models, State Management via Callbacks
Reuse tags: tts, webassembly, onnx, react-hook, offline-first, privacy-focused
