WhisperLive · davidbmar.com

What it is

WhisperLive is a Python-based server application that exposes a WebSocket API for real-time audio transcription. It supports multiple inference backends (Faster-Whisper, TensorRT, OpenVINO) so it can be tuned to different hardware. The project includes browser extensions (Chrome/Firefox) and Python clients that capture microphone or tab audio, convert it into 16 kHz mono chunks, and stream those chunks to the server, which returns transcribed text with low latency.

Features

Real-time transcription via WebSocket streaming
Supports Faster-Whisper, TensorRT, and OpenVINO backends
Browser extensions for Chrome and Firefox with tab/mic capture
Voice Activity Detection (VAD) to reduce processing load
Multi-client support with configurable connection limits
SRT caption export functionality
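
SRT export turns timed transcript segments into numbered caption blocks. The project's own exporter is not shown here. The sketch below is a minimal, hypothetical version that assumes each segment carries start and end times in seconds plus its text, and writes the standard SRT layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line). `Segment`, `srt_timestamp`, and `segments_to_srt` are illustrative names, not WhisperLive APIs.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Hypothetical transcript segment: times in seconds, plus the recognized text."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as SRT: numbered blocks separated by a blank line."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)
```

If your segments carry times as strings (for example `"1.500"`), convert them with `float()` before building `Segment` objects.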

Quickstart

bash scripts/setup.sh
pip install whisper-live
python3 run_server.py --port 9090 --backend faster_whisper

Architecture

flowchart TD
    Client[Browser/Python Client] -->|WebSocket Binary Audio| Server[WhisperLive Server]
    Server --> Backend{Inference Backend}
    Backend -->|CPU or NVIDIA GPU| FW[Faster-Whisper]
    Backend -->|GPU NVIDIA| TRT[TensorRT-LLM]
    Backend -->|Intel HW| OV[OpenVINO]
    FW --> Text[Transcribed Text]
    TRT --> Text
    OV --> Text
    Text -->|WebSocket JSON| Client
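
The diagram shows one server dispatching to interchangeable backends, the "Plugin Architecture" pattern listed under At a glance. Below is a minimal sketch of that idea, assuming a registry keyed by the `--backend` name. `faster_whisper` comes from the Quickstart command, while `tensorrt` and `openvino` are assumed names. The registry, decorator, and stub classes are hypothetical and do not run real ASR.

```python
from typing import Callable, Dict, List, Protocol, Sequence


class ASRBackend(Protocol):
    """Interface every backend plugin is assumed to satisfy."""
    def transcribe(self, audio: Sequence[float]) -> str: ...


# Hypothetical registry: backend name -> zero-argument factory.
_BACKENDS: Dict[str, Callable[[], ASRBackend]] = {}


def register_backend(name: str):
    """Class decorator that registers a backend under a flag-style name."""
    def decorator(factory):
        _BACKENDS[name] = factory
        return factory
    return decorator


class _StubBackend:
    name = "stub"

    def transcribe(self, audio: Sequence[float]) -> str:
        # A real backend would run inference here; the stub only reports input size.
        return f"{self.name}: {len(audio)} samples"


@register_backend("faster_whisper")
class FasterWhisperStub(_StubBackend):
    name = "faster_whisper"


@register_backend("tensorrt")  # assumed flag value
class TensorRTStub(_StubBackend):
    name = "tensorrt"


@register_backend("openvino")  # assumed flag value
class OpenVINOStub(_StubBackend):
    name = "openvino"


def available_backends() -> List[str]:
    return sorted(_BACKENDS)


def create_backend(name: str) -> ASRBackend:
    """Instantiate the backend selected by name, failing clearly on unknown names."""
    factory = _BACKENDS.get(name)
    if factory is None:
        raise ValueError(f"unknown backend {name!r}; choose from {available_backends()}")
    return factory()
```

Keeping selection in one registry means adding a backend does not change the server's connection handling. The server only sees the shared `transcribe` interface.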

How it's built

The project is written in Python, using `websockets` for communication, `pyaudio` for microphone capture, and `numpy` for audio buffer handling. Transcription is delegated to one of three ASR engines: `faster-whisper`, `tensorrt_llm`, or `openvino`. The browser clients are JavaScript extensions that use the Web Audio API and AudioWorklets to resample and buffer audio on the client before sending it as binary WebSocket frames.
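
The client-side preprocessing (downmix to mono, resample to 16 kHz, split into chunks) can be sketched in Python. This is a rough equivalent of what the extensions do in their AudioWorklets, not their actual code. Linear interpolation via `numpy.interp` is a simplification we chose. A production resampler would low-pass filter before downsampling to avoid aliasing. The 4096-sample chunk size and the int16-to-float32 scaling are assumptions.

```python
from typing import Iterator

import numpy as np

TARGET_RATE = 16_000  # 16 kHz mono, as the text describes


def pcm16_to_float32(pcm: np.ndarray) -> np.ndarray:
    """Scale signed 16-bit PCM into float32 samples in [-1.0, 1.0)."""
    return pcm.astype(np.float32) / 32768.0


def to_mono(frames: np.ndarray) -> np.ndarray:
    """Average channels: shape (n_samples, n_channels) -> (n_samples,)."""
    if frames.ndim == 1:
        return frames.astype(np.float32)
    return frames.mean(axis=1).astype(np.float32)


def resample_linear(audio: np.ndarray, src_rate: int, dst_rate: int = TARGET_RATE) -> np.ndarray:
    """Resample by linear interpolation (no anti-alias filter; a simplification)."""
    if src_rate == dst_rate:
        return audio.astype(np.float32)
    n_out = int(round(len(audio) * dst_rate / src_rate))
    src_times = np.arange(len(audio)) / src_rate
    dst_times = np.arange(n_out) / dst_rate
    return np.interp(dst_times, src_times, audio).astype(np.float32)


def chunk(audio: np.ndarray, chunk_size: int = 4096) -> Iterator[np.ndarray]:
    """Yield consecutive chunks; the last one may be shorter."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]
```

For example, one second of 48 kHz stereo becomes 16,000 mono samples, which `chunk` splits into three 4096-sample chunks and one shorter tail.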

How it runs

sequenceDiagram
    participant C as Client (JS/Python)
    participant S as WhisperLive Server
    participant B as ASR Backend
    C->>S: Connect WebSocket
    S-->>C: Connection Established
    loop Audio Stream
        C->>C: Capture & Resample to 16kHz Mono
        C->>S: Send Audio Chunk (Binary)
        S->>B: Process Chunk
        B-->>S: Return Transcript Segment
        S-->>C: Send Transcript (JSON)
    end
    C->>S: Close Connection
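
The loop above exchanges two kinds of frames: binary audio going up and JSON transcripts coming back. The sketch below builds and parses those payloads without opening a socket. The configuration field names (`uid`, `language`, `task`, `model`, `use_vad`), the `SERVER_READY` message, and the `segments` list are assumptions modeled on our reading of the Python client. Verify them against the WhisperLive version you run. The float32 payload format is also assumed.

```python
import json
import uuid
from typing import Any, Tuple

import numpy as np


def build_config(language: str = "en", task: str = "transcribe",
                 model: str = "small", use_vad: bool = True) -> str:
    """First text frame after connecting. ASSUMPTION: field names mirror the Python client."""
    return json.dumps({
        "uid": str(uuid.uuid4()),
        "language": language,
        "task": task,
        "model": model,
        "use_vad": use_vad,
    })


def encode_chunk(samples: np.ndarray) -> bytes:
    """Binary frame payload: little-endian float32 samples, 16 kHz mono (assumed format)."""
    return samples.astype("<f4").tobytes()


def parse_server_message(raw: str) -> Tuple[str, Any]:
    """Classify a JSON reply. ASSUMPTION: 'message'/'segments' keys as described above."""
    msg = json.loads(raw)
    if msg.get("message") == "SERVER_READY":
        return ("ready", None)
    if "segments" in msg:
        return ("segments", [seg["text"] for seg in msg["segments"]])
    return ("other", msg)
```

Keeping framing separate from the socket code makes it easy to test and to swap transports.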

How to apply & reuse

Deploy the server on hardware that matches a backend: TensorRT on NVIDIA GPUs, OpenVINO on Intel CPUs or GPUs, or Faster-Whisper on a CPU or a CUDA-capable NVIDIA GPU. Connect the browser extensions to transcribe meetings, lectures, or media playback directly in the browser. Integrate the Python client into desktop applications that need live captioning.
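
On a shared deployment, the "configurable connection limits" listed under Features decide who gets a transcription slot. Below is a minimal sketch of one such policy. It caps concurrent sessions, caps each session's duration, and tells a rejected client roughly how long to wait. `ClientLimiter` and its parameters are hypothetical and do not reflect WhisperLive's actual implementation. The injectable clock only makes the logic testable.

```python
import time
from typing import Callable, Dict, Optional


class ClientLimiter:
    """Hypothetical admission policy: at most max_clients sessions, each capped in duration."""

    def __init__(self, max_clients: int = 4, max_connection_time: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_clients = max_clients
        self.max_connection_time = max_connection_time
        self._clock = clock
        self._started: Dict[str, float] = {}  # uid -> admission time

    def try_admit(self, uid: str) -> Optional[float]:
        """Return None if admitted, else an estimated wait in seconds."""
        now = self._clock()
        if len(self._started) < self.max_clients:
            self._started[uid] = now
            return None
        # A slot frees at the latest when the oldest session reaches its time cap.
        oldest = min(self._started.values())
        return max(0.0, oldest + self.max_connection_time - now)

    def expired(self, uid: str) -> bool:
        """True once a session has used up its allowed connection time."""
        return self._clock() - self._started[uid] >= self.max_connection_time

    def release(self, uid: str) -> None:
        self._started.pop(uid, None)
```

The wait estimate is an upper bound. A client that disconnects early frees its slot sooner.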

At a glance

Capabilities: Live Speech-to-Text, Audio File Transcription, Multi-language Support, Translation to English, Voice Activity Detection

Components: run_server.py, whisper_live/server.py, Audio-Transcription-Chrome, Audio-Transcription-Firefox, whisper_live/client.py

Tech: Python, JavaScript, WebSockets, PyAudio, NumPy, Faster-Whisper, TensorRT, OpenVINO

Depends on: torch, faster-whisper, websockets, pyaudio, onnxruntime

Integrates with: Chrome Browser, Firefox Browser, NVIDIA GPUs, Intel CPUs/GPUs

Patterns: Client-Server, Streaming, Producer-Consumer, Plugin Architecture

Reuse tags: real-time, asr, websocket, transcription, audio-processing
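
The Producer-Consumer pattern listed above separates receiving audio from transcribing it, so a slow model does not stall the socket. Below is a minimal threaded sketch. A producer (standing in for the WebSocket receive loop) puts chunks on a bounded `queue.Queue`, and a consumer accumulates them into one-second windows. The window length, queue size, and stub "transcription" are assumptions for illustration, not WhisperLive's actual buffering logic.

```python
import queue
import threading
from typing import List, Optional

import numpy as np

SAMPLE_RATE = 16_000
END: Optional[np.ndarray] = None  # sentinel marking the end of the stream


def consumer(chunks: "queue.Queue[Optional[np.ndarray]]", results: List[str],
             window_seconds: float = 1.0) -> None:
    """Drain chunks and 'transcribe' each full window (stubbed as a sample count)."""
    window = int(SAMPLE_RATE * window_seconds)
    buffer = np.zeros(0, dtype=np.float32)
    while True:
        chunk = chunks.get()
        if chunk is END:
            break
        buffer = np.concatenate([buffer, chunk])
        while len(buffer) >= window:
            segment, buffer = buffer[:window], buffer[window:]
            results.append(f"transcribed {len(segment)} samples")  # stand-in for an ASR call
    if len(buffer):  # flush the final partial window
        results.append(f"transcribed {len(buffer)} samples")


def run_pipeline(audio: np.ndarray, chunk_size: int = 4096) -> List[str]:
    # Bounded queue: the producer blocks if the consumer falls behind (backpressure).
    chunks: "queue.Queue[Optional[np.ndarray]]" = queue.Queue(maxsize=32)
    results: List[str] = []
    worker = threading.Thread(target=consumer, args=(chunks, results))
    worker.start()
    for start in range(0, len(audio), chunk_size):  # producer side
        chunks.put(audio[start:start + chunk_size])
    chunks.put(END)
    worker.join()
    return results
```

For 40,000 samples, this yields two full one-second windows and a final 8,000-sample remainder.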