Voice Print

Real-time speaker diarization and voice fingerprinting system using multiple embedding models.

https://github.com/davidbmar/voice-print  ·  public  ·  shipped

Voice Print screenshot

What it is

A full-stack application that identifies who is speaking in live audio streams or recorded files. It combines browser-native Whisper transcription (via WebGPU/WASM) with a Python backend for speaker embedding extraction and incremental clustering, allowing users to benchmark different AI models side-by-side without sending audio to external cloud APIs.

Features

Quickstart

git clone https://github.com/davidbmar/voice-print.git
cd voice-print
python3 -m venv .venv
source .venv/bin/activate
pip install -r backend/requirements.txt
bash scripts/serve.sh

Architecture

flowchart TD
    subgraph Client [Browser]
        Mic[Microphone Input]
        AW[AudioWorklet]
        UI[Frontend UI]
        Whisper[Whisper WASM/WebGPU]
    end
    subgraph Server [FastAPI Backend]
        WS[WebSocket Handler]
        RB[Ring Buffer 30s]
        SBD[Sentence Boundary Detection]
        Emb[Embedding Extractor]
        Cluster[Speaker Clusterer]
        ModelReg[Model Registry]
    end
    Mic -->|16kHz PCM| AW
    AW -->|Binary Frames| WS
    WS --> RB
    RB --> SBD
    SBD -->|Audio Segment| Emb
    Emb -->|Load Model| ModelReg
    Emb -->|Vector| Cluster
    Cluster -->|Speaker ID| WS
    WS -->|JSON Label| UI
    UI -->|Click Replay| Whisper

How it's built

The frontend uses vanilla HTML/JS with AudioWorklets for low-latency microphone capture and WebSocket binary frames for streaming. The backend is a FastAPI server managing a ring buffer, Faster Whisper for transcription, and a registry of five speaker embedding models (WeSpeaker, CAM++, Resemblyzer, SpeechBrain ECAPA, Pyannote). Clustering is performed incrementally using cosine similarity against speaker centroids.

How it runs

sequenceDiagram
    participant Browser
    participant AudioWorklet
    participant FastAPI
    participant RingBuffer
    participant Embedder
    participant Clusterer
    Browser->>AudioWorklet: Start Mic Capture
    loop Every 100ms
        AudioWorklet->>FastAPI: Send Binary PCM Chunk (WS)
        FastAPI->>RingBuffer: Append Audio Data
        RingBuffer->>FastAPI: Check Sentence Boundary
        alt Sentence Complete
            FastAPI->>Embedder: Extract Audio Segment
            Embedder->>Embedder: Compute Fbank/Features
            Embedder->>Embedder: Run ONNX/PyTorch Model
            Embedder->>Clusterer: Return Embedding Vector
            Clusterer->>Clusterer: Cosine Similarity Check
            Clusterer->>FastAPI: Assign Speaker ID
            FastAPI->>Browser: JSON {text, speaker_id}
        end
    end

How to apply & reuse

Use this project as a reference for building real-time audio analysis pipelines. It demonstrates how to handle binary WebSocket streams, implement lazy-loading model registries in Python, perform incremental clustering without pre-defined speaker counts, and run heavy inference tasks (Whisper) entirely in the browser to reduce server load.

At a glance

CapabilitiesReal-time audio streamingSpeaker diarizationVoice fingerprintingOffline transcriptionModel benchmarkingIncremental clustering
ComponentsFastAPI BackendAudioWorklet FrontendSpeaker ClustererModel RegistryRing BufferWhisper WASM
TechPython 3.11+FastAPIWebSocketsONNX RuntimePyTorchWebGPUWebAssemblyNumPySciPy
Depends onfaster-whisperonnxruntimetorchscipynumpykaldi-native-fbankpyannote.audiospeechbrainresemblyzer
Integrates withHugging Face ModelsBrowser Microphone APILocal File System
PatternsLazy LoadingProducer-ConsumerStrategy Pattern (Models)Incremental ClusteringClient-Side Inference
Reuse tagsaudio-processingspeaker-diarizationreal-time-streamingmachine-learningwebsocketsfastapionnxvoice-id

Repo hygiene

✓ all on main — nothing unmerged.