transcription-realtime-whisper

Production-ready real-time speech transcription system supporting WhisperLive edge proxy and NVIDIA Riva enterprise architectures.

https://github.com/davidbmar/transcription-realtime-whisper  ·  public  ·  shipped

What it is

A deployment automation framework for real-time speech-to-text services. It offers two distinct backends: an open-source WhisperLive setup using faster-whisper on GPU instances behind a Caddy reverse proxy, and an enterprise-grade NVIDIA Riva implementation using Conformer-CTC models. The system handles audio streaming from browser clients via WebSocket, manages SSL termination at the edge, and provides systemd-managed services for high availability.

Quickstart

git clone https://github.com/davidbmar/transcription-realtime-whisper.git
cd transcription-realtime-whisper
./scripts/010-setup-build-box.sh
./scripts/020-deploy-gpu-instance.sh
./scripts/030-configure-gpu-security.sh
./scripts/305-setup-whisperlive-edge.sh
./scripts/310-configure-whisperlive-gpu.sh
./scripts/040-configure-edge-security.sh
./scripts/320-update-edge-clients.sh
./scripts/315-test-whisperlive-connection.sh
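
The Quickstart steps run in the order listed, and each later step presumably relies on the earlier ones having succeeded. As a minimal sketch, the Python runner below stops at the first script that exits non-zero. The names `run_in_order` and `QUICKSTART_SCRIPTS`, and the injectable `run` callable, are hypothetical helpers for illustration and are not part of the repository; only the script paths come from the list above.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# Script order copied verbatim from the Quickstart list.
QUICKSTART_SCRIPTS: List[str] = [
    "./scripts/010-setup-build-box.sh",
    "./scripts/020-deploy-gpu-instance.sh",
    "./scripts/030-configure-gpu-security.sh",
    "./scripts/305-setup-whisperlive-edge.sh",
    "./scripts/310-configure-whisperlive-gpu.sh",
    "./scripts/040-configure-edge-security.sh",
    "./scripts/320-update-edge-clients.sh",
    "./scripts/315-test-whisperlive-connection.sh",
]


def run_in_order(
    scripts: Sequence[str],
    run: Optional[Callable[[str], int]] = None,
) -> List[str]:
    """Run each script in sequence and stop at the first non-zero exit.

    `run` takes a script path and returns its exit code. It is injectable
    (an assumption of this sketch) so the loop can be exercised without
    touching real infrastructure. Returns the scripts that succeeded.
    """
    if run is None:
        # Default: execute the script and report its exit code.
        run = lambda path: subprocess.run([path]).returncode

    completed: List[str] = []
    for path in scripts:
        exit_code = run(path)
        if exit_code != 0:
            print(f"{path} exited with {exit_code}; stopping")
            break
        completed.append(path)
    return completed
```

Calling `run_in_order(QUICKSTART_SCRIPTS)` from the repository root would execute the real scripts, so it is worth trying the loop with a stub `run` first.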

Architecture

flowchart TD
    Client[Browser Client] -->|HTTPS/WSS Audio Stream| Edge[Edge EC2: Caddy Proxy]
    Edge -->|Internal WebSocket| GPU[GPU EC2: WhisperLive/Riva]
    GPU -->|Transcription Text| Edge
    Edge -->|Transcription Text| Client
    subgraph Infrastructure
        Edge
        GPU
    end
    subgraph Models
        Whisper[faster-whisper]
        Riva[NVIDIA Riva Conformer-CTC]
    end
    GPU --> Whisper
    GPU --> Riva

How it's built

The project is built primarily with Bash scripts that provision infrastructure and configure services on AWS EC2. The WhisperLive architecture uses Python (faster-whisper) for inference, Caddy for reverse proxying and SSL termination, and standard WebSockets for client communication. The NVIDIA Riva architecture runs the Riva server in Docker containers and uses a Python-based WebSocket-to-gRPC bridge. Authentication and session management components, found in the audio-api subdirectory, are implemented as Node.js/TypeScript AWS Lambda functions that interact with S3 and Cognito.

How it runs

sequenceDiagram
    participant Browser as Browser Client
    participant Caddy as Edge EC2 (Caddy)
    participant Service as GPU EC2 (WhisperLive/Riva)
    
    Browser->>Caddy: Connect WSS (Audio Stream)
    Caddy->>Service: Forward WebSocket Connection
    Service-->>Caddy: Acknowledge Connection
    Caddy-->>Browser: Connection Established
    
    loop Real-time Transcription
        Browser->>Caddy: Send Audio Chunk (PCM)
        Caddy->>Service: Forward Audio Chunk
        Service->>Service: Process ASR Inference
        Service-->>Caddy: Return Partial/Final Transcript
        Caddy-->>Browser: Return Transcript JSON
    end
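
The loop in the diagram has two data-handling steps on the client side: cutting the PCM stream into chunks and reading transcript messages that come back. The sketch below illustrates both in Python without any networking. The audio format (16 kHz, 16-bit mono), the 100 ms chunk length, and the JSON field names `text` and `is_final` are all assumptions made for illustration; the actual message schema used by WhisperLive or the Riva bridge may differ.

```python
import json
from typing import Iterator, Tuple

SAMPLE_RATE_HZ = 16_000  # assumed sample rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM
CHUNK_MS = 100           # assumed chunk duration


def pcm_chunks(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of a raw PCM buffer.

    The final slice may be shorter than the others.
    """
    size = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


def parse_transcript(message: str) -> Tuple[str, bool]:
    """Extract (text, is_final) from a transcript JSON message.

    The field names are hypothetical; missing fields fall back to
    an empty string and False.
    """
    payload = json.loads(message)
    return payload.get("text", ""), bool(payload.get("is_final", False))
```

In a real client, each chunk from `pcm_chunks` would be sent as a binary WebSocket frame, and every text frame received would go through something like `parse_transcript` to separate partial results from final ones.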

How to apply & reuse

Use this repository to deploy a scalable, low-latency transcription service on AWS. It suits live captioning, meeting transcription, and voice command interfaces where data privacy (self-hosted) or enterprise accuracy (Riva) is critical. Because the scripts are modular, you can deploy only the lightweight WhisperLive edge or only the heavier Riva backend.

At a glance

Capabilities: Real-time Speech Recognition, WebSocket Streaming, SSL/TLS Termination, GPU Acceleration, Automated Deployment, Service Health Monitoring
Components: Caddy Reverse Proxy, WhisperLive Server, NVIDIA Riva Server, WebSocket Bridge (Python), Deployment Scripts (Bash), Browser Client UI
Tech: Bash, Python, Node.js, TypeScript, Docker, WebSockets, gRPC, AWS EC2
Depends on: AWS CLI, SSH Key Pair, NVIDIA Drivers (for Riva), Docker Engine, Node.js (for API components)
Integrates with: AWS Cognito, AWS S3, NVIDIA NGC, Chrome/Firefox/Edge Browsers
Patterns: Edge Computing, Reverse Proxy, Microservices, Event-Driven Architecture, Infrastructure as Code (Scripted)
Reuse tags: speech-to-text, real-time, aws-deployment, whisper, nvidia-riva, websocket-streaming
