Production-ready real-time speech transcription system supporting WhisperLive edge proxy and NVIDIA Riva enterprise architectures.
https://github.com/davidbmar/transcription-realtime-whisper · public · shipped
A deployment automation framework for real-time speech-to-text services. It offers two distinct backends: an open-source WhisperLive setup using faster-whisper on GPU instances behind a Caddy reverse proxy, and an enterprise-grade NVIDIA Riva implementation using Conformer-CTC models. The system handles audio streaming from browser clients via WebSocket, manages SSL termination at the edge, and provides systemd-managed services for high availability.
git clone https://github.com/davidbmar/transcription-realtime-whisper.git
cd transcription-realtime-whisper
./scripts/010-setup-build-box.sh
./scripts/020-deploy-gpu-instance.sh
./scripts/030-configure-gpu-security.sh
./scripts/305-setup-whisperlive-edge.sh
./scripts/310-configure-whisperlive-gpu.sh
./scripts/040-configure-edge-security.sh
./scripts/320-update-edge-clients.sh
./scripts/315-test-whisperlive-connection.sh
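The numbered scripts are meant to run in sequence, so a later step should not start if an earlier one failed. A minimal sketch of a fail-fast wrapper follows, assuming it is launched from the repository root after cloning. The `run_steps` helper and the `WHISPERLIVE_STEPS` name are hypothetical and not part of the repository; only the script paths and their order come from the quick-start list.

```python
import subprocess
from typing import Callable, List, Sequence

# Script order copied from the quick-start sequence; the list name is hypothetical.
WHISPERLIVE_STEPS = [
    "./scripts/010-setup-build-box.sh",
    "./scripts/020-deploy-gpu-instance.sh",
    "./scripts/030-configure-gpu-security.sh",
    "./scripts/305-setup-whisperlive-edge.sh",
    "./scripts/310-configure-whisperlive-gpu.sh",
    "./scripts/040-configure-edge-security.sh",
    "./scripts/320-update-edge-clients.sh",
    "./scripts/315-test-whisperlive-connection.sh",
]


def run_steps(steps: Sequence[str], run: Callable = subprocess.run) -> List[str]:
    """Run each script in order and stop at the first non-zero exit code.

    `run` is injectable so the ordering logic can be exercised without
    touching real infrastructure; by default it is subprocess.run.
    """
    completed: List[str] = []
    for step in steps:
        result = run([step])
        if result.returncode != 0:
            raise RuntimeError(
                f"{step} exited with code {result.returncode}; "
                f"completed steps: {completed}"
            )
        completed.append(step)
    return completed
```

Calling `run_steps(WHISPERLIVE_STEPS)` would execute the WhisperLive path end to end and raise as soon as one script fails, which keeps a half-configured GPU instance from being wired into the edge proxy.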
flowchart TD
Client[Browser Client] -->|HTTPS/WSS Audio Stream| Edge[Edge EC2: Caddy Proxy]
Edge -->|Internal WebSocket| GPU[GPU EC2: WhisperLive/Riva]
GPU -->|Transcription Text| Edge
Edge -->|Transcription Text| Client
subgraph Infrastructure
Edge
GPU
end
subgraph Models
Whisper[faster-whisper]
Riva[NVIDIA Riva Conformer-CTC]
end
GPU --> Whisper
GPU --> Riva
The project is built mainly from Bash scripts that provision infrastructure and configure services on AWS EC2. The WhisperLive architecture uses Python (faster-whisper) for inference, Caddy for reverse proxying and SSL termination, and standard WebSockets for client communication. The NVIDIA Riva architecture runs the Riva server in Docker containers and uses a Python-based WebSocket-to-gRPC bridge. Authentication and session management components (in the audio-api subdirectory) are implemented as Node.js/TypeScript AWS Lambda functions that interact with S3 and Cognito.
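The WebSocket-to-gRPC bridge mentioned for the Riva path can be pictured as a small relay: audio frames arrive from the WebSocket, are fed into a streaming recognizer, and transcripts flow back to the client. The sketch below shows only that relay pattern with `asyncio`. It does not use the Riva client library or any real WebSocket package, and `bridge`, `queue_reader`, `recognize`, and `send` are hypothetical names standing in for whatever the repository actually uses.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Optional


async def queue_reader(queue: asyncio.Queue) -> AsyncIterator[bytes]:
    """Yield audio frames from the queue until the None end-of-stream sentinel."""
    while True:
        frame: Optional[bytes] = await queue.get()
        if frame is None:
            return
        yield frame


async def bridge(
    ws_frames: AsyncIterator[bytes],
    recognize: Callable[[AsyncIterator[bytes]], AsyncIterator[str]],
    send: Callable[[str], Awaitable[None]],
) -> None:
    """Relay frames from a WebSocket-like source into a streaming recognizer.

    `recognize` stands in for a streaming gRPC call: it consumes an async
    iterator of audio frames and yields transcripts as they become available.
    """
    # Unbounded queue for simplicity (assumption); a real bridge would likely
    # bound it to apply backpressure to the client.
    queue: asyncio.Queue = asyncio.Queue()

    async def pump() -> None:
        try:
            async for frame in ws_frames:
                await queue.put(frame)
        finally:
            # Signal end of audio even if the socket closes abnormally.
            await queue.put(None)

    pump_task = asyncio.create_task(pump())
    try:
        async for transcript in recognize(queue_reader(queue)):
            await send(transcript)
    finally:
        pump_task.cancel()
        await asyncio.gather(pump_task, return_exceptions=True)
```

Decoupling the socket reader from the recognizer with a queue means a slow inference step does not block reads from the client, at the cost of buffering audio in memory.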
sequenceDiagram
participant Browser as Browser Client
participant Caddy as Edge EC2 (Caddy)
participant Service as GPU EC2 (WhisperLive/Riva)
Browser->>Caddy: Connect WSS (Audio Stream)
Caddy->>Service: Forward WebSocket Connection
Service-->>Caddy: Acknowledge Connection
Caddy-->>Browser: Connection Established
loop Real-time Transcription
Browser->>Caddy: Send Audio Chunk (PCM)
Caddy->>Service: Forward Audio Chunk
Service->>Service: Process ASR Inference
Service-->>Caddy: Return Partial/Final Transcript
Caddy-->>Browser: Return Transcript JSON
end
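The loop in the sequence diagram has two client-side halves: cutting captured audio into PCM chunks to send, and folding partial and final transcripts into the text shown to the user. A rough sketch of both follows. The 16 kHz sample rate, 100 ms chunk size, 16-bit little-endian encoding, and the `{"text": ..., "is_final": ...}` message shape are all assumptions for illustration; the source states only that chunks are PCM and that transcripts come back as JSON, so the actual wire format of WhisperLive or the Riva bridge may differ.

```python
import json
import struct
from typing import Iterator, List, Sequence

SAMPLE_RATE = 16_000  # assumption: 16 kHz mono audio
CHUNK_MS = 100        # assumption: 100 ms of audio per WebSocket message


def pcm16_chunks(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    chunk_ms: int = CHUNK_MS,
) -> Iterator[bytes]:
    """Yield little-endian 16-bit PCM chunks from float samples in [-1, 1]."""
    per_chunk = sample_rate * chunk_ms // 1000
    for start in range(0, len(samples), per_chunk):
        window = samples[start:start + per_chunk]
        # Clamp out-of-range samples before scaling to the int16 range.
        ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in window]
        yield struct.pack(f"<{len(ints)}h", *ints)


def apply_transcript(finals: List[str], message: str) -> str:
    """Fold one transcript message into the caption text.

    Assumed (hypothetical) schema: {"text": str, "is_final": bool}.
    Final segments are appended to `finals`; a partial segment is shown
    after them but replaced by the next message.
    """
    payload = json.loads(message)
    text = payload["text"].strip()
    if payload.get("is_final", False):
        if text:
            finals.append(text)
        return " ".join(finals)
    return " ".join(finals + ([text] if text else []))
```

Keeping finalized segments separate from the current partial lets the display revise the in-progress phrase without rewriting text the service has already committed.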
Use this repository to deploy a scalable, low-latency transcription service on AWS. It is suitable for applications requiring live captioning, meeting transcription, or voice command interfaces where data privacy (self-hosted) or enterprise accuracy (Riva) is critical. The modular script structure allows for selective deployment of either the lightweight WhisperLive edge or the heavy-duty Riva backend.