Deployable real-time speech-to-text service using OpenAI Whisper on RunPod GPU infrastructure.
https://github.com/davidbmar/whisperlive-runpod · public · shipped
A deployment wrapper and configuration toolkit for running Collabora's WhisperLive on RunPod. It provides scripts to build optimized Docker images (a slim image, or a full image with speaker diarization), push them to a registry, and deploy them to RunPod GPU instances. The system exposes a WebSocket API for real-time audio transcription and HTTP endpoints for health monitoring.
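Because the model has to load before the WebSocket API can serve audio, a deploy script or client usually polls the health endpoint until the server reports ready. The sketch below shows that polling loop using only the standard library. It assumes a hypothetical `/health` route returning JSON with a `"status": "ready"` field; the actual route and response shape of the health service are not documented here and should be checked against the repository.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Hypothetical route and response field: the project exposes HTTP health
# endpoints, but this exact path and JSON shape are assumptions.
HEALTH_PATH = "/health"


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """GET a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_ready(
    base_url: str,
    fetch: Callable[[str], dict] = fetch_json,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    """Poll the health endpoint until it reports ready; return None on timeout."""
    deadline = clock() + timeout_s
    url = base_url.rstrip("/") + HEALTH_PATH
    while True:
        try:
            status = fetch(url)
            # Assumed readiness signal: {"status": "ready", ...}
            if status.get("status") == "ready":
                return status
        except (OSError, ValueError):
            # Refused connections, timeouts, and partial bodies are expected
            # while the container starts and the Whisper model loads.
            pass
        if clock() >= deadline:
            return None
        sleep(interval_s)
```

Injecting `fetch`, `sleep`, and `clock` keeps the loop testable without a live pod; in practice you would call `wait_until_ready("https://<pod-host>")` with the defaults.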
./scripts/000-questions.sh
./scripts/200-build-image-local.sh --slim
export DOCKER_PASSWORD='your-docker-hub-token'
./scripts/205-push-to-registry.sh --slim
./scripts/210-deploy-to-runpod.sh
./scripts/215-test-runpod-health.sh
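The quick-start steps run in a fixed order, and the push step needs `DOCKER_PASSWORD` in the environment. As an illustration only (the repository drives these steps directly from the shell), the sketch below runs the same scripts in sequence from Python, checks for the registry token up front, and stops at the first failing step. The `run_pipeline` helper and its `dry_run` switch are hypothetical.

```python
import os
import subprocess
from typing import Dict, List, Optional

# Order and flags copied from the quick-start; driving them from Python is
# a hypothetical convenience, not part of the repository.
PIPELINE = [
    ["./scripts/000-questions.sh"],
    ["./scripts/200-build-image-local.sh", "--slim"],
    ["./scripts/205-push-to-registry.sh", "--slim"],
    ["./scripts/210-deploy-to-runpod.sh"],
    ["./scripts/215-test-runpod-health.sh"],
]


def run_pipeline(dry_run: bool = True, env: Optional[Dict[str, str]] = None) -> List[List[str]]:
    """Run (or, with dry_run, just list) the deployment steps in order."""
    env = dict(os.environ if env is None else env)
    # Fail early rather than after a long image build.
    if "DOCKER_PASSWORD" not in env:
        raise RuntimeError("export DOCKER_PASSWORD before pushing to the registry")
    executed = []
    for cmd in PIPELINE:
        executed.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError at the first failing script.
            subprocess.run(cmd, check=True, env=env)
    return executed
```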
flowchart TD
Client[Client App] -->|WSS:443| Proxy[RunPod Proxy]
Proxy -->|WS:9090| Server[WhisperLive Server]
Server -->|Load Model| Whisper[OpenAI Whisper]
Server -->|GPU Compute| GPU[NVIDIA GPU]
Monitor[Monitoring System] -->|HTTP:9999| Health[Health Check Service]
Health -->|Query Status| Server
Health -->|nvidia-smi| GPU
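In the flowchart, clients never reach port 9090 directly: they connect over WSS on port 443 to the RunPod proxy, which forwards to the WebSocket port inside the pod, and the monitor reaches the health service on port 9999 the same way. The sketch below derives both public URLs from a pod ID, assuming RunPod's HTTP proxy hostname pattern `<pod-id>-<port>.proxy.runpod.net`; confirm that pattern against RunPod's current documentation for your pod type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodEndpoints:
    transcription: str  # WebSocket URL clients open (proxied to WS:9090)
    health: str         # HTTPS URL for the health service (proxied to HTTP:9999)


def runpod_endpoints(pod_id: str, ws_port: int = 9090, health_port: int = 9999) -> PodEndpoints:
    """Build public proxy URLs for a pod; the hostname pattern is an assumption."""
    if not pod_id:
        raise ValueError("pod_id must be non-empty")

    def host(port: int) -> str:
        # The proxy terminates TLS on 443 and forwards to the given internal port.
        return f"{pod_id}-{port}.proxy.runpod.net"

    return PodEndpoints(
        transcription=f"wss://{host(ws_port)}",
        health=f"https://{host(health_port)}",
    )
```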
A Python server wrapping the WhisperLive library, containerized with Docker. Deployment is driven by numbered shell scripts in an infrastructure-as-code style: building the image, pushing it to a registry, and deploying it through the RunPod API. The server supports multiple backends (faster_whisper, tensorrt) and includes a separate health-check microservice listening on port 9999.
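The flowchart shows the health service reading GPU state through `nvidia-smi`. A minimal sketch of that step, assuming the service shells out to `nvidia-smi` with its CSV query mode (the project's actual fields and output format may differ), looks like this. Parsing is split from the subprocess call so it can be tested without a GPU.

```python
import subprocess
from typing import Dict, List, Optional

# Assumed query: these are standard nvidia-smi --query-gpu fields, but which
# fields the project's health service reports is not specified.
QUERY_FIELDS = ["name", "memory.used", "memory.total", "utilization.gpu"]
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=" + ",".join(QUERY_FIELDS),
    "--format=csv,noheader,nounits",
]


def _to_int(field: str) -> Optional[int]:
    # nvidia-smi prints "[N/A]" or "[Not Supported]" for unavailable metrics.
    return None if field.startswith("[") else int(field)


def parse_gpu_csv(text: str) -> List[Dict[str, object]]:
    """Parse one CSV line per GPU into a dict (memory in MiB, utilization in %)."""
    gpus = []
    for line in text.strip().splitlines():
        # rsplit keeps any comma inside the GPU name attached to the name.
        name, used, total, util = [f.strip() for f in line.rsplit(",", 3)]
        gpus.append({
            "name": name,
            "memory_used_mib": _to_int(used),
            "memory_total_mib": _to_int(total),
            "utilization_pct": _to_int(util),
        })
    return gpus


def gpu_status() -> List[Dict[str, object]]:
    """Query the local GPUs; raises if nvidia-smi is missing or fails."""
    out = subprocess.run(NVIDIA_SMI_CMD, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```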
sequenceDiagram
participant C as Client
participant P as RunPod Proxy
participant S as WhisperLive Server
participant H as Health Service
participant W as Whisper Model
Note over C, H: Deployment Phase
H->>S: Check readiness
S->>W: Load Model (small.en)
W-->>S: Model Loaded
S-->>H: Ready
Note over C, W: Transcription Phase
C->>P: Connect WebSocket (wss://...)
P->>S: Forward Connection
S-->>C: Connection Accepted
loop Audio Stream
C->>S: Send Audio Chunk (100ms)
S->>W: Transcribe Chunk
W-->>S: Text Result
S-->>C: Send Transcription
end
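The transcription phase streams audio in 100 ms chunks after an initial handshake. The sketch below prepares both sides of that exchange without opening a socket: a JSON configuration message and a generator that turns 16-bit PCM into float32 chunks. It assumes the wire format of the upstream WhisperLive Python client (a JSON config with `uid`, `language`, `task`, `model`, and `use_vad`, followed by raw float32 samples at 16 kHz mono); these field names and the sample format should be verified against the WhisperLive version baked into the image. At 16 kHz, 100 ms is 1,600 samples, or 6,400 bytes as float32.

```python
import json
import uuid
from array import array
from typing import Iterator

SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio
CHUNK_MS = 100        # chunk duration shown in the sequence diagram


def config_message(model: str = "small.en", language: str = "en", use_vad: bool = True) -> str:
    """First message on the socket; field names modeled on the upstream client (assumption)."""
    return json.dumps({
        "uid": str(uuid.uuid4()),
        "language": language,
        "task": "transcribe",
        "model": model,
        "use_vad": use_vad,
    })


def pcm16_to_float32_chunks(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Split native-endian int16 PCM into float32 chunks scaled to [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(pcm)  # raises ValueError if pcm has an odd byte count
    per_chunk = SAMPLE_RATE * chunk_ms // 1000  # 1600 samples per 100 ms
    for start in range(0, len(samples), per_chunk):
        block = samples[start:start + per_chunk]
        # The last chunk may be shorter than per_chunk.
        yield array("f", (s / 32768.0 for s in block)).tobytes()
```

A client would send `config_message()` once, wait for the server's acknowledgement, and then send each chunk from `pcm16_to_float32_chunks` as a binary WebSocket frame.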
Use this project to stand up a transcription endpoint on cloud GPUs without managing a Kubernetes cluster. It suits applications that need low-latency speech-to-text and that want to pay for GPU time only while it is being used, through RunPod's serverless or pod model.