whisperlive-runpod · davidbmar.com

What it is

A deployment wrapper and configuration toolkit for running Collabora's WhisperLive on RunPod. It provides scripts to build optimized Docker images (slim or full with diarization), push them to a registry, and deploy them to RunPod GPU instances. The system exposes a WebSocket API for real-time audio transcription and HTTP endpoints for health monitoring.
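
As a rough illustration of how a client might address the two interfaces, the sketch below builds a WebSocket URL and a health URL for a pod. It assumes RunPod's usual proxy hostname pattern (`<pod-id>-<port>.proxy.runpod.net`) and the ports 9090 and 9999 from the architecture diagram. The `/health` path, the `PodEndpoints` type, and the `pod_endpoints` function are hypothetical names chosen for this example, not part of the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodEndpoints:
    websocket_url: str
    health_url: str


def pod_endpoints(pod_id: str, ws_port: int = 9090, health_port: int = 9999,
                  health_path: str = "/health") -> PodEndpoints:
    """Build proxy URLs for a pod.

    The hostname pattern and the health path are assumptions for this sketch.
    """
    if not pod_id or "/" in pod_id:
        raise ValueError("pod_id must be a non-empty identifier")
    host = "proxy.runpod.net"
    return PodEndpoints(
        # The proxy terminates TLS, so the client side uses wss:// and https://.
        websocket_url=f"wss://{pod_id}-{ws_port}.{host}",
        health_url=f"https://{pod_id}-{health_port}.{host}{health_path}",
    )
```

Calling `pod_endpoints("abc123")` with a placeholder pod ID yields one URL for the transcription socket and one for the health service; check both against the values your own deployment reports.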

Features

Real-time WebSocket-based speech-to-text transcription
Automated deployment scripts for RunPod GPU instances
Optimized Docker images (slim for transcription, full for diarization)
Built-in health check and readiness endpoints for orchestration
Support for multiple Whisper backends (Faster-Whisper, TensorRT)
Configurable model sizes and compute precision for cost/performance tuning

Quickstart

./scripts/000-questions.sh
./scripts/200-build-image-local.sh --slim
export DOCKER_PASSWORD='your-docker-hub-token'
./scripts/205-push-to-registry.sh --slim
./scripts/210-deploy-to-runpod.sh
./scripts/215-test-runpod-health.sh
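
The final step checks that the deployed pod is healthy. The contents of that script are not shown here, so the following is only a sketch of the general idea in Python: poll an HTTP endpoint until it answers, because a freshly started pod may need time to pull the image and load the model. The function names, the retry counts, and the "HTTP 200 means ready" convention are assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30,
                     delay_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe up to `attempts` times, sleeping between failed attempts."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:  # no need to sleep after the last try
            sleep(delay_s)
    return False
```

A caller would wrap the probe in a lambda, for example `wait_until_ready(lambda: http_probe(health_url))`, where `health_url` is whatever health address your deployment exposes.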

Architecture

flowchart TD
    Client[Client App] -->|WSS:443| Proxy[RunPod Proxy]
    Proxy -->|WS:9090| Server[WhisperLive Server]
    Server -->|Load Model| Whisper[OpenAI Whisper]
    Server -->|GPU Compute| GPU[NVIDIA GPU]
    Monitor[Monitoring System] -->|HTTP:9999| Health[Health Check Service]
    Health -->|Query Status| Server
    Health -->|nvidia-smi| GPU
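
The diagram shows the health service querying the GPU through `nvidia-smi`. How the project's own health check does this is not shown, so the sketch below is one plausible approach: run `nvidia-smi` with its CSV query flags and parse the result. The `GpuStatus` type and the choice of fields are assumptions made for this example.

```python
import subprocess
from dataclasses import dataclass
from typing import List

# Fields requested from nvidia-smi; the selection is an assumption for this sketch.
QUERY = "name,memory.used,memory.total,utilization.gpu"


@dataclass
class GpuStatus:
    name: str
    memory_used_mib: int
    memory_total_mib: int
    utilization_pct: int


def parse_gpu_csv(output: str) -> List[GpuStatus]:
    """Parse `--format=csv,noheader,nounits` output, one GPU per line.

    Raises ValueError if a field is not numeric (some GPUs report "[N/A]").
    """
    gpus = []
    for line in output.strip().splitlines():
        name, used, total, util = [field.strip() for field in line.split(",")]
        gpus.append(GpuStatus(name, int(used), int(total), int(util)))
    return gpus


def query_gpus() -> List[GpuStatus]:
    """Run nvidia-smi and return the parsed status of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing on a machine without a GPU.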

How it's built

The server is written in Python, wraps the WhisperLive library, and is packaged as a Docker container. Shell scripts handle deployment in an infrastructure-as-code style: they build the image, push it to a registry, and deploy it through the RunPod API. The server supports multiple backends (faster_whisper, tensorrt), and a separate health-check microservice runs on port 9999.
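
To make the idea of a separate health-check service concrete, here is a minimal sketch using only the Python standard library. It is not the project's `healthcheck.py`: the `/health` path, the JSON body, and the rule "ready means the transcription port accepts TCP connections" are all assumptions chosen for illustration.

```python
import json
import socket
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def make_handler(server_host: str = "127.0.0.1", server_port: int = 9090):
    """Create a handler class that reports on the given transcription port."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":  # hypothetical path
                self.send_error(404)
                return
            ready = port_open(server_host, server_port)
            body = json.dumps({"ready": ready}).encode()
            # 503 tells an orchestrator to keep waiting or to restart the pod.
            self.send_response(200 if ready else 503)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the sketch quiet

    return HealthHandler


def serve(port: int = 9999) -> None:
    """Serve health checks on all interfaces until interrupted."""
    ThreadingHTTPServer(("0.0.0.0", port), make_handler()).serve_forever()
```

Running the health service as its own process means it can still answer while the transcription server is busy loading a model or has crashed.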

How it runs

sequenceDiagram
    participant C as Client
    participant P as RunPod Proxy
    participant S as WhisperLive Server
    participant H as Health Service
    participant W as Whisper Model

    Note over C, H: Deployment Phase
    H->>S: Check readiness
    S->>W: Load Model (small.en)
    W-->>S: Model Loaded
    S-->>H: Ready

    Note over C, W: Transcription Phase
    C->>P: Connect WebSocket (wss://...)
    P->>S: Forward Connection
    S-->>C: Connection Accepted
    loop Audio Stream
        C->>S: Send Audio Chunk (100ms)
        S->>W: Transcribe Chunk
        W-->>S: Text Result
        S-->>C: Send Transcription
    end
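
The audio loop above sends 100 ms chunks. As a worked example of what that means in bytes, the sketch below slices a raw PCM buffer into 100 ms frames. It assumes 16 kHz mono audio encoded as 32-bit floats; the 16 kHz rate matches what Whisper models consume, while the float32 wire format and the function names are assumptions for this example.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # Whisper models consume 16 kHz audio
BYTES_PER_SAMPLE = 4     # assumption: float32 mono samples on the wire
CHUNK_MS = 100           # chunk duration from the sequence diagram


def chunk_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     chunk_ms: int = CHUNK_MS,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Bytes needed to hold `chunk_ms` milliseconds of mono audio."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `pcm`; the last one may be shorter."""
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

Under these assumptions a chunk holds 1,600 samples, or 6,400 bytes, and one second of audio becomes ten chunks. A client would send each chunk as a binary WebSocket message and read transcription messages back on the same connection.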

How to apply & reuse

Use this project to stand up a GPU-backed transcription endpoint in the cloud without managing a Kubernetes cluster. It suits applications that need low-latency speech-to-text and want to pay only for active GPU time through RunPod's serverless or pod model.

At a glance

Capabilities: Real-time transcription, Speaker diarization (full image), Multi-language support, Health monitoring, GPU acceleration

Components: run_server.py, run_client.py, healthcheck.py, deployment scripts, Dockerfile

Tech: Python, Docker, WebSockets, RunPod API, Whisper, Faster-Whisper

Depends on: RunPod Account, Docker Hub Account, NVIDIA GPU, Collabora WhisperLive

Integrates with: RunPod Platform, Docker Registry, WebSocket Clients

Patterns: Microservices, Containerization, Infrastructure as Code, Health Check Pattern

Reuse tags: speech-to-text, gpu-deployment, runpod, whisper, real-time