whisperX-runpod · davidbmar.com

What it is

A deployment toolkit that packages WhisperX (fast automatic speech recognition) with pyannote (speaker diarization) into a Docker container. It provides shell scripts to build the image, push it to Docker Hub, and deploy it either to AWS EC2 for testing or to RunPod for cost-effective production transcription over an HTTP API.

Features

35x realtime transcription using the faster-whisper backend
Word-level timestamp alignment via wav2vec2
Automatic speaker diarization using pyannote.audio
Support for multiple Whisper model sizes (tiny to large-v3)
HTTP API for transcribing from URLs or file uploads
Cost-optimized deployment on RunPod Community Cloud GPUs

Quickstart

git clone https://github.com/davidbmar/whisperX-runpod.git
cd whisperX-runpod
./scripts/010-setup--configure-environment.sh
./scripts/100-build--docker-image.sh
./scripts/110-build--push-to-dockerhub.sh
./scripts/300-runpod--create-pod.sh

Architecture

flowchart TD
    A[Build Box / Local] -->|1. docker build| B(Docker Image)
    B -->|2. docker push| C[Docker Hub]
    A -->|3. AWS CLI| D[AWS EC2 GPU Instance]
    A -->|4. RunPod API| E[RunPod GPU Pod]
    C -->|Pull Image| D
    C -->|Pull Image| E
    D -->|Run Container| F[FastAPI Server :8000]
    E -->|Run Container| G[FastAPI Server :8000]
    F -->|Transcribe| H[(WhisperX + Pyannote)]
    G -->|Transcribe| H

How it's built

The project uses shell scripts for infrastructure orchestration (AWS CLI, RunPod API) and Python for the application logic. The core engine is built on `whisperx` (using the `faster-whisper` backend) and `pyannote.audio`. The API layer has two entry points: a FastAPI server (`handler_pod.py`) for pod deployments and a standard RunPod handler (`handler.py`) for serverless endpoints. Docker bundles the dependencies and pre-downloads the models into the image.

How it runs

sequenceDiagram
    participant Client
    participant API as FastAPI Handler
    participant Transcriber as WhisperXTranscriber
    participant Model as Whisper/Pyannote Models
    
    Client->>API: POST /transcribe {audio_url}
    API->>API: Download audio to temp file
    API->>Transcriber: transcribe(audio_path)
    Transcriber->>Model: Load Whisper model (cached)
    Model-->>Transcriber: Raw transcription
    Transcriber->>Model: Align with wav2vec2
    Model-->>Transcriber: Word-level timestamps
    alt Diarization Enabled
        Transcriber->>Model: Load Pyannote models
        Model-->>Transcriber: Speaker segments
        Transcriber->>Transcriber: Merge speakers with words
    end
    Transcriber-->>API: JSON Result
    API-->>Client: 200 OK {segments, speakers}
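
The "Merge speakers with words" step in the diagram can be illustrated with a small sketch. WhisperX has its own speaker-assignment logic; the version below is a simplified stand-in, not that implementation. It labels each word with the speaker whose diarization turn overlaps it the most. The function names and the dictionary keys (`word`, `start`, `end`, `speaker`) are assumptions made for this example.

```python
from typing import Any, Dict, List


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Dict[str, Any]], turns: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the words, each labeled with the best-overlapping speaker.

    A word that overlaps no speaker turn gets speaker=None.
    """
    labeled = []
    for word in words:
        best_speaker = None
        best_overlap = 0.0
        for turn in turns:
            shared = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labeled.append({**word, "speaker": best_speaker})
    return labeled


words = [
    {"word": "hello", "start": 0.0, "end": 0.5},
    {"word": "there", "start": 0.9, "end": 1.4},  # straddles the turn boundary
    {"word": "hi", "start": 3.0, "end": 3.2},     # outside every turn
]
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.0},
    {"speaker": "SPEAKER_01", "start": 1.0, "end": 2.0},
]
labeled_words = assign_speakers(words, turns)
```

In this example "there" straddles the boundary at 1.0 s and is assigned to `SPEAKER_01`, since more of the word falls inside that turn. The nested loop is quadratic in the worst case, which is fine for a sketch but would be worth replacing with a sorted sweep for long recordings.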

How to apply & reuse

Use this project when you need high-speed, accurate transcription with speaker identification and word-level timestamps, but want to avoid managing complex GPU environments manually. It is ideal for batch processing podcasts, meetings, or interviews where cost efficiency (via RunPod community cloud) and ease of deployment are priorities.
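
For batch use, a client only needs to post one request per audio file. The sketch below assumes the `/transcribe` endpoint accepts a JSON body with an `audio_url` field, as the sequence diagram suggests, and that the server listens on port 8000. The helper names, the timeout value, and the example URLs are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_transcribe_request(base_url: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST /transcribe request for one audio URL."""
    payload = json.dumps({"audio_url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/transcribe",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def transcribe_batch(base_url: str, audio_urls: List[str], timeout: float = 600.0) -> Dict[str, Any]:
    """Send the files one at a time and collect the parsed JSON responses."""
    results: Dict[str, Any] = {}
    for audio_url in audio_urls:
        request = build_transcribe_request(base_url, audio_url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            results[audio_url] = json.loads(response.read().decode("utf-8"))
    return results


# Example (requires a running server, so it is left commented out):
#   results = transcribe_batch("http://localhost:8000", ["https://example.com/episode1.mp3"])
```

Sending files sequentially keeps a single GPU pod from being oversubscribed; whether the server can handle concurrent requests is not stated here, so parallel submission would need to be checked against the actual deployment.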

At a glance

Capabilities: Batch audio transcription, Speaker diarization, Word-level alignment, REST API hosting, GPU cloud orchestration, Docker containerization

Components: Shell Scripts (Orchestration), Dockerfile, FastAPI Application, RunPod Handler, WhisperX Wrapper, Utility Modules

Tech: Python, Shell, Docker, FastAPI, WhisperX, Pyannote.audio, Faster-Whisper, AWS CLI, RunPod API

Depends on: Docker, AWS Account (for EC2 testing), RunPod Account, Hugging Face Token (for Pyannote), Docker Hub Account

Integrates with: RunPod Serverless/Pods, AWS EC2, Docker Hub, Hugging Face Hub

Patterns: Containerized Microservice, Infrastructure as Code (Scripted), API Gateway Pattern, Lazy Model Loading, Cloud Agnostic Deployment

Reuse tags: speech-to-text, gpu-deployment, runpod, aws-ec2, diarization, whisperx, fastapi