whisperX-runpod

Scripts and handlers for deploying WhisperX with speaker diarization on RunPod or AWS EC2 GPU instances.

https://github.com/davidbmar/whisperX-runpod  ·  public  ·  shipped

What it is

A deployment toolkit that packages WhisperX (fast automatic speech recognition) with pyannote (speaker diarization) into a Docker container. It provides shell scripts to build the image, push it to Docker Hub, and deploy it either to AWS EC2 for testing or to RunPod for cost-effective production transcription over an HTTP API.

Quickstart

git clone https://github.com/davidbmar/whisperX-runpod.git
cd whisperX-runpod
./scripts/010-setup--configure-environment.sh
./scripts/100-build--docker-image.sh
./scripts/110-build--push-to-dockerhub.sh
./scripts/300-runpod--create-pod.sh

Architecture

flowchart TD
    A[Build Box / Local] -->|1. docker build| B(Docker Image)
    B -->|2. docker push| C[Docker Hub]
    A -->|3. AWS CLI| D[AWS EC2 GPU Instance]
    A -->|4. RunPod API| E[RunPod GPU Pod]
    C -->|Pull Image| D
    C -->|Pull Image| E
    D -->|Run Container| F[FastAPI Server :8000]
    E -->|Run Container| G[FastAPI Server :8000]
    F -->|Transcribe| H[(WhisperX + Pyannote)]
    G -->|Transcribe| H

How it's built

The project uses shell scripts for infrastructure orchestration (AWS CLI, RunPod API) and Python for the application logic. The core engine is built on `whisperx` (with the `faster-whisper` backend) and `pyannote.audio`. The API layer has two entry points: a FastAPI application (`handler_pod.py`) for pod deployments and a standard RunPod handler (`handler.py`) for serverless endpoints. Docker bundles the dependencies and pre-downloads the models into the image.
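
To make the serverless entry point more concrete, here is a minimal sketch of what a RunPod-style handler can look like. It is not the repository's `handler.py`: `fake_transcribe` and `make_handler` are hypothetical names, and the `audio_url` and `diarize` input fields are assumptions (only `audio_url` appears elsewhere in this page, in the pod API). The parts taken from RunPod's SDK are the `job["input"]` payload and `runpod.serverless.start`.

```python
from typing import Any, Callable, Dict


def fake_transcribe(audio_url: str, diarize: bool) -> Dict[str, Any]:
    """Hypothetical stand-in for the real WhisperX + pyannote transcriber."""
    return {"segments": [], "speakers": [], "source": audio_url, "diarized": diarize}


def make_handler(transcribe: Callable[[str, bool], Dict[str, Any]]):
    """Wrap a transcribe function in a RunPod-style job handler."""

    def handler(job: Dict[str, Any]) -> Dict[str, Any]:
        # RunPod serverless delivers the request payload under job["input"].
        job_input = job.get("input") or {}
        audio_url = job_input.get("audio_url")  # assumed field name
        if not audio_url:
            # Returning a dict with an "error" key marks the job as failed.
            return {"error": "audio_url is required"}
        diarize = bool(job_input.get("diarize", True))  # assumed option
        return transcribe(audio_url, diarize)

    return handler


handler = make_handler(fake_transcribe)


def main() -> None:
    # Call this inside the container, where the runpod SDK is installed.
    import runpod

    runpod.serverless.start({"handler": handler})
```

Passing the transcriber in as an argument keeps the handler testable on a machine without a GPU, since the model-backed function can be swapped for a stub.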

How it runs

sequenceDiagram
    participant Client
    participant API as FastAPI Handler
    participant Transcriber as WhisperXTranscriber
    participant Model as Whisper/Pyannote Models
    
    Client->>API: POST /transcribe {audio_url}
    API->>API: Download audio to temp file
    API->>Transcriber: transcribe(audio_path)
    Transcriber->>Model: Load Whisper model (cached)
    Model-->>Transcriber: Raw transcription
    Transcriber->>Model: Align with wav2vec2
    Model-->>Transcriber: Word-level timestamps
    alt Diarization Enabled
        Transcriber->>Model: Load Pyannote models
        Model-->>Transcriber: Speaker segments
        Transcriber->>Transcriber: Merge speakers with words
    end
    Transcriber-->>API: JSON Result
    API-->>Client: 200 OK {segments, speakers}
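
The "Merge speakers with words" step in the diagram can be illustrated with a small, self-contained sketch. It uses a maximum-overlap rule: each word gets the speaker whose diarization turn shares the most time with it. This is one common way to do the merge and is shown for illustration only; the library's own implementation may differ, and the function names and dictionary fields below are assumptions.

```python
from typing import Dict, List, Optional


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Dict], turns: List[Dict]) -> List[Dict]:
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for word in words:
        best: Optional[str] = None
        best_overlap = 0.0
        for turn in turns:
            shared = overlap(word["start"], word["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best, best_overlap = turn["speaker"], shared
        # Words that overlap no turn keep speaker=None.
        labeled.append({**word, "speaker": best})
    return labeled


# Illustrative data: word timestamps from alignment, turns from diarization.
words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "there", "start": 0.5, "end": 0.9},
    {"word": "hi", "start": 1.2, "end": 1.5},
]
turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.0},
    {"speaker": "SPEAKER_01", "start": 1.0, "end": 2.0},
]
labeled = assign_speakers(words, turns)
```

The nested loop is quadratic in the worst case, which is fine for a sketch; long recordings would benefit from sorting the turns and scanning them once.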

How to apply & reuse

Use this project when you need high-speed, accurate transcription with speaker identification and word-level timestamps, but want to avoid managing complex GPU environments manually. It is ideal for batch processing podcasts, meetings, or interviews where cost efficiency (via RunPod community cloud) and ease of deployment are priorities.
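
For the batch use case described above, a client could submit recordings to a running pod one at a time. The sketch below uses only the Python standard library. It assumes the pod accepts a JSON body of the form `{"audio_url": ...}` on `POST /transcribe` and returns JSON, as the sequence diagram suggests; the base URL, the timeout, and the function names are placeholders, not values from the repository.

```python
import json
import urllib.request
from typing import Iterable, List

BASE_URL = "http://localhost:8000"  # assumption: replace with your pod's address


def build_request(base_url: str, audio_url: str) -> urllib.request.Request:
    """Build the POST /transcribe request for one audio file."""
    body = json.dumps({"audio_url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/transcribe",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def transcribe_all(
    base_url: str, audio_urls: Iterable[str], timeout: float = 600.0
) -> List[dict]:
    """Send each audio URL to the pod in turn and collect the JSON results."""
    results = []
    for audio_url in audio_urls:
        request = build_request(base_url, audio_url)
        # A generous timeout, since long recordings take a while to process.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            results.append(json.load(response))
    return results
```

A call such as `transcribe_all(BASE_URL, ["https://example.com/episode-1.mp3"])` would then return one result dictionary per recording, provided a pod is reachable at `BASE_URL`.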

At a glance

Capabilities: Batch audio transcription, Speaker diarization, Word-level alignment, REST API hosting, GPU cloud orchestration, Docker containerization
Components: Shell Scripts (Orchestration), Dockerfile, FastAPI Application, RunPod Handler, WhisperX Wrapper, Utility Modules
Tech: Python, Shell, Docker, FastAPI, WhisperX, Pyannote.audio, Faster-Whisper, AWS CLI, RunPod API
Depends on: Docker, AWS Account (for EC2 testing), RunPod Account, Hugging Face Token (for Pyannote), Docker Hub Account
Integrates with: RunPod Serverless/Pods, AWS EC2, Docker Hub, Hugging Face Hub
Patterns: Containerized Microservice, Infrastructure as Code (Scripted), API Gateway Pattern, Lazy Model Loading, Cloud Agnostic Deployment
Reuse tags: speech-to-text, gpu-deployment, runpod, aws-ec2, diarization, whisperx, fastapi

Repo hygiene

✓ all on main — nothing unmerged.