Scripts and handlers for deploying WhisperX with speaker diarization on RunPod or AWS EC2 GPU instances.
https://github.com/davidbmar/whisperX-runpod · public · shipped
A deployment toolkit that packages WhisperX (fast automatic speech recognition) with pyannote (speaker diarization) into a Docker container. It provides shell scripts to build the image, push it to Docker Hub, and deploy it either to AWS EC2 for testing or to RunPod for cost-effective production transcription over an HTTP API.
git clone https://github.com/davidbmar/whisperX-runpod.git
cd whisperX-runpod
./scripts/010-setup--configure-environment.sh
./scripts/100-build--docker-image.sh
./scripts/110-build--push-to-dockerhub.sh
./scripts/300-runpod--create-pod.sh
flowchart TD
A[Build Box / Local] -->|1. docker build| B(Docker Image)
B -->|2. docker push| C[Docker Hub]
A -->|3. AWS CLI| D[AWS EC2 GPU Instance]
A -->|4. RunPod API| E[RunPod GPU Pod]
C -->|Pull Image| D
C -->|Pull Image| E
D -->|Run Container| F[FastAPI Server :8000]
E -->|Run Container| G[FastAPI Server :8000]
F -->|Transcribe| H[(WhisperX + Pyannote)]
G -->|Transcribe| H
The project uses shell scripts for infrastructure orchestration (AWS CLI, RunPod API) and Python for the application logic. The core engine is built on `whisperx` (using the `faster-whisper` backend) and `pyannote.audio`. The API layer is implemented with FastAPI (`handler_pod.py`) for pod deployments and a standard RunPod handler (`handler.py`) for serverless endpoints. Docker bundles the dependencies and pre-downloads the models into the image.
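For pod deployments, a client reaches the FastAPI server over plain HTTP. The following is a minimal client sketch, not code from the repository. It assumes the server listens on port 8000 and accepts `POST /transcribe` with a JSON body containing an `audio_url` field, as the diagrams in this document indicate. The function names `build_transcribe_request` and `transcribe`, the timeout value, and the exact response shape are hypothetical.

```python
import json
import urllib.request


def build_transcribe_request(base_url: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST /transcribe request.

    Assumption: the endpoint path and the `audio_url` field name are taken
    from the diagrams in this document, not from the actual handler code.
    """
    body = json.dumps({"audio_url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/transcribe",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def transcribe(base_url: str, audio_url: str, timeout: float = 600.0) -> dict:
    """Send the request and decode the JSON response body.

    The generous default timeout is an illustrative choice for long audio.
    """
    request = build_transcribe_request(base_url, audio_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call might look like `transcribe("http://<pod-host>:8000", "https://example.com/meeting.wav")`, where `<pod-host>` stands for whatever address the pod or EC2 instance exposes. Keeping request construction separate from sending makes the payload easy to inspect without a running server.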
sequenceDiagram
participant Client
participant API as FastAPI Handler
participant Transcriber as WhisperXTranscriber
participant Model as Whisper/Pyannote Models
Client->>API: POST /transcribe {audio_url}
API->>API: Download audio to temp file
API->>Transcriber: transcribe(audio_path)
Transcriber->>Model: Load Whisper model (cached)
Model-->>Transcriber: Raw transcription
Transcriber->>Model: Align with wav2vec2
Model-->>Transcriber: Word-level timestamps
alt Diarization Enabled
Transcriber->>Model: Load Pyannote models
Model-->>Transcriber: Speaker segments
Transcriber->>Transcriber: Merge speakers with words
end
Transcriber-->>API: JSON Result
API-->>Client: 200 OK {segments, speakers}
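The "Merge speakers with words" step in the diagram above can be pictured as an overlap match between word timestamps and diarization turns. The sketch below is one plausible way to do it, assuming each word and each speaker turn carries `start` and `end` times in seconds. It is not necessarily how WhisperX or this project implements the merge, and the function name `assign_speakers`, the dictionary keys, and the speaker labels are illustrative.

```python
from collections import defaultdict


def assign_speakers(words, speaker_turns):
    """Label each word with the speaker whose turns overlap it the most.

    words:         list of dicts with "word", "start", "end" (seconds)
    speaker_turns: list of dicts with "speaker", "start", "end" (seconds)
    Words that overlap no turn get speaker None.
    """
    labeled = []
    for word in words:
        overlap_by_speaker = defaultdict(float)
        for turn in speaker_turns:
            # Length of the intersection of the two time intervals.
            overlap = min(word["end"], turn["end"]) - max(word["start"], turn["start"])
            if overlap > 0:
                overlap_by_speaker[turn["speaker"]] += overlap
        if overlap_by_speaker:
            speaker = max(overlap_by_speaker, key=overlap_by_speaker.get)
        else:
            speaker = None
        labeled.append({**word, "speaker": speaker})
    return labeled
```

Summing overlap per speaker, rather than taking the first matching turn, keeps a word that straddles a speaker change attached to whoever was talking for most of its duration.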
Use this project when you need high-speed, accurate transcription with speaker identification and word-level timestamps, but want to avoid managing complex GPU environments manually. It is ideal for batch processing podcasts, meetings, or interviews where cost efficiency (via RunPod community cloud) and ease of deployment are priorities.
✓ all on main — nothing unmerged.