A serverless, GPU-accelerated pipeline that downloads YouTube videos, transcribes them with WhisperX, and scans for specific phrases.
https://github.com/davidbmar/youtube_transcriber_2 · public · shipped
A distributed system for processing YouTube video content at scale. It uses an SQS queue to manage jobs, AWS S3 for stateful job tracking and result storage, and GPU-enabled workers to perform high-accuracy transcription using WhisperX. The system is designed for resilience, featuring automatic job recovery, segment-level progress tracking, and fallback download mechanisms.
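The fallback download mechanism mentioned above can be pictured as an ordered chain of downloaders, where the next one is tried only when the previous one fails. The following is a minimal sketch of that idea, not the project's code: `download_with_fallback`, `AllDownloadersFailed`, and the two stand-in functions are hypothetical names, and the real backends would wrap the actual download libraries.

```python
from typing import Callable, Sequence

# Hypothetical type: a downloader takes a video URL and returns a local audio path.
Downloader = Callable[[str], str]


class AllDownloadersFailed(Exception):
    """Raised when every downloader in the chain has failed."""


def download_with_fallback(url: str, downloaders: Sequence[Downloader]) -> str:
    """Try each downloader in order and return the first successful result."""
    errors = []
    for download in downloaders:
        try:
            return download(url)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            name = getattr(download, "__name__", repr(download))
            errors.append(f"{name}: {exc}")
    raise AllDownloadersFailed("; ".join(errors))


# Stand-ins for the real backends (assumed order: primary first, fallback second).
def primary(url: str) -> str:
    raise RuntimeError("simulated primary failure")


def secondary(url: str) -> str:
    return "/tmp/audio.wav"


path = download_with_fallback(
    "https://www.youtube.com/watch?v=VIDEO_ID", [primary, secondary]
)
```

Collecting every error message before raising keeps the failure of the primary downloader visible even when the fallback also fails.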
git clone https://github.com/davidbmar/youtube_transcriber_2
cd youtube_transcriber_2
pip install -r requirements.txt
python worker.py --queue_url YOUR_SQS_QUEUE_URL --s3_bucket YOUR_S3_BUCKET
flowchart TD
User[User] -->|Enqueue URL| SQS[AWS SQS Queue]
SQS -->|Poll| Worker[Worker Instance]
Worker -->|Read/Write State| S3[(AWS S3 Bucket)]
Worker -->|Download Audio| YT[YouTube]
Worker -->|Transcribe| WhisperX[WhisperX Model]
Worker -->|Manage GPU| Lambda[AWS Lambda]
Lambda -->|API Call| RunPod[RunPod Service]
subgraph Storage
S3
end
subgraph Compute
Worker
WhisperX
end
The core logic is implemented in Python. It relies on `yt-dlp` (with `PyTubeFix` fallback) for downloading, `ffmpeg` for audio conversion, and `WhisperX` for transcription. State management is handled via a custom `JobTracker` class interacting with AWS S3. The architecture includes a main `worker.py` loop, auxiliary scripts for queue management, and optional AWS Lambda functions for managing RunPod GPU instances.
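To make the S3-backed state management more concrete, here is a rough sketch of how a tracker could persist job status and per-segment progress as one JSON object per job. Everything in it is an assumption for illustration: `SketchJobTracker` is not the project's `JobTracker`, the `jobs/<id>.json` key layout and the state fields are invented, and `InMemoryS3` is a stand-in that mimics only the `put_object` and `get_object` calls of a boto3 S3 client so the example runs without AWS access.

```python
import io
import json


class InMemoryS3:
    """Tiny stand-in exposing the two boto3 S3 client calls used below."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


class SketchJobTracker:
    """Hypothetical tracker; the project's JobTracker may look quite different."""

    def __init__(self, s3, bucket):
        self.s3 = s3
        self.bucket = bucket

    def _key(self, job_id):
        return f"jobs/{job_id}.json"  # assumed key layout

    def _save(self, job_id, state):
        body = json.dumps(state).encode("utf-8")
        self.s3.put_object(Bucket=self.bucket, Key=self._key(job_id), Body=body)

    def load(self, job_id):
        obj = self.s3.get_object(Bucket=self.bucket, Key=self._key(job_id))
        return json.loads(obj["Body"].read())

    def create(self, job_id, total_segments):
        self._save(job_id, {"status": "queued", "total": total_segments, "done": []})

    def start(self, job_id):
        state = self.load(job_id)
        state["status"] = "processing"
        self._save(job_id, state)

    def mark_segment_done(self, job_id, index):
        state = self.load(job_id)
        if index not in state["done"]:
            state["done"].append(index)
        if len(state["done"]) == state["total"]:
            state["status"] = "completed"
        self._save(job_id, state)

    def pending_segments(self, job_id):
        state = self.load(job_id)
        return [i for i in range(state["total"]) if i not in state["done"]]


tracker = SketchJobTracker(InMemoryS3(), "example-bucket")
tracker.create("job-1", total_segments=3)
tracker.start("job-1")
tracker.mark_segment_done("job-1", 0)
# A restarted worker would pick up only the unfinished segments.
print(tracker.pending_segments("job-1"))  # [1, 2]
```

Because progress is written after every segment, a worker that crashes mid-job loses at most the segment it was working on, which is one plausible way to get the recovery behavior described above.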
sequenceDiagram
participant U as User
participant Q as SQS Queue
participant W as Worker
participant S3 as S3 Storage
participant DL as Downloader
participant TX as Transcriber
participant SC as Scanner
U->>Q: Send YouTube URL
loop Polling Interval
W->>Q: Receive Message
W->>S3: Create Job (queued -> processing)
W->>DL: Download Audio (yt-dlp/PyTubeFix)
DL-->>W: Return Audio File
W->>S3: Update Progress (segment level)
W->>TX: Transcribe Segments (WhisperX)
TX-->>W: Return Transcript Data
W->>SC: Scan for Phrases
SC-->>W: Return Results
W->>S3: Save Results & Mark Completed
end
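The final scanning step in the sequence above can be illustrated with a small case-insensitive substring search over transcript segments. This is only a sketch under stated assumptions: `scan_segments` is a hypothetical helper, and the segment shape (dicts with `start`, `end`, and `text` keys) is assumed for illustration rather than taken from the project.

```python
def scan_segments(segments, phrases):
    """Return one hit per (segment, phrase) match, ignoring case."""
    hits = []
    needles = [(phrase, phrase.lower()) for phrase in phrases]
    for seg in segments:
        text = seg["text"].lower()
        for phrase, needle in needles:
            if needle in text:
                hits.append(
                    {
                        "phrase": phrase,
                        "start": seg["start"],
                        "end": seg["end"],
                        "text": seg["text"].strip(),
                    }
                )
    return hits


# Made-up transcript data, purely for demonstration.
segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome back to the channel."},
    {"start": 4.2, "end": 9.8, "text": " Today we look at Machine Learning basics."},
]
hits = scan_segments(segments, ["machine learning", "quantum"])
```

A per-segment search like this would miss a phrase that straddles two segments; joining adjacent segments before matching is one possible way to handle that case.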
Deploy the worker script on a GPU-enabled instance (local or cloud). Configure AWS credentials with S3 and SQS permissions. Set environment variables for the target S3 bucket and SQS queue URL. Run the worker to start polling for jobs, and use the helper script to enqueue YouTube URLs for processing.
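As a rough illustration of the enqueue step, the sketch below sends one URL to a queue through a client object that exposes `send_message(QueueUrl=..., MessageBody=...)`, which is the call a boto3 SQS client provides. Several details are assumptions and not taken from the repository: the `enqueue_url` helper, the `SQS_QUEUE_URL` environment variable name, and the JSON message shape (the worker may expect a different body, such as the bare URL). `RecordingClient` is a stand-in so the example runs without AWS credentials.

```python
import json
import os


def enqueue_url(sqs_client, queue_url, youtube_url):
    """Send one YouTube URL to the queue as a small JSON message (assumed format)."""
    body = json.dumps({"youtube_url": youtube_url})
    return sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)


class RecordingClient:
    """Stand-in for a boto3 SQS client; it only records what would be sent."""

    def __init__(self):
        self.sent = []

    def send_message(self, QueueUrl, MessageBody):
        self.sent.append((QueueUrl, MessageBody))
        return {"MessageId": str(len(self.sent))}


# SQS_QUEUE_URL is a hypothetical variable name; the default is a placeholder.
queue_url = os.environ.get("SQS_QUEUE_URL", "https://sqs.example.invalid/queue")
client = RecordingClient()  # with AWS access this could be boto3.client("sqs")
response = enqueue_url(client, queue_url, "https://www.youtube.com/watch?v=VIDEO_ID")
```

Passing the client in as an argument keeps the helper testable offline and leaves credential handling to whatever creates the real client.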