YouTube Phrase Scanner · davidbmar.com

What it is

A distributed system for processing YouTube video content at scale. It uses an SQS queue to manage jobs, AWS S3 for stateful job tracking and result storage, and GPU-enabled workers to perform high-accuracy transcription using WhisperX. The system is designed for resilience, featuring automatic job recovery, segment-level progress tracking, and fallback download mechanisms.

Features

Queue-based processing using AWS SQS for scalable job management
GPU-accelerated transcription using WhisperX with word-level timestamps
Resilient download strategy with yt-dlp and PyTubeFix fallbacks
Self-healing job recovery with segment-level progress tracking in S3
Automated phrase scanning and statistical analysis of transcripts
Support for CPU-only mode and containerized deployment via Docker
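
As a rough illustration of how phrase scanning over word-level timestamps can work, here is a minimal sketch. It assumes each transcribed word is a dict with `word`, `start`, and `end` keys; the function names `find_phrase` and `phrase_counts` are hypothetical and are not taken from the project's `scanner.py`.

```python
import re
from typing import Dict, List, Tuple

# Assumed word shape: {"word": "Hello,", "start": 0.0, "end": 0.4}
Word = Dict[str, object]


def _norm(token: str) -> str:
    """Lowercase a token and strip punctuation so "Hello," matches "hello"."""
    return re.sub(r"[^\w']+", "", token.lower())


def find_phrase(words: List[Word], phrase: str) -> List[Tuple[float, float]]:
    """Return (start, end) times for every occurrence of phrase in words."""
    target = [t for t in (_norm(p) for p in phrase.split()) if t]
    tokens = [_norm(str(w["word"])) for w in words]
    hits: List[Tuple[float, float]] = []
    n = len(target)
    if n == 0:
        return hits
    # Slide a window of n tokens across the transcript.
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            hits.append((float(words[i]["start"]), float(words[i + n - 1]["end"])))
    return hits


def phrase_counts(words: List[Word], phrases: List[str]) -> Dict[str, int]:
    """Count occurrences of each phrase (a simple occurrence statistic)."""
    return {p: len(find_phrase(words, p)) for p in phrases}
```

A real transcript may contain words without timestamps (for example, tokens the aligner could not place), so a production scanner would likely need to handle missing `start` or `end` values.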

Quickstart

git clone https://github.com/davidbmar/youtube_transcriber_2
cd youtube_transcriber_2
pip install -r requirements.txt
python worker.py --queue_url YOUR_SQS_QUEUE_URL --s3_bucket YOUR_S3_BUCKET

Architecture

flowchart TD
    User[User] -->|Enqueue URL| SQS[AWS SQS Queue]
    SQS -->|Poll| Worker[Worker Instance]
    Worker -->|Read/Write State| S3[(AWS S3 Bucket)]
    Worker -->|Download Audio| YT[YouTube]
    Worker -->|Transcribe| WhisperX[WhisperX Model]
    Worker -->|Manage GPU| Lambda[AWS Lambda]
    Lambda -->|API Call| RunPod[RunPod Service]
    subgraph Storage
        S3
    end
    subgraph Compute
        Worker
        WhisperX
    end

How it's built

The core logic is implemented in Python. It relies on `yt-dlp` (with `PyTubeFix` fallback) for downloading, `ffmpeg` for audio conversion, and `WhisperX` for transcription. State management is handled via a custom `JobTracker` class interacting with AWS S3. The architecture includes a main `worker.py` loop, auxiliary scripts for queue management, and optional AWS Lambda functions for managing RunPod GPU instances.
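
The download fallback described above can be sketched as an ordered list of methods tried in turn. This is an illustration only: `download_with_fallback` and `DownloadError` are hypothetical names, and the callables stand in for wrappers around `yt-dlp` and `PyTubeFix` that the project's `downloader.py` would supply.

```python
from typing import Callable, Iterable, List, Tuple


class DownloadError(Exception):
    """Raised when every download method has failed."""


def download_with_fallback(
    url: str,
    methods: Iterable[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, downloader) pair in order.

    Each downloader takes a URL and returns the path of the saved audio
    file. Returns (method_name, audio_path) for the first one that succeeds.
    """
    errors: List[str] = []
    for name, download in methods:
        try:
            return name, download(url)
        except Exception as exc:  # any failure moves on to the next method
            errors.append(f"{name}: {exc}")
    raise DownloadError("all download methods failed: " + "; ".join(errors))
```

Collecting the per-method errors keeps the final exception useful for debugging when every method fails.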

How it runs

sequenceDiagram
    participant U as User
    participant Q as SQS Queue
    participant W as Worker
    participant S3 as S3 Storage
    participant DL as Downloader
    participant TX as Transcriber
    participant SC as Scanner

    U->>Q: Send YouTube URL
    loop Polling Interval
        W->>Q: Receive Message
        W->>S3: Create Job (queued -> processing)
        W->>DL: Download Audio (yt-dlp/PyTubeFix)
        DL-->>W: Return Audio File
        W->>S3: Update Progress (segment level)
        W->>TX: Transcribe Segments (WhisperX)
        TX-->>W: Return Transcript Data
        W->>SC: Scan for Phrases
        SC-->>W: Return Results
        W->>S3: Save Results & Mark Completed
    end
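
The segment-level progress updates in the flow above are what make a job resumable after an interruption. The sketch below shows the idea with an in-memory stand-in for S3. The key layout `jobs/<id>.json`, the state fields, and all function names are assumptions for illustration, not the project's actual `JobTracker` API.

```python
import json
from typing import Callable, Dict, List, Optional


class InMemoryStore:
    """Stand-in for an S3 bucket: maps an object key to a JSON string."""

    def __init__(self) -> None:
        self._objects: Dict[str, str] = {}

    def put(self, key: str, body: str) -> None:
        self._objects[key] = body

    def get(self, key: str) -> Optional[str]:
        return self._objects.get(key)


def load_state(store: InMemoryStore, job_id: str) -> Dict:
    """Load saved job state, or start a fresh queued job."""
    raw = store.get(f"jobs/{job_id}.json")
    if raw is None:
        return {"status": "queued", "done_segments": {}}
    return json.loads(raw)


def save_state(store: InMemoryStore, job_id: str, state: Dict) -> None:
    store.put(f"jobs/{job_id}.json", json.dumps(state))


def process_job(
    store: InMemoryStore,
    job_id: str,
    segments: List[str],
    transcribe: Callable[[str], str],
) -> Dict:
    """Transcribe segments, persisting progress after each one."""
    state = load_state(store, job_id)
    state["status"] = "processing"
    save_state(store, job_id, state)
    for index, segment in enumerate(segments):
        key = str(index)  # JSON object keys are strings
        if key in state["done_segments"]:
            continue  # finished before an earlier interruption; skip
        state["done_segments"][key] = transcribe(segment)
        save_state(store, job_id, state)
    state["status"] = "completed"
    save_state(store, job_id, state)
    return state
```

If a worker dies mid-job, a second call to `process_job` with the same `job_id` picks up at the first segment that was not saved, so completed segments are not transcribed again.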

How to apply & reuse

Deploy the worker script on a GPU-enabled instance (local or cloud). Configure AWS credentials with S3 and SQS permissions. Set environment variables for the target S3 bucket and SQS queue URL. Run the worker to start polling for jobs, and use the helper script to enqueue YouTube URLs for processing.
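
For orientation, enqueueing a URL could look roughly like the sketch below. The message format (a JSON body with a `youtube_url` field) and the function names are assumptions; check the project's `send_to_queue.py` for the format the worker actually expects. The only AWS call used is the Boto3 SQS client's `send_message`.

```python
import json
from urllib.parse import urlparse

# Hosts accepted as YouTube links in this sketch (an assumption).
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def build_message(video_url: str) -> str:
    """Validate the URL host and wrap it in a JSON message body."""
    host = (urlparse(video_url).hostname or "").lower()
    if host not in YOUTUBE_HOSTS:
        raise ValueError(f"not a YouTube URL: {video_url!r}")
    return json.dumps({"youtube_url": video_url})


def enqueue_url(sqs_client, queue_url: str, video_url: str) -> str:
    """Send one job to the queue and return the SQS message ID.

    sqs_client is expected to behave like boto3.client("sqs").
    """
    response = sqs_client.send_message(
        QueueUrl=queue_url,
        MessageBody=build_message(video_url),
    )
    return response["MessageId"]


# Example usage (requires boto3 and configured AWS credentials):
#   import boto3
#   sqs = boto3.client("sqs")
#   enqueue_url(sqs, "YOUR_SQS_QUEUE_URL", "https://www.youtube.com/watch?v=VIDEO_ID")
```

Passing the client in as an argument keeps the function easy to test with a fake client and leaves credential handling to the caller.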

At a glance

Capabilities: Batch video processing, Phrase occurrence statistics, Segment-level resumability, Spot instance tolerance, Multi-method video downloading, JSON result export

Components: worker.py, job_tracker.py, downloader.py, transcriber.py, scanner.py, send_to_queue.py, runpod_manager.py

Tech: Python 3.7+, WhisperX, yt-dlp, ffmpeg, AWS SQS, AWS S3, Boto3, Docker

Depends on: AWS Account, GPU (optional but recommended), FFmpeg binary, RunPod API Key (optional)

Integrates with: AWS S3, AWS SQS, AWS Lambda, RunPod, YouTube

Patterns: Worker Queue, State Machine, Fallback Strategy, Serverless Compute, Event-Driven Architecture

Reuse tags: video-processing, transcription, aws-serverless, gpu-computing, data-pipeline, content-analysis