YouTube Phrase Scanner

Serverless pipeline for downloading, transcribing, and scanning YouTube videos for specific phrases using WhisperX and AWS infrastructure.

https://github.com/davidbmar/youtube_transcriber_3  ·  public  ·  shipped

What it is

A distributed system that processes YouTube video URLs from an SQS queue. For each URL it downloads the audio, transcribes it with WhisperX on GPU instances provisioned through RunPod, scans the transcript for user-defined phrases, and stores the results in S3. A Lambda-based controller manages the lifecycle of the ephemeral GPU pods.
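
The phrase-scanning step can be illustrated with a minimal sketch. It assumes the transcript arrives as a list of segments with `start`, `end`, and `text` keys (the shape WhisperX uses for its segments); `PhraseHit` and `scan_segments` are hypothetical names for illustration and are not taken from the repo's `scanner.py`.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PhraseHit:
    """One occurrence of a phrase inside a transcript segment (hypothetical type)."""
    phrase: str
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # full text of the matching segment


def scan_segments(segments, phrases):
    """Return one PhraseHit per (segment, phrase) pair in which the phrase occurs.

    Matching is case-insensitive and anchored on whole words, so "gpu"
    does not match "GPUs".
    """
    patterns = [
        (p, re.compile(r"(?<!\w)" + re.escape(p) + r"(?!\w)", re.IGNORECASE))
        for p in phrases
    ]
    hits = []
    for seg in segments:
        for phrase, pattern in patterns:
            if pattern.search(seg["text"]):
                hits.append(
                    PhraseHit(phrase, seg["start"], seg["end"], seg["text"].strip())
                )
    return hits
```

A phrase that straddles two segments is missed by this sketch; joining neighbouring segments before matching would be one way to cover that case.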

Features

Quickstart

pip install -r docker/requirements.txt
export RUNPOD_API_KEY=your_api_key
python run.py

Architecture

flowchart TD
    A[Client] -->|Push URL| B(AWS SQS Queue)
    B -->|Trigger| C[Worker Container]
    C -->|Download Audio| D[YouTube]
    C -->|Request GPU| E[AWS Lambda]
    E -->|Manage Lifecycle| F[RunPod API]
    F -->|Provision Pod| G[GPU Instance]
    C -->|Send Audio| G
    G -->|Run WhisperX| H[Transcription]
    H -->|Scan Phrases| I[Scanner Module]
    I -->|Store Results| J(AWS S3 Bucket)

How it's built

Python-based worker architecture containerized with Docker. Uses `yt-dlp` for downloading, `WhisperX` for transcription, and `boto3` for AWS integration (SQS/S3). GPU compute is abstracted via RunPod, controlled by an AWS Lambda function that handles pod lifecycle events.
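
As a rough illustration of the download step, the sketch below shows how `yt-dlp` can be driven from Python to keep only the audio track. The function names, the output layout, and the choice of WAV are assumptions for illustration; the repo's `downloader.py` may be organized differently. The audio conversion needs `ffmpeg` on the PATH.

```python
from pathlib import Path


def build_ydl_options(output_dir: str, codec: str = "wav") -> dict:
    """Build yt-dlp options that download the best audio stream and convert it."""
    return {
        "format": "bestaudio/best",
        # Name the file after the video id so repeated jobs do not collide.
        "outtmpl": str(Path(output_dir) / "%(id)s.%(ext)s"),
        "noplaylist": True,
        "quiet": True,
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": codec}
        ],
    }


def download_audio(url: str, output_dir: str, codec: str = "wav") -> Path:
    """Download one video's audio and return the path of the converted file."""
    # Imported lazily so the option builder can be used without yt-dlp installed.
    import yt_dlp

    with yt_dlp.YoutubeDL(build_ydl_options(output_dir, codec)) as ydl:
        info = ydl.extract_info(url, download=True)
    # Assumption: the converted file's extension equals the codec name,
    # which holds for "wav" and "mp3" but not for every codec.
    return Path(output_dir) / f"{info['id']}.{codec}"
```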

How it runs

sequenceDiagram
    participant Client
    participant SQS
    participant Worker
    participant Lambda
    participant RunPod
    participant S3
    
    Client->>SQS: Send Video URL
    SQS->>Worker: Trigger Job
    Worker->>Lambda: Request GPU Resource
    Lambda->>RunPod: Create/Get Pod
    RunPod-->>Lambda: Pod Endpoint
    Lambda-->>Worker: Return Endpoint
    Worker->>Worker: Download Audio (yt-dlp)
    Worker->>RunPod: Upload Audio & Start Transcription
    RunPod->>RunPod: Process with WhisperX
    RunPod-->>Worker: Return Transcript
    Worker->>Worker: Scan for Phrases
    Worker->>S3: Save Results
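
The sequence above can be sketched as a single job handler. This is a simplified outline under stated assumptions, not the repo's `worker.py`: the message body is assumed to be JSON with a `url` field, and every collaborator (`get_endpoint`, `download`, `transcribe`, `scan`, `save`) is a hypothetical callable passed in by the caller, so the order of steps can be exercised without AWS or RunPod access.

```python
import json


def process_message(body, *, get_endpoint, download, transcribe, scan, save):
    """Run one queue message through the steps shown in the sequence diagram."""
    job = json.loads(body)
    url = job["url"]  # assumed message shape: {"url": "..."}

    endpoint = get_endpoint()                    # Worker -> Lambda -> RunPod
    audio_path = download(url)                   # Worker downloads audio (yt-dlp)
    segments = transcribe(endpoint, audio_path)  # GPU pod runs WhisperX
    hits = scan(segments)                        # Worker scans for phrases

    result = {"url": url, "segment_count": len(segments), "hits": hits}
    save(result)                                 # Worker saves results to S3
    return result
```

In a real worker, `save` would typically wrap a `boto3` S3 `put_object` call and the handler would be invoked once per message received from SQS.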

How to apply & reuse

Deploy the Lambda function to manage RunPod credentials and permissions. Configure the SQS queue to trigger the worker container. Push YouTube URLs to the queue to initiate processing. Results are written to S3 buckets for downstream analysis.
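
Pushing a URL to the queue might look like the following sketch. The message schema (`url` plus `phrases`) and both function names are assumptions for illustration; check the worker's expected message format before reusing it. `send_message` is the standard `boto3` SQS client call, and AWS credentials are expected to come from the environment.

```python
import json


def build_job_message(video_url: str, phrases: list) -> str:
    """Serialize one job as JSON (assumed schema: url + phrases)."""
    return json.dumps({"url": video_url, "phrases": list(phrases)})


def enqueue_video(queue_url: str, video_url: str, phrases: list) -> str:
    """Send one job to SQS and return the message id assigned by SQS."""
    # Imported lazily so build_job_message can be used without boto3 installed.
    import boto3

    sqs = boto3.client("sqs")
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=build_job_message(video_url, phrases),
    )
    return response["MessageId"]
```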

At a glance

Capabilities: Audio Extraction, Speech-to-Text, Pattern Matching, Cloud Orchestration, GPU Management
Components: worker.py, downloader.py, transcriber.py, scanner.py, runpod_manager.py, lambda_handler.py
Tech: Python, WhisperX, Docker, AWS Lambda, RunPod, yt-dlp
Depends on: boto3, runpod, torch, whisperx, yt-dlp
Integrates with: AWS SQS, AWS S3, YouTube, RunPod API
Patterns: Queue-Based Load Leveling, Serverless Compute, Ephemeral Infrastructure, Worker Pattern
Reuse tags: media-processing, ai-transcription, aws-serverless, gpu-orchestration, content-analysis

Repo hygiene

✓ all on main — nothing unmerged.