Scalable AWS-based audio transcription system using SQS queues, EC2 Spot instances, and WhisperX/Voxtral models.
https://github.com/davidbmar/transcription-sqs-spot-s3 · public · shipped
A production-ready infrastructure-as-code solution for transcribing audio files stored in S3. It uses an SQS queue to manage job distribution to EC2 Spot instances (GPU or CPU) running containerized or native workers. The system supports automatic scaling, cost optimization via Spot instances, and dead-letter queue handling for failed jobs.
git clone https://github.com/davidbmar/transcription-sqs-spot-s3.git
cd transcription-sqs-spot-s3
./scripts/step-000-setup-configuration.sh
./scripts/step-010-setup-iam-permissions.sh
./scripts/step-020-create-sqs-resources.sh
./scripts/step-060-choose-deployment-path.sh
flowchart TD
User[User/Application] -->|Upload Audio| S3In[(S3 Input Bucket)]
User -->|Send Job Metadata| SQS[SQS Queue]
SQS -->|Poll Jobs| Worker[EC2 Spot Worker]
Worker -->|Download Audio| S3In
Worker -->|Process| Model[WhisperX/Voxtral Model]
Worker -->|Upload Transcript| S3Out[(S3 Output Bucket)]
SQS -->|Failed Messages| DLQ[SQS Dead Letter Queue]
subgraph AWS Cloud
S3In
SQS
Worker
S3Out
DLQ
end
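The client side of the flowchart (upload the audio, then send job metadata) could be sketched as follows. This is a minimal illustration, not the repository's actual client: the message fields (`job_id`, `s3_input_bucket`, `s3_input_key`) and the function names are hypothetical, and the `s3` and `sqs` arguments are assumed to be boto3 clients or anything exposing the same `upload_file` and `send_message` methods.

```python
import json
import uuid


def build_job_message(bucket, key, job_id=None):
    """Serialize job metadata for the SQS message body.

    The field names are an assumed schema for illustration only.
    """
    return json.dumps({
        "job_id": job_id or str(uuid.uuid4()),
        "s3_input_bucket": bucket,
        "s3_input_key": key,
    })


def submit_job(s3, sqs, queue_url, bucket, local_path, key, job_id=None):
    """Upload the audio file, then enqueue a job that points at it."""
    # Upload first so a worker never receives a job whose audio is missing.
    s3.upload_file(local_path, bucket, key)
    body = build_job_message(bucket, key, job_id)
    response = sqs.send_message(QueueUrl=queue_url, MessageBody=body)
    return response["MessageId"]
```

With boto3 installed, the clients would typically come from `boto3.client("s3")` and `boto3.client("sqs")`. Passing them in as arguments keeps the function easy to exercise with stand-ins during testing.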
The system is orchestrated via Bash scripts that configure AWS resources (IAM, SQS, ECR) and deploy workers. Workers are Python applications using FastAPI, PyTorch, and Hugging Face Transformers (Whisper or Voxtral). Deployment supports two paths: traditional EC2 user-data installation or Docker containers pushed to Amazon ECR.
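To make the dead-letter queue wiring concrete, here is a hedged sketch of the queue attributes involved. It is not taken from the setup scripts: the function name and the default values are assumptions, while `RedrivePolicy`, `deadLetterTargetArn`, `maxReceiveCount`, and `VisibilityTimeout` are standard SQS attribute names.

```python
import json

# SQS caps the visibility timeout at 12 hours.
MAX_VISIBILITY_TIMEOUT_SECONDS = 43200


def queue_attributes(dlq_arn, max_receive_count=3, visibility_timeout=900):
    """Build SQS attributes that route repeatedly failing jobs to a DLQ.

    Defaults are illustrative assumptions, not the project's settings.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    if not 0 <= visibility_timeout <= MAX_VISIBILITY_TIMEOUT_SECONDS:
        raise ValueError("visibility_timeout must be between 0 and 43200 seconds")
    redrive_policy = {
        "deadLetterTargetArn": dlq_arn,
        # After this many receives without a delete, SQS moves the message to the DLQ.
        "maxReceiveCount": str(max_receive_count),
    }
    return {
        "RedrivePolicy": json.dumps(redrive_policy),
        # Should exceed the longest expected transcription time; otherwise a
        # job still in progress becomes visible again and may be processed twice.
        "VisibilityTimeout": str(visibility_timeout),
    }
```

The resulting dictionary can be passed as `Attributes` to a boto3 SQS client's `create_queue` or `set_queue_attributes` call.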
sequenceDiagram
participant Client as Client App
participant S3 as S3 Bucket
participant SQS as SQS Queue
participant Worker as EC2 Worker
participant Model as AI Model
Client->>S3: Upload Audio File
Client->>SQS: Send Message (S3 Path)
loop Polling
Worker->>SQS: Receive Message
end
Worker->>S3: Download Audio File
Worker->>Model: Transcribe Audio
Model-->>Worker: Return Text
Worker->>S3: Upload Transcript JSON
Worker->>SQS: Delete Message
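The worker side of the sequence above can be sketched as a single polling step. This is an illustration under stated assumptions rather than the project's worker code: the message schema matches no documented format, the `transcripts/` key prefix and the function name are hypothetical, and `transcribe` is an injected callable standing in for the WhisperX or Voxtral model. The `sqs` and `s3` arguments are assumed to be boto3 clients.

```python
import json
import os
import tempfile


def process_one(sqs, s3, queue_url, output_bucket, transcribe, wait_seconds=20):
    """Handle at most one job; return the transcript key, or None if no job arrived."""
    # Long polling: wait up to wait_seconds for a message instead of busy-looping.
    response = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=wait_seconds
    )
    messages = response.get("Messages", [])
    if not messages:
        return None

    message = messages[0]
    job = json.loads(message["Body"])  # assumed schema: job_id, bucket, key

    # Download into a temporary directory that is removed after transcription.
    with tempfile.TemporaryDirectory() as workdir:
        audio_path = os.path.join(workdir, os.path.basename(job["s3_input_key"]))
        s3.download_file(job["s3_input_bucket"], job["s3_input_key"], audio_path)
        text = transcribe(audio_path)

    transcript_key = f"transcripts/{job['job_id']}.json"
    s3.put_object(
        Bucket=output_bucket,
        Key=transcript_key,
        Body=json.dumps({"job_id": job["job_id"], "text": text}).encode("utf-8"),
    )
    # Delete only after the transcript is stored. If anything above raises,
    # the message reappears after its visibility timeout and is retried.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return transcript_key
```

Deleting the message last is the design choice that matters here: a Spot interruption or a model error mid-job leaves the message in the queue, so another worker can pick it up, and repeated failures eventually land in the dead-letter queue.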
Use this project to build a cost-effective, scalable transcription backend for applications requiring high-volume audio processing. It is suitable for podcast transcription, meeting notes, or media archival where latency is secondary to cost and reliability.