A production-ready SaaS for real-time speech-to-text, built on WhisperLive, AWS Cognito, and a serverless backend.
https://github.com/davidbmar/transcription-realtime-whisper-cognito-s3-lambda · public · shipped
CloudDrive is a full-stack web application that provides real-time audio transcription over WebSocket connections to GPU instances running WhisperLive. The vanilla JavaScript frontend is hosted on S3/CloudFront and authenticated through AWS Cognito, and Node.js Lambda functions handle batch processing. The system supports offline recording with IndexedDB retry queues and includes a transcript editor with word-level highlighting.
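The offline retry queue mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: `RetryQueue`, `MemoryStore`, and the `put`/`getAll`/`delete` store interface are hypothetical names, and the in-memory store only stands in for an IndexedDB-backed one so the example runs outside a browser.

```javascript
// Minimal retry queue sketch. `store` is a hypothetical async interface
// (put/getAll/delete); in the browser it would wrap an IndexedDB object store.
class RetryQueue {
  constructor(store, upload) {
    this.store = store;
    this.upload = upload; // async (chunk) => void, expected to throw on failure
  }

  // Persist the chunk first so it survives a reload or a lost connection.
  async enqueue(id, data) {
    await this.store.put({ id, data, attempts: 0 });
  }

  // Try every pending chunk once. Successful chunks are removed; failed
  // chunks stay in the store with their attempt counter incremented.
  async flush() {
    const pending = await this.store.getAll();
    let sent = 0;
    for (const chunk of pending) {
      try {
        await this.upload(chunk);
        await this.store.delete(chunk.id);
        sent += 1;
      } catch (err) {
        await this.store.put({ ...chunk, attempts: chunk.attempts + 1 });
      }
    }
    return sent;
  }
}

// In-memory stand-in for the IndexedDB-backed store (demonstration only).
class MemoryStore {
  constructor() {
    this.items = new Map();
  }
  async put(item) {
    this.items.set(item.id, { ...item });
  }
  async getAll() {
    return [...this.items.values()].map((item) => ({ ...item }));
  }
  async delete(id) {
    this.items.delete(id);
  }
}
```

In a browser, `flush()` would typically be called when the `online` event fires and again after each new recording is enqueued.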
./scripts/005-setup-configuration.sh
./scripts/010-setup-edge-box.sh
./scripts/305-setup-whisperlive-edge.sh
./scripts/020-deploy-gpu-instance.sh
./scripts/310-configure-whisperlive-gpu.sh
./scripts/030-configure-gpu-security.sh
./scripts/031-configure-edge-box-security.sh
./scripts/420-deploy-cognito-stack.sh
./scripts/425-deploy-recorder-ui.sh
./scripts/430-create-cognito-user.sh
flowchart TD
User[Browser Client] -->|WSS| Edge[Caddy Edge Box]
User -->|HTTPS| CF[CloudFront CDN]
Edge -->|TCP:9090| GPU[GPU Instance: WhisperLive]
CF -->|Static Assets| S3[S3 Bucket]
CF -->|API Requests| AGW[API Gateway]
AGW -->|Auth| Cognito[AWS Cognito]
AGW -->|Invoke| Lambda[AWS Lambda Node.js]
Lambda -->|Read/Write| S3
Lambda -->|Trigger| Batch[Batch Transcription]
The frontend is vanilla HTML/JS built on the MediaRecorder API and WebSockets. The backend uses the Serverless Framework to deploy Node.js 18.x Lambda functions behind API Gateway. Infrastructure includes an Edge Box (Caddy reverse proxy) and a GPU instance (g4dn.xlarge) running WhisperLive. Bash scripts automate deployment, driving the AWS CLI, CloudFormation, and instance configuration.
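To make the API Gateway, Cognito, and Lambda path concrete, here is a minimal handler sketch. It assumes a REST API with a Cognito user pool authorizer, which exposes token claims at `event.requestContext.authorizer.claims`. The `users/<sub>/audio/` key layout and the `fileName` request field are hypothetical and are not taken from the repository.

```javascript
// Hypothetical Lambda handler: derive a per-user S3 key from Cognito claims.
// Assumes a REST API with a Cognito user pool authorizer, so claims arrive
// at event.requestContext.authorizer.claims.
async function handler(event) {
  const claims = event?.requestContext?.authorizer?.claims;
  if (!claims || !claims.sub) {
    return { statusCode: 401, body: JSON.stringify({ error: "Unauthorized" }) };
  }

  let body;
  try {
    body = JSON.parse(event.body || "{}");
  } catch (err) {
    return { statusCode: 400, body: JSON.stringify({ error: "Invalid JSON" }) };
  }

  // Replace anything outside [A-Za-z0-9_.-] so the name cannot escape the prefix.
  const fileName = String(body.fileName || "recording.webm").replace(/[^\w.-]/g, "_");

  // Hypothetical layout: each user writes only under their own Cognito `sub`.
  const key = `users/${claims.sub}/audio/${fileName}`;

  return {
    statusCode: 200,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ key }),
  };
}
```

Scoping keys by the `sub` claim rather than by a client-supplied user ID keeps one authenticated user from addressing another user's objects.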
sequenceDiagram
participant Browser
participant Caddy as Edge Box (Caddy)
participant Whisper as WhisperLive (GPU)
participant S3
participant Lambda
Browser->>Caddy: WebSocket Connect /transcribe
Caddy->>Whisper: Forward Audio Stream
Whisper-->>Caddy: Real-time Text Chunks
Caddy-->>Browser: Return Transcribed Text
Note over Browser,S3: Session End / Batch Upload
Browser->>S3: Upload Audio File (via Presigned URL)
S3-->>Browser: Confirm Upload
Browser->>Lambda: Trigger Batch Transcription
Lambda->>Whisper: Submit Audio for Processing
Whisper-->>Lambda: Final Transcript
Lambda->>S3: Store Transcript JSON
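The batch half of the sequence above (presigned upload, then a trigger call) might look like this from the browser side. This is a sketch only: the `/upload-url` and `/batch` routes, the `uploadUrl` and `key` response fields, and the `Bearer` token format are assumptions made for illustration. `fetchFn` is injectable so the flow can be exercised without a network.

```javascript
// Sketch of the client-side batch flow. The routes, response fields, and
// Authorization format are hypothetical; `fetchFn` defaults to global fetch.
async function uploadAndTranscribe(apiBase, idToken, audio, fetchFn = fetch) {
  const apiHeaders = {
    "Content-Type": "application/json",
    Authorization: `Bearer ${idToken}`,
  };

  // 1. Ask the API for a presigned S3 URL for this recording.
  const urlRes = await fetchFn(`${apiBase}/upload-url`, {
    method: "POST",
    headers: apiHeaders,
    body: JSON.stringify({ contentType: audio.type }),
  });
  if (!urlRes.ok) throw new Error(`upload-url failed: ${urlRes.status}`);
  const { uploadUrl, key } = await urlRes.json();

  // 2. PUT the audio directly to S3. The signature is already in the URL,
  //    so no Authorization header is sent with this request.
  const putRes = await fetchFn(uploadUrl, {
    method: "PUT",
    headers: { "Content-Type": audio.type },
    body: audio,
  });
  if (!putRes.ok) throw new Error(`S3 upload failed: ${putRes.status}`);

  // 3. Ask the backend to start batch transcription for the uploaded object.
  const jobRes = await fetchFn(`${apiBase}/batch`, {
    method: "POST",
    headers: apiHeaders,
    body: JSON.stringify({ key }),
  });
  if (!jobRes.ok) throw new Error(`batch trigger failed: ${jobRes.status}`);
  return jobRes.json();
}
```

Uploading straight to S3 keeps large audio payloads out of API Gateway and Lambda, which only handle the small JSON requests before and after the upload.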
Use this project as a reference architecture for building low-latency AI-powered media services on AWS. It demonstrates how to bridge browser-based media capture with high-performance GPU inference endpoints while maintaining secure, serverless storage and authentication patterns. The template-based UI deployment strategy is also reusable for static sites requiring environment-specific configuration injection.
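As an illustration of environment-specific configuration injection, the following sketch substitutes placeholders in a UI template at deploy time. The `{{NAME}}` placeholder syntax, the `renderTemplate` function, and the variable names are hypothetical. The repository's deployment scripts may use a different mechanism.

```javascript
// Replace {{NAME}} placeholders in a template with values from `vars`.
// The placeholder syntax and variable names are hypothetical.
function renderTemplate(template, vars) {
  return template.replace(/\{\{\s*([A-Z0-9_]+)\s*\}\}/g, (match, name) => {
    if (!Object.prototype.hasOwnProperty.call(vars, name)) {
      // Fail loudly rather than shipping a page with an unresolved placeholder.
      throw new Error(`Missing config value: ${name}`);
    }
    return String(vars[name]);
  });
}

// Example: inject per-environment endpoints into a config snippet.
const rendered = renderTemplate(
  'const CONFIG = { apiUrl: "{{API_URL}}", wsUrl: "{{WS_URL}}" };',
  { API_URL: "https://api.example.com", WS_URL: "wss://edge.example.com" }
);
console.log(rendered);
```

Throwing on a missing value means a misconfigured environment fails at deploy time instead of producing a broken page in the browser.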