CloudDrive: Real-Time Transcription Platform

A production-ready SaaS for real-time speech-to-text using WhisperLive, AWS Cognito, and a serverless backend.

https://github.com/davidbmar/transcription-realtime-whisper-cognito-s3-lambda  ·  public  ·  shipped

What it is

CloudDrive is a full-stack web application that provides real-time audio transcription via WebSocket-connected GPU instances running WhisperLive. The vanilla JavaScript frontend is hosted on S3/CloudFront and authenticated through AWS Cognito, and Node.js Lambda functions handle batch processing. The system supports offline recording with IndexedDB retry queues and includes a transcript editor with word-level highlighting.
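
The retry-queue idea mentioned above can be illustrated with a minimal sketch of the scheduling logic only. Everything here is an assumption made for illustration: the record shape (`id`, `attempts`, `nextAttemptAt`), the function names, and the backoff constants are hypothetical and are not taken from the repository. In the browser the records would be persisted in IndexedDB; that part is omitted so the logic stays readable on its own.

```javascript
// Hypothetical retry-queue scheduling helpers (not the project's actual code).
// Records would be stored in IndexedDB in the browser; here they are plain objects.
const BASE_DELAY_MS = 1000;  // assumed delay before the first retry
const MAX_DELAY_MS = 60000;  // assumed upper bound on any retry delay

// Exponential backoff: 1s, 2s, 4s, ... capped at MAX_DELAY_MS.
function retryDelayMs(attempts) {
  return Math.min(BASE_DELAY_MS * 2 ** attempts, MAX_DELAY_MS);
}

// Return a copy of the record after a failed upload, scheduled for a later retry.
function markFailed(record, now) {
  const delay = retryDelayMs(record.attempts);
  return { ...record, attempts: record.attempts + 1, nextAttemptAt: now + delay };
}

// Select the records whose retry time has arrived, earliest scheduled first.
function dueRecords(records, now) {
  return records
    .filter((r) => r.nextAttemptAt <= now)
    .sort((a, b) => a.nextAttemptAt - b.nextAttemptAt);
}
```

A caller would typically run `dueRecords` when the browser reports that it is back online, attempt each upload, and write the result of `markFailed` back to storage for any upload that fails again.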

Quickstart

./scripts/005-setup-configuration.sh
./scripts/010-setup-edge-box.sh
./scripts/305-setup-whisperlive-edge.sh
./scripts/020-deploy-gpu-instance.sh
./scripts/310-configure-whisperlive-gpu.sh
./scripts/030-configure-gpu-security.sh
./scripts/031-configure-edge-box-security.sh
./scripts/420-deploy-cognito-stack.sh
./scripts/425-deploy-recorder-ui.sh
./scripts/430-create-cognito-user.sh

Architecture

flowchart TD
    User[Browser Client] -->|WSS| Edge[Caddy Edge Box]
    User -->|HTTPS| CF[CloudFront CDN]
    Edge -->|TCP:9090| GPU[GPU Instance: WhisperLive]
    CF -->|Static Assets| S3[S3 Bucket]
    CF -->|API Requests| AGW[API Gateway]
    AGW -->|Auth| Cognito[AWS Cognito]
    AGW -->|Invoke| Lambda[AWS Lambda Node.js]
    Lambda -->|Read/Write| S3
    Lambda -->|Trigger| Batch[Batch Transcription]

How it's built

The frontend is built with vanilla HTML/JS, using the MediaRecorder API and WebSockets. The backend uses the Serverless Framework to deploy Node.js 18.x Lambda functions behind API Gateway. Infrastructure includes an Edge Box (Caddy reverse proxy) and a GPU instance (g4dn.xlarge) for WhisperLive. Bash scripts automate deployment by driving the AWS CLI, CloudFormation, and instance configuration.
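
As an illustration of how a Node.js Lambda behind API Gateway might use the Cognito identity, here is a minimal sketch. It assumes a REST API with a Cognito user pool authorizer, which exposes the verified token claims at `event.requestContext.authorizer.claims`; the source does not confirm which API Gateway type is used. The key layout `users/<sub>/audio/<sessionId>.webm` and the function names are hypothetical.

```javascript
// Assumption: REST API + Cognito user pool authorizer, so verified claims are
// available at event.requestContext.authorizer.claims.
function userIdFromEvent(event) {
  const claims = event?.requestContext?.authorizer?.claims;
  return claims && typeof claims.sub === "string" ? claims.sub : null;
}

// Hypothetical S3 key layout that scopes every object to its owner.
function audioKeyFor(userId, sessionId) {
  return `users/${userId}/audio/${sessionId}.webm`;
}

function jsonResponse(statusCode, payload) {
  return {
    statusCode,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

// Lambda handler sketch (Node.js 18.x). No AWS SDK calls are made here.
async function handler(event) {
  const userId = userIdFromEvent(event);
  if (!userId) return jsonResponse(401, { error: "Unauthorized" });

  let sessionId;
  try {
    sessionId = JSON.parse(event.body || "{}").sessionId;
  } catch {
    return jsonResponse(400, { error: "Body must be valid JSON" });
  }
  // Restrict the id to safe characters so it cannot alter the key path.
  if (typeof sessionId !== "string" || !/^[A-Za-z0-9_-]+$/.test(sessionId)) {
    return jsonResponse(400, { error: "Invalid sessionId" });
  }
  return jsonResponse(200, { key: audioKeyFor(userId, sessionId) });
}
```

Deriving the key prefix from the `sub` claim rather than from anything the client sends keeps one user from addressing another user's objects.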

How it runs

sequenceDiagram
    participant Browser
    participant Caddy as Edge Box (Caddy)
    participant Whisper as WhisperLive (GPU)
    participant S3
    participant Lambda
    
    Browser->>Caddy: WebSocket Connect /transcribe
    Caddy->>Whisper: Forward Audio Stream
    Whisper-->>Caddy: Real-time Text Chunks
    Caddy-->>Browser: Return Transcribed Text
    
    Note over Browser,S3: Session End / Batch Upload
    Browser->>S3: Upload Audio File (via Presigned URL)
    S3-->>Browser: Confirm Upload
    Browser->>Lambda: Trigger Batch Transcription
    Lambda->>Whisper: Submit Audio for Processing
    Whisper-->>Lambda: Final Transcript
    Lambda->>S3: Store Transcript JSON
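
The upload step in the sequence above can be sketched as a plain HTTP `PUT` to the presigned URL. This is a minimal illustration, not the project's client code: the function name and parameters are hypothetical, and `fetchImpl` is injectable only so the sketch can be exercised without a network. If the URL was signed with a specific content type, the request is expected to send the same `Content-Type` header.

```javascript
// Hypothetical helper: upload recorded audio to S3 through a presigned URL.
// `audio` can be a Blob in the browser; `fetchImpl` defaults to the global fetch.
async function uploadViaPresignedUrl(presignedUrl, audio, contentType, fetchImpl = fetch) {
  const response = await fetchImpl(presignedUrl, {
    method: "PUT",                          // presigned uploads of this kind use PUT
    headers: { "Content-Type": contentType },
    body: audio,
  });
  if (!response.ok) {
    // Surface the status so a caller can decide whether to queue a retry.
    throw new Error(`Upload failed with HTTP ${response.status}`);
  }
  return response.status;
}
```

In the browser this might be called as `uploadViaPresignedUrl(url, blob, "audio/webm")`, where `url` comes from the backend and `blob` from the recorder. A rejected promise is the natural point to hand the recording to an offline retry queue.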

How to apply & reuse

Use this project as a reference architecture for building low-latency AI-powered media services on AWS. It demonstrates how to bridge browser-based media capture with high-performance GPU inference endpoints while maintaining secure, serverless storage and authentication patterns. The template-based UI deployment strategy is also reusable for static sites requiring environment-specific configuration injection.
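
To make the configuration-injection idea concrete, here is a minimal sketch of rendering a static asset from a template. The `{{NAME}}` placeholder syntax, the function name, and the example keys are assumptions for illustration; the repository's actual template format is not shown in this summary.

```javascript
// Hypothetical template renderer: replaces {{NAME}} tokens with per-environment values.
function renderTemplate(template, values) {
  return template.replace(/\{\{\s*([A-Z0-9_]+)\s*\}\}/g, (match, name) => {
    // Fail loudly rather than ship a page with an unresolved placeholder.
    if (!Object.prototype.hasOwnProperty.call(values, name)) {
      throw new Error(`Missing value for placeholder ${name}`);
    }
    return String(values[name]);
  });
}

// Example: inject an API endpoint and a region (both values are made up).
const rendered = renderTemplate(
  'const CONFIG = { apiUrl: "{{API_URL}}", region: "{{ REGION }}" };',
  { API_URL: "https://api.example.invalid", REGION: "us-east-1" }
);
console.log(rendered);
```

Throwing on a missing key means a misconfigured environment stops the deploy step instead of producing a broken page.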

At a glance

Capabilities: Real-time Speech-to-Text, Batch Audio Processing, Offline Data Persistence, Secure User Authentication, Cloud Storage Integration, Automated Infrastructure Provisioning
Components: Vanilla JS Frontend, Caddy Reverse Proxy, WhisperLive Inference Engine, AWS Lambda Functions, Serverless Framework Config, Bash Deployment Scripts
Tech: HTML5, JavaScript, Node.js 18.x, AWS Lambda, Amazon S3, Amazon CloudFront, Amazon Cognito, WhisperLive, Faster-Whisper, Caddy, Serverless Framework, Playwright
Depends on: AWS CLI, Node.js 18+, Bash Shell, SSH Key Pair, GPU Instance (g4dn.xlarge)
Integrates with: AWS Cognito, Amazon S3, Amazon CloudFront, Google Docs (via API), IndexedDB
Patterns: Serverless Backend, Edge Computing, WebSocket Streaming, Offline-First Architecture, Infrastructure as Code (Scripts), Template-Based UI Deployment
Reuse tags: real-time-transcription, aws-serverless, whisper-live, cognito-auth, gpu-inference, vanilla-js, saas-template
