Hybrid architecture combining NVIDIA Riva Conformer-CTC for real-time streaming ASR with an AWS serverless backend for secure, chunked audio storage and session management.
https://github.com/davidbmar/transcriber-2-pass-riva-conformer-cf-s3-lambda-cognito-adapter-2025-10-14 · public · shipped
A production-grade speech recognition system that splits processing into two paths: a low-latency WebSocket bridge to an NVIDIA Riva GPU instance for immediate transcription, and an AWS serverless API (Lambda + S3 + Cognito) for storing raw audio chunks, managing sessions, and finalizing recordings. The first path gives users live transcripts while they speak; the second keeps the original audio stored in S3 and tied to an authenticated session.
git clone https://github.com/davidbmar/transcriber-2-pass-riva-conformer-cf-s3-lambda-cognito-adapter-2025-10-14
cd transcriber-2-pass-riva-conformer-cf-s3-lambda-cognito-adapter-2025-10-14
cp .env.example .env
nano .env
./scripts/010-setup-build-box.sh
aws configure
./scripts/020-deploy-gpu-instance.sh
./scripts/100-deploy-conformer-streaming.sh
./scripts/110-setup-websocket-bridge.sh
./scripts/120-setup-https-demo.sh
echo "Open: https://$(curl -s ifconfig.me):8444"
flowchart TD
Browser[Browser Microphone] -->|WSS Audio Chunks| WS_Bridge[WebSocket Bridge :8443]
Browser -->|HTTPS API| API_GW[AWS API Gateway]
subgraph Build_Box [Build Box / EC2]
WS_Bridge -->|gRPC Streaming| Riva[RIVA 2.19 Conformer CTC]
Demo[HTTPS Demo UI :8444] --> Browser
end
subgraph AWS_Cloud [AWS Serverless Backend]
API_GW --> Auth[Cognito Authorizer]
Auth --> Lambda[Lambda Functions]
Lambda -->|Presign/Store| S3[(S3 Bucket)]
Lambda -->|Manifest| S3
end
Riva -.->|Transcription Text| Browser
The system uses shell scripts for infrastructure provisioning (EC2 g4dn instances, NVIDIA drivers, Docker). The real-time path is a Python WebSocket-to-gRPC bridge that connects browsers to Riva. The storage path is a set of TypeScript AWS Lambda functions behind API Gateway, authenticated with Amazon Cognito JWTs; browsers upload audio chunks directly to S3 through presigned URLs, and the Lambdas maintain a manifest for each session.
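The real-time path described above can be pictured as a small relay: messages arriving from the browser are queued, exposed as an audio stream to the recognizer, and each partial transcript is pushed back. The following is a minimal sketch of that pattern using only the Python standard library. It is not the repository's bridge code: `fake_recognizer` is a hypothetical stand-in for a Riva streaming gRPC call, and the plain list of byte strings stands in for WebSocket messages.

```python
import asyncio
from typing import AsyncIterator, Iterable, List


async def fake_recognizer(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming ASR call: one partial per chunk."""
    total = 0
    async for chunk in audio:
        total += len(chunk)
        yield f"partial after {total} bytes"


async def bridge(incoming: Iterable[bytes]) -> List[str]:
    """Relay incoming audio chunks to the recognizer and collect its partials."""
    # A bounded queue applies backpressure if the recognizer falls behind.
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)

    async def pump() -> None:
        # In a real bridge this loop would read WebSocket messages.
        for message in incoming:
            await queue.put(message)
        await queue.put(None)  # sentinel: the client stopped sending audio

    async def audio_stream() -> AsyncIterator[bytes]:
        # Turns the queue into the stream that the recognizer consumes.
        while True:
            chunk = await queue.get()
            if chunk is None:
                return
            yield chunk

    pushed: List[str] = []
    pump_task = asyncio.create_task(pump())
    async for text in fake_recognizer(audio_stream()):
        pushed.append(text)  # a real bridge would send this back over WSS
    await pump_task
    return pushed


if __name__ == "__main__":
    print(asyncio.run(bridge([b"ab", b"cde"])))
```

The queue decouples the receive side from the recognizer side, which is one common way to connect a message-based socket to a streaming RPC; the actual bridge may be structured differently.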
sequenceDiagram
participant User as Browser
participant API as API Gateway/Lambda
participant S3 as S3 Storage
participant WS as WebSocket Bridge
participant Riva as NVIDIA Riva GPU
Note over User, Riva: Session Setup & Upload
User->>API: POST /sessions (Create Session)
API->>User: Return sessionId & basePrefix
loop For each audio chunk
User->>API: POST /chunks/presign
API->>User: Return Presigned PUT URL
User->>S3: PUT Audio Chunk (Direct)
User->>API: POST /chunks/complete
API->>S3: Verify Object Exists
API->>S3: Update Manifest
end
Note over User, Riva: Real-time Transcription
User->>WS: Connect WSS
WS->>Riva: Init gRPC Streaming
loop Streaming Audio
User->>WS: Send Audio Chunk
WS->>Riva: Stream Audio Data
Riva->>WS: Return Partial Transcript
WS->>User: Push Transcript
end
User->>API: POST /sessions/{id}/finalize
API->>S3: Seal Manifest
API->>User: Session Finalized
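The storage half of the sequence above (complete a chunk, update the manifest, seal on finalize) can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the project's Lambda code, which is written in TypeScript: the `SessionManifest` class, the key layout under `basePrefix`, the `.webm` extension, and the field names are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class SessionManifest:
    session_id: str
    base_prefix: str
    chunks: List[Dict] = field(default_factory=list)
    sealed: bool = False

    def chunk_key(self, seq: int) -> str:
        # Assumed key layout; zero-padding keeps keys in lexical order.
        return f"{self.base_prefix}/chunks/{seq:06d}.webm"

    def complete_chunk(self, seq: int, size_bytes: int) -> None:
        """Mirror of POST /chunks/complete: record a chunk that was uploaded."""
        if self.sealed:
            raise ValueError("session is finalized; no more chunks accepted")
        if any(c["seq"] == seq for c in self.chunks):
            return  # repeated completion of the same chunk is a no-op
        self.chunks.append(
            {"seq": seq, "key": self.chunk_key(seq), "bytes": size_bytes}
        )
        self.chunks.sort(key=lambda c: c["seq"])  # tolerate out-of-order arrival

    def finalize(self) -> str:
        """Mirror of POST /sessions/{id}/finalize: seal and serialize."""
        self.sealed = True
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    manifest = SessionManifest("s1", "users/u1/sessions/s1")
    manifest.complete_chunk(1, 200)
    manifest.complete_chunk(0, 100)
    print(manifest.finalize())
```

Making chunk completion idempotent and order-insensitive is a design choice that suits browser uploads, where retries and reordering are common; whether the real manifest behaves this way is not stated in the source.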
Use this when you need both instant transcription feedback for users and a permanent archive of the original audio. It suits meeting assistants, call center analytics, and medical dictation, where latency matters but data integrity and security are paramount.