Ultra-low latency real-time audio transcription system leveraging NVIDIA Riva RNN-T on AWS GPU instances.
https://github.com/davidbmar/nvidia-riva-rnnt-transcription · private · shipped
A production-ready deployment framework for NVIDIA Riva Speech Skills, specifically optimized for Recurrent Neural Network Transducer (RNN-T) models. It provides a WebSocket-enabled FastAPI wrapper around the Riva gRPC service, enabling streaming transcription with 100-200ms latency. The system automates the provisioning of AWS EC2 g4dn.xlarge instances, manages S3-based artifact storage for Riva containers, and includes comprehensive health monitoring and testing scripts.
git clone https://github.com/davidbmar/nvidia-riva-rnnt-transcription.git
cd nvidia-riva-rnnt-transcription
./scripts/step-001-download-riva-to-s3.sh
./scripts/step-002-organize-s3-bintarball.sh
./scripts/step-003-prepare-gpu-instance.sh
./scripts/step-004-install-riva-from-s3.sh
./scripts/step-005-configure-riva-services.sh
./scripts/step-006-test-riva-deployment.sh
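Because each step depends on the one before it, a failed step should stop the run. The following is a minimal sketch of that sequencing in Python, not part of the repository: the script names are copied from the commands above, while `run_steps` and its `runner` parameter are hypothetical helpers introduced here for illustration.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Optional

# Script names copied from the quick-start commands above.
STEPS = [
    "step-001-download-riva-to-s3.sh",
    "step-002-organize-s3-bintarball.sh",
    "step-003-prepare-gpu-instance.sh",
    "step-004-install-riva-from-s3.sh",
    "step-005-configure-riva-services.sh",
    "step-006-test-riva-deployment.sh",
]


def run_steps(
    scripts_dir: str = "scripts",
    runner: Optional[Callable[[List[str]], int]] = None,
) -> List[str]:
    """Run every step in order and stop at the first non-zero exit status.

    `runner` takes a command list and returns an exit status. It is
    injectable (a hypothetical convenience) so the sequencing logic can be
    exercised without touching AWS.
    """
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd).returncode

    completed: List[str] = []
    for name in STEPS:
        status = runner([str(Path(scripts_dir) / name)])
        if status != 0:
            raise RuntimeError(
                f"{name} exited with status {status}; completed so far: {completed}"
            )
        completed.append(name)
    return completed
```

Calling `run_steps()` from the repository root would execute the six scripts for real, so it assumes AWS credentials are already configured.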
flowchart TD
Client[Client Application] -->|WebSocket/HTTP| FastAPI[FastAPI Server]
FastAPI -->|gRPC| Riva[Riva ASR Service]
Riva -->|GPU Compute| GPU[NVIDIA GPU]
FastAPI -->|Read/Write| S3[(AWS S3 Bucket)]
S3 -->|Audio Input| FastAPI
FastAPI -->|Transcript Output| S3
subgraph AWS EC2 Instance
FastAPI
Riva
GPU
end
The system is built from six sequential Bash scripts that perform the infrastructure provisioning through the AWS CLI. Together they download the NVIDIA Riva binaries to S3, launch a GPU-enabled EC2 instance, install Docker and the NVIDIA drivers, and configure the Riva server. The application layer consists of Python FastAPI servers that call the Riva gRPC endpoint for transcription and use boto3 for S3 integration. A mock server is also included for CPU-only development and testing.
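One way to keep a GPU-backed client and a CPU-only mock interchangeable is to put both behind a shared interface. The sketch below illustrates that pattern only. `Transcriber`, `MockTranscriber`, `make_transcriber`, and the `USE_MOCK` variable are all hypothetical names, and the result dictionaries are an assumed shape, not the repository's actual response format.

```python
import os
from typing import Iterable, Iterator, Mapping, Optional, Protocol


class Transcriber(Protocol):
    """Hypothetical interface shared by the real and mock backends."""

    def stream(self, chunks: Iterable[bytes]) -> Iterator[dict]:
        ...


class MockTranscriber:
    """CPU-only stand-in: one partial result per chunk, then one final result."""

    def stream(self, chunks: Iterable[bytes]) -> Iterator[dict]:
        total_bytes = 0
        for index, chunk in enumerate(chunks, start=1):
            total_bytes += len(chunk)
            yield {"is_final": False, "text": f"[mock partial {index}]"}
        yield {"is_final": True, "text": f"[mock transcript, {total_bytes} bytes]"}


def make_transcriber(env: Optional[Mapping[str, str]] = None) -> Transcriber:
    """Pick a backend from the (assumed) USE_MOCK environment variable."""
    env = os.environ if env is None else env
    if env.get("USE_MOCK", "1") == "1":
        return MockTranscriber()
    # A real backend would wrap the Riva gRPC client here; it is left out
    # because this sketch does not assume that client's API.
    raise NotImplementedError("wire in a Riva gRPC-backed transcriber here")
```

With this shape, the request handlers depend only on `stream`, so development on a machine without a GPU needs no code changes beyond the environment variable.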
sequenceDiagram
participant C as Client
participant F as FastAPI Server
participant R as Riva gRPC Service
participant G as NVIDIA GPU
participant S as AWS S3
C->>F: POST /transcribe/file or WS Connect
alt File Upload
F->>S: Download Audio File
S-->>F: Audio Data
F->>R: StreamingRecognize Request
R->>G: Process Audio Frames
G-->>R: Transcription Tokens
R-->>F: Final Transcript
F->>S: Upload Transcript JSON
else WebSocket Stream
C->>F: Stream Audio Chunks
F->>R: Forward Audio Chunks
R->>G: Real-time Inference
G-->>R: Partial/Final Results
R-->>F: Stream Response
F-->>C: Stream Transcript Updates
end
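The WebSocket branch of the diagram above forwards audio in small chunks, and the chunk duration sets a floor on how quickly a partial result can appear. The sketch below shows the chunk-size arithmetic under stated assumptions: 16 kHz, 16-bit mono PCM and 100 ms chunks. None of these values come from the source, and the function names are illustrative.

```python
from typing import Iterator

# Assumed audio format and chunk duration (not stated in the source).
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit PCM
CHUNK_MS = 100


def chunk_size_bytes(
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
    chunk_ms: int = CHUNK_MS,
) -> int:
    """Bytes of mono PCM audio that cover `chunk_ms` milliseconds."""
    return sample_rate_hz * bytes_per_sample * chunk_ms // 1000


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of `pcm`; the last slice may be shorter."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]
```

Under these assumptions a 100 ms chunk is 3200 bytes, so one second of audio is split into ten chunks. Shorter chunks reduce buffering delay at the cost of more messages per second.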
To use the project, run the provided deployment scripts in sequence to provision a dedicated GPU instance in your AWS account. Once deployed, applications can connect to the exposed FastAPI endpoints for file-based or streaming transcription, or directly to the Riva gRPC port for high-performance integration. It is suitable for real-time captioning, live meeting transcription, and voice-controlled interfaces that require sub-second response times.
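A streaming client might look like the following sketch, which uses the third-party `websockets` package. The port `8000`, the path `/ws/transcribe`, and the message format are assumptions made for illustration; the source does not document them. The sketch also assumes the server closes the connection once it has sent its last result.

```python
import asyncio
from typing import Iterable


def build_ws_url(host: str, port: int = 8000, path: str = "/ws/transcribe") -> str:
    """Compose the WebSocket URL. Port and path are assumed, not documented."""
    return f"ws://{host}:{port}{path}"


async def stream_audio(url: str, chunks: Iterable[bytes]) -> None:
    """Send audio chunks and print every message the server returns."""
    # Third-party dependency (pip install websockets), imported here so that
    # build_ws_url stays usable without it.
    import websockets

    async with websockets.connect(url) as ws:

        async def send_all() -> None:
            for chunk in chunks:
                await ws.send(chunk)  # bytes are sent as binary frames

        async def receive_all() -> None:
            # Iteration ends when the server closes the connection normally.
            async for message in ws:
                print(message)

        # Run both directions concurrently so partial results can arrive
        # while audio is still being sent.
        await asyncio.gather(send_all(), receive_all())
```

A caller would run it with `asyncio.run(stream_audio(build_ws_url("localhost"), chunks))`, where `chunks` is any iterable of raw audio bytes in the format the server expects.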