NVIDIA Riva RNN-T Real-Time Transcription System

Ultra-low latency real-time audio transcription system leveraging NVIDIA Riva RNN-T on AWS GPU instances.

https://github.com/davidbmar/nvidia-riva-rnnt-transcription  ·  private  ·  shipped

What it is

A production-ready deployment framework for NVIDIA Riva Speech Skills, optimized for Recurrent Neural Network Transducer (RNN-T) models. It wraps the Riva gRPC service in a FastAPI server with WebSocket support, enabling streaming transcription with 100-200 ms latency. The system automates provisioning of AWS EC2 g4dn.xlarge instances, manages S3-based artifact storage for Riva containers, and includes health monitoring and testing scripts.
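
To put the latency figure in context: in a streaming setup, the duration of each audio chunk sent to the server sets a floor on how soon a partial result can come back. The sketch below works out chunk sizes for raw PCM audio. The 16 kHz, 16-bit mono format and the function names are assumptions for illustration, not values taken from the repository.

```python
def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int,
                     sample_width_bytes: int = 2, channels: int = 1) -> int:
    """Bytes of raw PCM audio that cover chunk_ms milliseconds."""
    frames = sample_rate_hz * chunk_ms // 1000
    return frames * sample_width_bytes * channels


def split_pcm(audio: bytes, size: int):
    """Yield consecutive chunks of at most `size` bytes."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(audio), size):
        yield audio[start:start + size]


if __name__ == "__main__":
    # Assumed format: 16 kHz sample rate, 16-bit samples, one channel.
    size = chunk_size_bytes(16_000, 100)
    print(size)  # 3200
    one_second = bytes(16_000 * 2)  # one second of silence
    print(len(list(split_pcm(one_second, size))))  # 10
```

Under these assumptions a 100 ms chunk is 3200 bytes, so one second of audio becomes ten messages. Chunk duration is only one part of end-to-end latency; network and inference time add to it.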

Features

Quickstart

git clone https://github.com/davidbmar/nvidia-riva-rnnt-transcription.git
cd nvidia-riva-rnnt-transcription
./scripts/step-001-download-riva-to-s3.sh
./scripts/step-002-organize-s3-bintarball.sh
./scripts/step-003-prepare-gpu-instance.sh
./scripts/step-004-install-riva-from-s3.sh
./scripts/step-005-configure-riva-services.sh
./scripts/step-006-test-riva-deployment.sh
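
The six scripts are meant to be run in order, and a later step is unlikely to succeed if an earlier one failed. As a minimal sketch, a small Python wrapper could run them sequentially and stop at the first non-zero exit code. The wrapper and its `run_script` and `run_steps` functions are hypothetical and not part of the repository; only the script paths come from the list above.

```python
import subprocess
from typing import Callable, List, Sequence

# Script paths exactly as listed in the quickstart.
STEPS = [
    "./scripts/step-001-download-riva-to-s3.sh",
    "./scripts/step-002-organize-s3-bintarball.sh",
    "./scripts/step-003-prepare-gpu-instance.sh",
    "./scripts/step-004-install-riva-from-s3.sh",
    "./scripts/step-005-configure-riva-services.sh",
    "./scripts/step-006-test-riva-deployment.sh",
]


def run_script(path: str) -> int:
    """Run one script and return its exit code."""
    return subprocess.run([path]).returncode


def run_steps(steps: Sequence[str],
              run: Callable[[str], int] = run_script) -> List[str]:
    """Run each script in order and return the ones that succeeded.

    Stops at the first script that exits with a non-zero code.
    """
    done: List[str] = []
    for step in steps:
        print(f"running {step}")
        if run(step) != 0:
            print(f"{step} failed; stopping")
            break
        done.append(step)
    return done
```

Called as `run_steps(STEPS)` from the repository root, this would return all six paths on a clean run. Passing a different `run` callable allows a dry run of the loop without touching AWS.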

Architecture

flowchart TD
    Client[Client Application] -->|WebSocket/HTTP| FastAPI[FastAPI Server]
    FastAPI -->|gRPC| Riva[Riva ASR Service]
    Riva -->|GPU Compute| GPU[NVIDIA GPU]
    FastAPI -->|Read/Write| S3[(AWS S3 Bucket)]
    S3 -->|Audio Input| FastAPI
    FastAPI -->|Transcript Output| S3
    subgraph AWS EC2 Instance
        FastAPI
        Riva
        GPU
    end

How it's built

The system is built from six sequential Bash scripts that handle infrastructure-as-code tasks through the AWS CLI. The scripts download NVIDIA Riva binaries to S3, launch a GPU-enabled EC2 instance, install Docker and NVIDIA drivers, and configure the Riva server. The application layer consists of Python FastAPI servers that call the Riva gRPC endpoint for transcription and use boto3 for S3 integration. A mock server is also included for CPU-only development and testing.
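
The value of a mock server is that the rest of the stack can be exercised without a GPU. The repository's mock implementation is not shown here; the following is only a sketch of the idea, with a hypothetical `MockTranscriber` class that returns canned partial and final results in place of real inference.

```python
from dataclasses import dataclass


@dataclass
class Result:
    text: str
    is_final: bool


class MockTranscriber:
    """Stand-in recognizer that reveals one canned word per audio chunk.

    Hypothetical helper for CPU-only testing; it performs no inference.
    """

    def __init__(self, script: str = "hello from the mock transcriber"):
        self._words = script.split()
        self._heard = 0

    def feed(self, chunk: bytes) -> Result:
        """Accept a chunk and return the partial transcript so far."""
        if chunk and self._heard < len(self._words):
            self._heard += 1
        return Result(" ".join(self._words[:self._heard]), is_final=False)

    def finish(self) -> Result:
        """Signal end of audio and return the final transcript."""
        return Result(" ".join(self._words[:self._heard]), is_final=True)


if __name__ == "__main__":
    mock = MockTranscriber("testing one two")
    for _ in range(2):
        print(mock.feed(b"\x00" * 3200))
    print(mock.finish())
```

Because the mock exposes the same partial-then-final shape that a streaming recognizer produces, client code written against it should need few changes when pointed at a real backend.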

How it runs

sequenceDiagram
    participant C as Client
    participant F as FastAPI Server
    participant R as Riva gRPC Service
    participant G as NVIDIA GPU
    participant S as AWS S3

    C->>F: POST /transcribe/file or WS Connect
    alt File Upload
        F->>S: Download Audio File
        S-->>F: Audio Data
        F->>R: StreamingRecognize Request
        R->>G: Process Audio Frames
        G-->>R: Transcription Tokens
        R-->>F: Final Transcript
        F->>S: Upload Transcript JSON
    else WebSocket Stream
        C->>F: Stream Audio Chunks
        F->>R: Forward Audio Chunks
        R->>G: Real-time Inference
        G-->>R: Partial/Final Results
        R-->>F: Stream Response
        F-->>C: Stream Transcript Updates
    end
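
The WebSocket branch of the diagram amounts to a relay loop: audio chunks arrive from the client, go to the recognizer, and transcript updates flow back as they become available. The asyncio sketch below shows that shape, with in-memory queues standing in for the WebSocket and a trivial `recognize` coroutine standing in for the Riva gRPC stream. All names here are hypothetical; it illustrates the data flow, not the project's actual server code.

```python
import asyncio
from typing import AsyncIterator


async def drain(queue: asyncio.Queue) -> AsyncIterator[bytes]:
    """Yield chunks from the queue until a None sentinel arrives."""
    while True:
        chunk = await queue.get()
        if chunk is None:
            return
        yield chunk


async def recognize(chunks: AsyncIterator[bytes]) -> AsyncIterator[dict]:
    """Placeholder recognizer: reports bytes received instead of text."""
    total = 0
    async for chunk in chunks:
        total += len(chunk)
        yield {"is_final": False, "bytes_seen": total}
    yield {"is_final": True, "bytes_seen": total}


async def relay(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Feed inbound audio to the recognizer and queue each update."""
    async for update in recognize(drain(inbound)):
        await outbound.put(update)


async def demo() -> list:
    inbound: asyncio.Queue = asyncio.Queue()
    outbound: asyncio.Queue = asyncio.Queue()
    # Two 100 ms chunks (assuming 16 kHz, 16-bit mono); None marks end of audio.
    for chunk in (b"\x00" * 3200, b"\x00" * 3200, None):
        await inbound.put(chunk)
    await relay(inbound, outbound)
    return [outbound.get_nowait() for _ in range(outbound.qsize())]


if __name__ == "__main__":
    for update in asyncio.run(demo()):
        print(update)
```

In a real server the inbound queue would be replaced by reads from the WebSocket and the outbound queue by sends to it, with the placeholder swapped for the streaming gRPC call.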

How to apply & reuse

To use the project, run the provided deployment scripts in sequence to provision a dedicated GPU instance in your AWS account. Once deployed, applications can connect to the exposed FastAPI endpoints for file-based or streaming transcription, or directly to the Riva gRPC port for high-performance integration. It is suitable for real-time captioning, live meeting transcription, and voice-controlled interfaces that require sub-second response times.

At a glance

Capabilities: Real-time streaming transcription, Batch file transcription, S3 event-driven processing, GPU-accelerated inference, Automated infrastructure deployment, Health monitoring and validation
Components: Deployment Scripts (Bash), FastAPI Application Server, Riva Speech Services Container, Mock RNN-T Server, AWS CLI Integration, Docker Compose Configuration
Tech: Python, Bash, FastAPI, NVIDIA Riva, gRPC, Docker, AWS EC2, AWS S3, WebSocket
Depends on: NVIDIA NGC Account, AWS Account with EC2/S3 permissions, AWS CLI, Git, SSH Key Pair
Integrates with: AWS S3, NVIDIA Riva gRPC API, WebSocket Clients, REST API Consumers
Patterns: Infrastructure as Code (Scripted), Microservices Architecture, Event-Driven Processing, Streaming Data Pipeline, Wrapper Pattern (FastAPI over gRPC)
Reuse tags: speech-recognition, nvidia-riva, aws-deployment, real-time-transcription, gpu-acceleration, fastapi, websocket-streaming
