NVIDIA Riva RNN-T Real-Time Transcription System

What it is

A production-ready deployment framework for NVIDIA Riva Speech Skills, specifically optimized for Recurrent Neural Network Transducer (RNN-T) models. It provides a WebSocket-enabled FastAPI wrapper around the Riva gRPC service, enabling streaming transcription with 100-200ms latency. The system automates the provisioning of AWS EC2 g4dn.xlarge instances, manages S3-based artifact storage for Riva containers, and includes comprehensive health monitoring and testing scripts.

Features

Ultra-low latency streaming transcription (100-200ms) via WebSocket
Automated AWS EC2 GPU instance provisioning and Riva installation
S3-optimized binary management for efficient container deployment
FastAPI REST wrapper for file upload and S3-triggered transcription
Comprehensive health checks and performance benchmarking tools
Mock server support for CPU-based development and testing

Quickstart

git clone https://github.com/davidbmar/nvidia-riva-rnnt-transcription.git
cd nvidia-riva-rnnt-transcription
./scripts/step-001-download-riva-to-s3.sh
./scripts/step-002-organize-s3-bintarball.sh
./scripts/step-003-prepare-gpu-instance.sh
./scripts/step-004-install-riva-from-s3.sh
./scripts/step-005-configure-riva-services.sh
./scripts/step-006-test-riva-deployment.sh

Architecture

flowchart TD
    Client[Client Application] -->|WebSocket/HTTP| FastAPI[FastAPI Server]
    FastAPI -->|gRPC| Riva[Riva ASR Service]
    Riva -->|GPU Compute| GPU[NVIDIA GPU]
    FastAPI -->|Read/Write| S3[(AWS S3 Bucket)]
    S3 -->|Audio Input| FastAPI
    FastAPI -->|Transcript Output| S3
    subgraph AWS EC2 Instance
        FastAPI
        Riva
        GPU
    end

How it's built

The system is built from six sequential Bash scripts that perform scripted infrastructure tasks through the AWS CLI. The scripts download NVIDIA Riva binaries to S3, launch a GPU-enabled EC2 instance, install Docker and NVIDIA drivers, and configure the Riva server. The application layer consists of Python FastAPI servers that call the Riva gRPC endpoint for transcription and use boto3 for S3 integration. A mock server is also included for CPU-only development and testing.
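
Because the six scripts must run in order, a small driver can make the sequence explicit and stop at the first failure. The following is a minimal sketch, not part of the repository: the names ordered_steps and run_steps are hypothetical, and it assumes the scripts follow the step-NNN-*.sh naming shown in the Quickstart and signal failure with a non-zero exit code.

```python
import re
import subprocess
from pathlib import Path

# Assumption: step scripts are named step-<number>-<description>.sh,
# as in the Quickstart listing.
STEP_PATTERN = re.compile(r"^step-(\d+)-.+\.sh$")


def ordered_steps(names):
    """Return the step script names sorted by their numeric prefix."""
    steps = []
    for name in names:
        match = STEP_PATTERN.match(name)
        if match:  # ignore anything that is not a step script
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def run_steps(scripts_dir, runner=subprocess.run):
    """Run each step script in order and stop at the first failure."""
    scripts_dir = Path(scripts_dir)
    completed = []
    for name in ordered_steps(p.name for p in scripts_dir.iterdir()):
        result = runner([str(scripts_dir / name)])
        if result.returncode != 0:
            raise RuntimeError(f"{name} exited with code {result.returncode}")
        completed.append(name)
    return completed
```

Calling run_steps("scripts") from the repository root would run the same sequence as the Quickstart. The runner parameter exists only so the loop can be exercised without launching real infrastructure.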

How it runs

sequenceDiagram
    participant C as Client
    participant F as FastAPI Server
    participant R as Riva gRPC Service
    participant G as NVIDIA GPU
    participant S as AWS S3

    C->>F: POST /transcribe/file or WS Connect
    alt File Upload
        F->>S: Download Audio File
        S-->>F: Audio Data
        F->>R: StreamingRecognize Request
        R->>G: Process Audio Frames
        G-->>R: Transcription Tokens
        R-->>F: Final Transcript
        F->>S: Upload Transcript JSON
    else WebSocket Stream
        C->>F: Stream Audio Chunks
        F->>R: Forward Audio Chunks
        R->>G: Real-time Inference
        G-->>R: Partial/Final Results
        R-->>F: Stream Response
        F-->>C: Stream Transcript Updates
    end
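
In the WebSocket branch of the diagram above, the client sends audio in small chunks rather than as one file. The sketch below shows one way to cut raw PCM audio into fixed-duration chunks before sending. The function name chunk_pcm is hypothetical, and the defaults (16 kHz, 16-bit mono, 100 ms per chunk) are illustrative assumptions, not values confirmed by this project.

```python
def chunk_pcm(audio, sample_rate=16000, chunk_ms=100, sample_width=2):
    """Yield fixed-duration chunks of mono PCM audio.

    Assumed defaults: 16 kHz, 16-bit (2-byte) samples, 100 ms chunks.
    The final chunk may be shorter than the others.
    """
    samples_per_chunk = sample_rate * chunk_ms // 1000
    chunk_bytes = samples_per_chunk * sample_width
    if chunk_bytes <= 0:
        raise ValueError("chunk size must be at least one sample")
    # Step through the buffer on sample-aligned boundaries.
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]
```

With these defaults each chunk is 3,200 bytes, so one second of audio becomes ten messages. Smaller chunks reduce the delay before the first partial result but increase per-message overhead.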

How to apply & reuse

To use this project, run the provided deployment scripts in sequence to provision a dedicated GPU instance in your AWS account. Once deployed, applications can connect to the FastAPI endpoints for file-based or streaming transcription, or directly to the Riva gRPC port for high-performance integration. It suits real-time captioning, live meeting transcription, and voice-controlled interfaces that require sub-second response times.
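
A streaming client receives a mix of partial and final results and usually needs to turn them into one running transcript. The following is a minimal sketch of that bookkeeping. The class name TranscriptAssembler and the JSON message shape (a "text" string and an "is_final" flag) are assumptions for illustration; check the FastAPI server's actual response format before relying on them.

```python
import json


class TranscriptAssembler:
    """Combine partial and final streaming updates into one transcript.

    Assumption: each message is a JSON object such as
    {"text": "...", "is_final": true}.
    """

    def __init__(self):
        self.final_segments = []  # segments the server marked as final
        self.partial = ""         # latest partial hypothesis, if any

    def feed(self, message):
        """Apply one JSON update and return the current transcript."""
        update = json.loads(message)
        text = update.get("text", "").strip()
        if update.get("is_final", False):
            if text:
                self.final_segments.append(text)
            self.partial = ""     # a final result replaces the partial
        else:
            self.partial = text   # a newer partial overwrites the old one
        return self.current()

    def current(self):
        """Return the final segments followed by the pending partial."""
        parts = list(self.final_segments)
        if self.partial:
            parts.append(self.partial)
        return " ".join(parts)
```

A caption display would call feed() for every message received from the WebSocket and redraw with the returned string, so partial text is replaced rather than appended as it changes.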

At a glance

Capabilities: Real-time streaming transcription, Batch file transcription, S3 event-driven processing, GPU-accelerated inference, Automated infrastructure deployment, Health monitoring and validation

Components: Deployment Scripts (Bash), FastAPI Application Server, Riva Speech Services Container, Mock RNN-T Server, AWS CLI Integration, Docker Compose Configuration

Tech: Python, Bash, FastAPI, NVIDIA Riva, gRPC, Docker, AWS EC2, AWS S3, WebSocket

Depends on: NVIDIA NGC Account, AWS Account with EC2/S3 permissions, AWS CLI, Git, SSH Key Pair

Integrates with: AWS S3, NVIDIA Riva gRPC API, WebSocket Clients, REST API Consumers

Patterns: Infrastructure as Code (Scripted), Microservices Architecture, Event-Driven Processing, Streaming Data Pipeline, Wrapper Pattern (FastAPI over gRPC)

Reuse tags: speech-recognition, nvidia-riva, aws-deployment, real-time-transcription, gpu-acceleration, fastapi, websocket-streaming

⚠ Needs attention

unmerged_branch: dependabot/pip/docker/fast-api-rnnt/pip-14c377a4fb is 1 commit ahead of the default branch
open_pr: PR #1: Bump python-multipart from 0.0.6 to 0.0.27 in /docker/fast-api-rnnt in the pip group across 1 directory