NVIDIA RNN-T Production Transcription System

Automated deployment of a GPU-accelerated SpeechBrain Conformer RNN-T speech recognition server with WebSocket streaming and S3 integration.

https://github.com/davidbmar/nvidia-rnn-t-riva-nonmock-really-transcribe-  ·  public  ·  shipped

What it is

A production-ready deployment framework, written as shell scripts, for NVIDIA RNN-T (Recurrent Neural Network Transducer) models. It provisions AWS GPU instances, installs dependencies, and launches a FastAPI server that transcribes audio in real time over HTTP or WebSocket. The project targets lower latency and higher throughput than standard Whisper implementations.

Quickstart

git clone https://github.com/davidbmar/nvidia-rnn-t-riva-nonmock-really-transcribe.git
cd nvidia-rnn-t-riva-nonmock-really-transcribe
./scripts/run-all-steps.sh

Architecture

flowchart TD
    Client[Client App] -->|HTTP POST / WebSocket| API[FastAPI Server :8000]
    API -->|Load/Query| Model[SpeechBrain RNN-T Model]
    Model -->|CUDA Inference| GPU[NVIDIA GPU T4/V100]
    API -->|Fetch Audio| S3[(AWS S3 Bucket)]
    API -->|JSON Response| Client
    subgraph Deployment
        API
        Model
        GPU
    end

How it's built

Bash scripts handle infrastructure provisioning (AWS EC2 g4dn instances) and environment setup. The core application is a Python FastAPI server that wraps the SpeechBrain `EncoderDecoderASR` model. It uses PyTorch for GPU inference, Boto3 to retrieve audio from S3, and WebSockets for low-latency streaming. Client examples are provided in Python and Node.js.
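The wrapping pattern described above can be sketched without a GPU. This is a minimal illustration, not the repository's code: `SpeechModel`, `TranscriptionResult`, `transcribe`, and `EchoModel` are hypothetical names, and the response fields are an assumed shape. It relies only on the fact that SpeechBrain's `EncoderDecoderASR` exposes `transcribe_file(path)`; a real deployment would load the model with `EncoderDecoderASR.from_hparams(...)` and pass it in place of the stand-in.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Protocol


class SpeechModel(Protocol):
    """Any object with transcribe_file(path) -> str, as EncoderDecoderASR provides."""

    def transcribe_file(self, path: str) -> str: ...


@dataclass
class TranscriptionResult:
    # Hypothetical response shape; the real server's JSON fields may differ.
    text: str
    source: str
    processing_seconds: float


def transcribe(model: SpeechModel, audio_path: str) -> TranscriptionResult:
    """Run one audio file through the model and record how long it took."""
    started = time.perf_counter()
    text = model.transcribe_file(audio_path)
    elapsed = time.perf_counter() - started
    return TranscriptionResult(
        text=text.strip(),
        source=audio_path,
        processing_seconds=round(elapsed, 4),
    )


class EchoModel:
    """Stand-in model so the sketch runs without SpeechBrain or CUDA."""

    def transcribe_file(self, path: str) -> str:
        return f" transcript of {path} "


if __name__ == "__main__":
    result = transcribe(EchoModel(), "audio.wav")
    print(json.dumps(asdict(result)))
```

Keeping the model behind a small protocol means the HTTP layer can be tested with a stand-in and the GPU-backed model swapped in at startup.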

How it runs

sequenceDiagram
    participant C as Client
    participant S as FastAPI Server
    participant M as RNN-T Model
    participant G as GPU
    participant B as AWS S3

    alt File Transcription
        C->>S: POST /transcribe/file (audio.wav)
        S->>B: GetObject(audio.wav)
        B-->>S: Audio Data
        S->>M: Encode & Decode Audio
        M->>G: CUDA Inference
        G-->>M: Logits/Transcript
        M-->>S: Text + Timestamps
        S-->>C: JSON Response
    else Streaming Transcription
        C->>S: WebSocket Connect
        S-->>C: Connection Established
        loop Audio Chunks
            C->>S: Binary Audio Chunk
            S->>M: Process Chunk
            M->>G: Incremental Inference
            G-->>M: Partial Transcript
            M-->>S: Update State
            S-->>C: Partial JSON Result
        end
        C->>S: Close Connection
    end
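
The streaming branch of the diagram can be approximated in plain Python. The sketch below is an assumption-laden illustration: the 16 kHz 16-bit PCM format, the 100 ms chunk size, the message fields, and the names `chunk_audio` and `StreamingSession` are all hypothetical, and no WebSocket library is involved. For simplicity it re-decodes the whole buffer on every chunk, whereas a real RNN-T server would carry encoder and decoder state between chunks.

```python
from typing import Callable, Iterator

SAMPLE_RATE = 16_000      # assumed: 16 kHz mono input
BYTES_PER_SAMPLE = 2      # assumed: 16-bit PCM
BYTES_PER_SECOND = SAMPLE_RATE * BYTES_PER_SAMPLE


def chunk_audio(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Split raw PCM into fixed-duration chunks, as a client would before sending."""
    size = BYTES_PER_SECOND * chunk_ms // 1000
    if size <= 0:
        raise ValueError("chunk_ms is too small to hold any audio")
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


class StreamingSession:
    """Per-connection state: audio accumulates, a partial result goes back each time."""

    def __init__(self, decode: Callable[[bytes], str]) -> None:
        self._decode = decode          # stand-in for the model call
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> dict:
        """Handle one binary chunk and return the partial result to send back."""
        self._buffer.extend(chunk)
        return self._message("partial")

    def close(self) -> dict:
        """Produce the last message when the client closes the connection."""
        return self._message("final")

    def _message(self, kind: str) -> dict:
        return {
            "type": kind,
            "text": self._decode(bytes(self._buffer)),
            "audio_seconds": round(len(self._buffer) / BYTES_PER_SECOND, 3),
        }


if __name__ == "__main__":
    session = StreamingSession(lambda pcm: f"<{len(pcm)} bytes decoded>")
    for chunk in chunk_audio(bytes(BYTES_PER_SECOND // 2)):
        print(session.feed(chunk))
    print(session.close())
```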

How to apply & reuse

Use this project to deploy a high-performance, low-latency speech-to-text service on AWS. It suits applications that need word-level timestamps, real-time streaming transcription, or bulk processing of audio files stored in S3. The model has a small GPU memory footprint (~2GB VRAM).
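
For the bulk S3 use case, a caller typically starts from a list of object locations. The helper below is a small sketch of that preparation step; `parse_s3_uri` and `group_by_bucket` are hypothetical names, and the `s3://bucket/key` input format is an assumption about how a caller might list files. Each resulting pair maps onto the `GetObject` call in the sequence diagram, which in Boto3 is `s3.get_object(Bucket=bucket, Key=key)`.

```python
from collections import defaultdict
from typing import Iterable


def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI needs both a bucket and a key: {uri!r}")
    return bucket, key


def group_by_bucket(uris: Iterable[str]) -> dict[str, list[str]]:
    """Group object keys by bucket so each bucket can be processed in one pass."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for uri in uris:
        bucket, key = parse_s3_uri(uri)
        grouped[bucket].append(key)
    return dict(grouped)


if __name__ == "__main__":
    # Bucket and key names here are made up for illustration.
    batch = ["s3://calls/2024/a.wav", "s3://calls/2024/b.wav", "s3://archive/c.wav"]
    for bucket, keys in group_by_bucket(batch).items():
        print(bucket, keys)
```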

At a glance

Capabilities: Speech-to-Text Transcription, Real-time Audio Streaming, AWS S3 Integration, GPU Acceleration, Word-Level Timing, Health Monitoring
Components: FastAPI Server, SpeechBrain EncoderDecoderASR, Deployment Scripts, WebSocket Handler, S3 Client, PyTorch CUDA Backend
Tech: Python, Bash, FastAPI, PyTorch, SpeechBrain, WebSockets, AWS EC2, AWS S3
Depends on: AWS Account, NVIDIA GPU Instance (g4dn.xlarge+), Python 3.10+, CUDA Toolkit, Boto3, Uvicorn
Integrates with: AWS S3, AWS EC2, Node.js Clients, Python Clients, Web Browsers
Patterns: REST API, WebSocket Streaming, Infrastructure as Code (Shell), Model-as-a-Service, Async Processing
Reuse tags: speech-recognition, gpu-deployment, aws-automation, real-time-transcription, rnnt, fastapi

Repo hygiene

✓ all on main — nothing unmerged.