Automated deployment of a GPU-accelerated SpeechBrain Conformer RNN-T speech recognition server with WebSocket streaming and S3 integration.
https://github.com/davidbmar/nvidia-rnn-t-riva-nonmock-really-transcribe- · public · shipped
A production-ready, shell-scripted deployment framework for RNN-T (Recurrent Neural Network Transducer) speech recognition models running on NVIDIA GPUs. It provisions AWS GPU instances, installs dependencies, and launches a FastAPI server that transcribes audio in real time over HTTP or WebSocket. The project presents this setup as offering lower latency and higher throughput than standard Whisper implementations.
git clone https://github.com/davidbmar/nvidia-rnn-t-riva-nonmock-really-transcribe.git
cd nvidia-rnn-t-riva-nonmock-really-transcribe
./scripts/run-all-steps.sh
flowchart TD
Client[Client App] -->|HTTP POST / WebSocket| API[FastAPI Server :8000]
API -->|Load/Query| Model[SpeechBrain RNN-T Model]
Model -->|CUDA Inference| GPU[NVIDIA GPU T4/V100]
API -->|Fetch Audio| S3[(AWS S3 Bucket)]
API -->|JSON Response| Client
subgraph Deployment
API
Model
GPU
end
Bash scripts handle infrastructure provisioning (AWS EC2 g4dn instances) and environment setup. The core application is a Python FastAPI server that wraps the SpeechBrain `EncoderDecoderASR` model. The server uses PyTorch for GPU inference, Boto3 to retrieve audio from S3, and WebSockets for low-latency streaming. Client examples are provided in Python and Node.js.
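The S3 retrieval step can be illustrated with a minimal sketch. It assumes audio is addressed by an `s3://bucket/key` URI, which the project description does not confirm; the helper names `split_s3_uri` and `fetch_audio_bytes` are hypothetical and are not taken from the repository. The only Boto3 call used is the standard `get_object` on an S3 client.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/path/to/key.wav' into (bucket, key).

    Hypothetical helper: the URI convention is an assumption, not the
    project's documented input format.
    """
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs both a bucket and a key: {uri!r}")
    return bucket, key


def fetch_audio_bytes(uri: str) -> bytes:
    """Download the object behind an S3 URI into memory."""
    # Imported lazily so split_s3_uri can be used without boto3 installed.
    import boto3

    bucket, key = split_s3_uri(uri)
    response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    # "Body" is a streaming object; read() pulls the whole file into memory.
    return response["Body"].read()
```

Reading the whole object into memory keeps the sketch short. For long recordings, streaming the body to a temporary file before inference may be a better fit.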
sequenceDiagram
participant C as Client
participant S as FastAPI Server
participant M as RNN-T Model
participant G as GPU
participant B as AWS S3
alt File Transcription
C->>S: POST /transcribe/file (audio.wav)
S->>B: GetObject(audio.wav)
B-->>S: Audio Data
S->>M: Encode & Decode Audio
M->>G: CUDA Inference
G-->>M: Logits/Transcript
M-->>S: Text + Timestamps
S-->>C: JSON Response
else Streaming Transcription
C->>S: WebSocket Connect
S-->>C: Connection Established
loop Audio Chunks
C->>S: Binary Audio Chunk
S->>M: Process Chunk
M->>G: Incremental Inference
G-->>M: Partial Transcript
M-->>S: Update State
S-->>C: Partial JSON Result
end
C->>S: Close Connection
end
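The streaming branch of the diagram above can be sketched from the client side. This is an illustration under stated assumptions, not the project's client: it assumes raw 16 kHz, 16-bit mono PCM input, 100 ms chunks, and exactly one JSON reply per chunk, none of which the description specifies. The function names are hypothetical. The `websockets` package is a third-party library, and only its `connect`, `send`, and `recv` calls are used.

```python
import json
from typing import Iterator


def iter_pcm_chunks(pcm: bytes, sample_rate: int = 16000,
                    sample_width: int = 2, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of raw mono PCM audio."""
    samples_per_chunk = sample_rate * chunk_ms // 1000
    if samples_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    # Multiply by the sample width last so chunks never split a sample.
    chunk_bytes = samples_per_chunk * sample_width
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


async def stream_pcm(url: str, pcm: bytes) -> list[dict]:
    """Send each chunk as a binary frame and collect the JSON replies.

    Assumes the server answers every chunk with one JSON message.
    """
    # Imported lazily so the chunker works without websockets installed.
    import websockets

    results = []
    async with websockets.connect(url) as ws:
        for chunk in iter_pcm_chunks(pcm):
            await ws.send(chunk)  # bytes are sent as a binary frame
            results.append(json.loads(await ws.recv()))
    return results
```

A call might look like `asyncio.run(stream_pcm("ws://localhost:8000/ws/transcribe", pcm))`, where port 8000 comes from the architecture diagram and the `/ws/transcribe` path is a placeholder for whatever route the server actually exposes.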
Use this project to deploy a high-performance, low-latency speech-to-text service on AWS. It suits applications that need word-level timestamps, real-time streaming transcription, or bulk processing of audio files stored in S3. The model has a small GPU memory footprint (about 2 GB of VRAM).
✓ all on main — nothing unmerged.