NVIDIA Parakeet Riva ASR Deployment System

What it is

A complete infrastructure-as-code and client-wrapper system for deploying NVIDIA's Parakeet Recurrent Neural Network Transducer (RNNT) model. It supports both modern NIM containers and traditional Riva servers, providing low-latency real-time transcription via WebSocket with word-level timestamps and confidence scores.

Features

Real-time streaming transcription with 100-300ms partial result latency
Dual deployment support for NVIDIA NIM containers and traditional Riva servers
Word-level timestamps and confidence scores for precise alignment
Automated AWS EC2 GPU instance provisioning and driver installation
Production-grade structured logging and error tracking framework
WebSocket API with Python and Node.js client examples
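
To make the streaming feature concrete, here is a minimal sketch of how a client might slice raw audio into fixed-duration chunks before sending them over the WebSocket. The 16 kHz, 16-bit mono PCM format and the 100 ms chunk size are assumptions for illustration, not values confirmed by this repository, and `chunk_pcm` is a hypothetical helper.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # assumed sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive chunks of raw PCM, each covering chunk_ms of audio.

    The last chunk may be shorter than the others.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


if __name__ == "__main__":
    one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)  # 1 s of silence
    chunks = list(chunk_pcm(one_second))
    print(len(chunks), len(chunks[0]))
```

Smaller chunks generally reduce the delay before the first partial result but increase per-message overhead, so the chunk size is worth tuning against the target latency.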

Quickstart

git clone https://github.com/davidbmar/nvidia-parakeet.git
cd nvidia-parakeet
./scripts/riva-010-run-complete-deployment-pipeline.sh

Architecture

flowchart TD
    ClientApps[Client Apps Browser or App] -->|WebSocket Audio Stream| WebSocketServer[WebSocket Server Port 8443]
    WebSocketServer -->|gRPC Client Wrapper| RivaASR[NVIDIA Riva ASR NIM or Traditional]
    RivaASR -->|GPU Inference| ParakeetRNNT[Parakeet RNNT Model]
    WebSocketServer -->|Logs| StructuredLogging[Structured Logging and Monitoring]
    RivaASR -->|Health Checks| HealthMonitor[Health Monitor]

How it's built

Built with Python 3.10+ using FastAPI for the backend server, Pydantic for configuration management, and websockets for real-time audio streaming. It integrates NVIDIA Riva ASR gRPC services or SpeechBrain Conformer models, orchestrated via Bash scripts for AWS EC2 GPU instance provisioning, Docker container management, and security group configuration.
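
As an illustration of the configuration layer described above, the sketch below loads server settings from environment variables. It uses a standard-library dataclass as a stand-in so that it runs without dependencies; the project itself is described as using Pydantic. The variable names (`RIVA_HOST`, `RIVA_PORT`, `WS_PORT`, `DEPLOYMENT_MODE`) and the defaults are hypothetical, except port 8443, which comes from the architecture diagram.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_MODES = ("nim", "traditional")


@dataclass(frozen=True)
class ServerSettings:
    riva_host: str = "localhost"
    riva_port: int = 50051       # assumed gRPC port
    ws_port: int = 8443          # WebSocket port from the architecture diagram
    deployment_mode: str = "nim"

    def __post_init__(self) -> None:
        # Reject unknown deployment modes and out-of-range ports early.
        if self.deployment_mode not in VALID_MODES:
            raise ValueError(f"deployment_mode must be one of {VALID_MODES}")
        for name in ("riva_port", "ws_port"):
            port = getattr(self, name)
            if not 1 <= port <= 65535:
                raise ValueError(f"{name} out of range: {port}")

    @property
    def riva_target(self) -> str:
        """host:port string in the form a gRPC channel expects."""
        return f"{self.riva_host}:{self.riva_port}"


def load_settings(env: Optional[Mapping[str, str]] = None) -> ServerSettings:
    """Build settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return ServerSettings(
        riva_host=env.get("RIVA_HOST", "localhost"),
        riva_port=int(env.get("RIVA_PORT", "50051")),
        ws_port=int(env.get("WS_PORT", "8443")),
        deployment_mode=env.get("DEPLOYMENT_MODE", "nim").lower(),
    )
```

Passing the environment as a mapping keeps the loader easy to test, for example `load_settings({"DEPLOYMENT_MODE": "traditional"})`.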

How it runs

sequenceDiagram
    participant Client as Client Application
    participant WS as WebSocket Server
    participant RivaClient as RivaASRClient Wrapper
    participant Riva as NVIDIA Riva ASR Service
    participant GPU as GPU Worker
    Client->>WS: Connect WebSocket
    WS->>Client: Connection Established
    Client->>WS: Send Audio Chunk
    WS->>RivaClient: Forward Audio Data
    RivaClient->>Riva: gRPC Streaming Request
    Riva->>GPU: Execute Inference
    GPU-->>Riva: Partial Results
    Riva-->>RivaClient: Partial Transcription
    RivaClient-->>WS: Return Partial Result
    WS-->>Client: Emit Partial Text
    GPU-->>Riva: Final Results
    Riva-->>RivaClient: Final Transcription with Timestamps
    RivaClient-->>WS: Return Final Result
    WS-->>Client: Emit Final Text
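
The message flow above can be sketched with plain asyncio. Everything here is a stand-in: `fake_riva_stream` simulates the gRPC streaming response, `audio_source` simulates the client, `relay` plays the WebSocket server's role, and the `{"type": ..., "text": ...}` message shape is a hypothetical format rather than the project's actual wire protocol.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_riva_stream(chunks: AsyncIterator[bytes]) -> AsyncIterator[Dict]:
    """Stand-in for the Riva gRPC stream: one partial per chunk, then a final."""
    words: List[str] = []
    async for _chunk in chunks:
        words.append(f"word{len(words) + 1}")  # pretend each chunk decodes to one word
        yield {"type": "partial", "text": " ".join(words)}
    yield {"type": "final", "text": " ".join(words)}


async def audio_source(n_chunks: int) -> AsyncIterator[bytes]:
    """Stand-in for the client sending audio chunks over the WebSocket."""
    for _ in range(n_chunks):
        await asyncio.sleep(0)  # yield control, as a real socket read would
        yield b"\x00" * 3200


async def relay(n_chunks: int) -> List[Dict]:
    """Forward audio to the recognizer and collect what would be sent to the client."""
    emitted: List[Dict] = []
    async for result in fake_riva_stream(audio_source(n_chunks)):
        emitted.append(result)  # a real server would send this over the WebSocket
    return emitted


if __name__ == "__main__":
    for message in asyncio.run(relay(3)):
        print(message)
```

The point of the sketch is the ordering: partial results are emitted while audio is still arriving, and the final result follows once the audio stream ends.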

How to apply & reuse

Use this system to deploy a scalable, GPU-accelerated speech-to-text service on AWS. It suits applications that need low-latency transcription (partial results in roughly 100-300 ms), such as live captioning, voice assistants, or meeting transcription tools, where word-level timing and high throughput matter.
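
For the live-captioning use case, word-level timestamps can be grouped into caption cues. The sketch below assumes each word arrives as a record with `start` and `end` times in seconds and a `confidence` between 0 and 1. These field names, the `Word` type, and both helper functions are hypothetical, not the project's result schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    text: str
    start: float       # seconds from stream start (assumed unit)
    end: float
    confidence: float  # assumed range 0.0 to 1.0


def group_into_cues(words: List[Word], max_gap: float = 0.6,
                    max_words: int = 8) -> List[Tuple[float, float, str]]:
    """Group words into (start, end, text) cues.

    A new cue starts after a pause longer than max_gap seconds
    or once the current cue already holds max_words words.
    """
    cues: List[Tuple[float, float, str]] = []
    current: List[Word] = []
    for word in words:
        if current and (word.start - current[-1].end > max_gap
                        or len(current) >= max_words):
            cues.append((current[0].start, current[-1].end,
                         " ".join(w.text for w in current)))
            current = []
        current.append(word)
    if current:  # flush the last open cue
        cues.append((current[0].start, current[-1].end,
                     " ".join(w.text for w in current)))
    return cues


def flag_uncertain(words: List[Word], threshold: float = 0.5) -> List[str]:
    """Return the text of words whose confidence falls below threshold."""
    return [w.text for w in words if w.confidence < threshold]
```

A caption renderer could display each cue between its start and end times and, for instance, style the words returned by `flag_uncertain` differently.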

At a glance

Capabilities: Streaming ASR, Batch Transcription, AWS Automation, GPU Acceleration, Real-time Logging

Components: RivaASRClient, WebSocket Server, NIM Container, Traditional Riva Server, Deployment Scripts, Logging Framework

Tech: Python, FastAPI, WebSockets, Docker, NVIDIA Riva, Pydantic, Bash

Depends on: AWS EC2, NVIDIA GPU Drivers, Docker Engine, NVIDIA Container Toolkit, Python 3.10+

Integrates with: NVIDIA NIM, NVIDIA Riva, AWS S3, SpeechBrain

Patterns: Client-Server, Streaming RPC, Infrastructure as Code, Observer Pattern

Reuse tags: asr, speech-to-text, nvidia-riva, websocket, aws-deployment, gpu-inference

⚠ Needs attention

unmerged_branch: dependabot/pip/pip-6040ed8665 is 1 commit ahead of the default branch
open_pr: PR #7: Bump the pip group across 1 directory with 11 updates