NVIDIA Parakeet Riva ASR Deployment System

Production-ready deployment pipeline for NVIDIA Parakeet RNNT speech recognition via Riva ASR with WebSocket streaming.

https://github.com/davidbmar/nvidia-parakeet-ver-5  ·  public  ·  shipped

What it is

An infrastructure project for deploying NVIDIA's Parakeet RNNT automatic speech recognition (ASR) model on either NVIDIA NIM containers or a traditional Riva server. A Python WebSocket server and client wrapper provide real-time, low-latency audio transcription with word-level timestamps and confidence scores.

Quickstart

git clone https://github.com/davidbmar/nvidia-parakeet-ver-5.git
cd nvidia-parakeet-ver-5
./scripts/riva-010-run-complete-deployment-pipeline.sh

Architecture

flowchart TD
    Client["Client Apps<br/>(Browser/Node.js/Python)"] -->|"WebSocket Audio Stream<br/>Port 8443"| WS["WebSocket Server<br/>(FastAPI + RivaASRClient)"]
    WS -->|"gRPC Call<br/>Port 50051"| Riva["NVIDIA Riva ASR<br/>(NIM or Traditional)<br/>Parakeet RNNT Model"]
    Riva -->|"GPU Inference<br/>(Tesla T4/V100)"| GPU["GPU Worker"]
    WS -->|"Structured Logs"| Logs["Logging & Monitoring<br/>(./logs/)"]
    subgraph Deployment
        WS
        Riva
        GPU
    end
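
The ports and log directory shown in the diagram could be collected in one settings object. The sketch below is illustrative only: it uses a standard-library dataclass rather than the project's Pydantic-based `config/settings.py`, and the class name, field names, and environment variable names (`WS_PORT`, `RIVA_HOST`, `RIVA_GRPC_PORT`, `LOG_DIR`) are assumptions, not taken from the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServiceSettings:
    """Hypothetical settings holder; defaults mirror the architecture diagram."""

    ws_port: int = 8443           # WebSocket server port (from the diagram)
    riva_host: str = "localhost"  # assumed default host
    riva_grpc_port: int = 50051   # Riva gRPC port (from the diagram)
    log_dir: str = "./logs"       # structured log directory (from the diagram)

    @property
    def riva_target(self) -> str:
        """host:port string in the form a gRPC channel expects."""
        return f"{self.riva_host}:{self.riva_grpc_port}"


def load_settings(env: Optional[Mapping[str, str]] = None) -> ServiceSettings:
    """Build settings from environment variables, falling back to the defaults."""
    env = os.environ if env is None else env
    return ServiceSettings(
        ws_port=int(env.get("WS_PORT", 8443)),
        riva_host=env.get("RIVA_HOST", "localhost"),
        riva_grpc_port=int(env.get("RIVA_GRPC_PORT", 50051)),
        log_dir=env.get("LOG_DIR", "./logs"),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests, for example `load_settings({"RIVA_HOST": "10.0.0.5"}).riva_target`.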

How it's built

The system is built on Python 3.10+ and uses FastAPI for the WebSocket server interface. It integrates with NVIDIA Riva ASR via gRPC (port 50051) or HTTP (port 8000) using a custom `RivaASRClient` wrapper that supports both real GPU-accelerated inference and mock modes. Deployment is managed through Bash scripts targeting AWS EC2 GPU instances (g4dn.xlarge), handling driver installation, Docker/NVIDIA Container Runtime setup, and security group configuration.
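
To illustrate how a client wrapper can switch between real inference and a mock mode, here is a minimal sketch of the Mock Object pattern. It is not the project's `RivaASRClient`: the `ASRBackend` protocol, `MockBackend`, `Transcript`, `Word`, and `make_backend` names are hypothetical, and the real backend is left as a stub because its gRPC calls are not shown in this summary.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class Word:
    text: str
    start_s: float      # word start time in seconds
    end_s: float        # word end time in seconds
    confidence: float


@dataclass
class Transcript:
    text: str
    is_final: bool
    words: List[Word] = field(default_factory=list)


class ASRBackend(Protocol):
    """Interface shared by the real and mock backends (hypothetical)."""

    def transcribe(self, pcm16: bytes, sample_rate: int) -> Transcript: ...


class MockBackend:
    """Returns a canned phrase spread evenly over the audio duration.

    Needs no GPU and no Riva server, so server logic can be tested locally.
    """

    def __init__(self, phrase: str = "hello world") -> None:
        self.phrase = phrase

    def transcribe(self, pcm16: bytes, sample_rate: int) -> Transcript:
        duration = len(pcm16) / 2 / sample_rate  # mono PCM16: 2 bytes per sample
        tokens = self.phrase.split()
        step = duration / len(tokens) if tokens else 0.0
        words = [
            Word(tok, i * step, (i + 1) * step, 1.0)
            for i, tok in enumerate(tokens)
        ]
        return Transcript(" ".join(tokens), True, words)


def make_backend(mock: bool) -> ASRBackend:
    """Pick a backend; the real one would wrap the gRPC connection to Riva."""
    if mock:
        return MockBackend()
    raise NotImplementedError("real backend (gRPC to Riva on port 50051) not sketched here")
```

Because both backends satisfy the same interface, the WebSocket layer does not need to know which one it is talking to.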

How it runs

sequenceDiagram
    participant C as Client App
    participant W as WebSocket Server
    participant RC as RivaASRClient
    participant R as NVIDIA Riva ASR
    
    C->>W: Connect WebSocket (ws://.../ws/transcribe)
    W->>C: Connection Established
    loop Streaming Audio
        C->>W: Send Audio Chunk (PCM16)
        W->>RC: Process Audio Chunk
        RC->>R: gRPC StreamingRecognize Request
        R-->>RC: Partial Result (Interim)
        RC-->>W: Return Partial Transcription
        W-->>C: Send Partial JSON Result
    end
    C->>W: End Stream / Silence Detected
    W->>RC: Finalize Request
    RC->>R: Get Final Result
    R-->>RC: Final Transcription + Timestamps
    RC-->>W: Return Final Result
    W-->>C: Send Final JSON Result
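
The client side of the sequence above can be sketched with two small helpers: one that slices PCM16 audio into fixed-duration chunks for sending, and one that routes incoming JSON results into partial and final lists. The 16 kHz sample rate, the 100 ms chunk size, and the `"type"` / `"text"` message fields are assumptions for illustration; the server's actual JSON schema is not given here.

```python
import json
from typing import Iterator, List


def chunk_pcm16(audio: bytes, sample_rate: int = 16000, chunk_ms: int = 100) -> Iterator[bytes]:
    """Split mono PCM16 audio into fixed-duration chunks (the last may be shorter)."""
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000  # 2 bytes per sample
    for offset in range(0, len(audio), bytes_per_chunk):
        yield audio[offset:offset + bytes_per_chunk]


def handle_server_message(raw: str, partials: List[str], finals: List[str]) -> None:
    """Route one JSON message by its assumed 'type' field ('partial' or 'final')."""
    msg = json.loads(raw)
    if msg.get("type") == "partial":
        partials.append(msg["text"])
    elif msg.get("type") == "final":
        finals.append(msg["text"])
    # Any other message type is ignored in this sketch.
```

In a real client, each chunk from `chunk_pcm16` would be sent as a binary WebSocket frame, and every text frame received would be passed to `handle_server_message`.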

How to apply & reuse

Use this system to deploy a scalable, low-latency speech-to-text service for applications requiring real-time transcription, such as live captioning, voice assistants, or meeting recorders. It is suitable for environments where GPU acceleration (Tesla T4/V100) is available and sub-second latency is critical.

At a glance

Capabilities: Streaming Speech Recognition, Batch File Transcription, S3 Audio Processing, Mock Mode Testing, Health Check Monitoring, AWS EC2 Deployment Automation
Components: src/asr/riva_client.py, docker/rnnt-server-websocket.py, docker/rnnt-server.py, config/settings.py, scripts/riva-010-run-complete-deployment-pipeline.sh, examples/python-client.py, examples/nodejs-client.js
Tech: Python 3.10+, FastAPI, WebSockets, gRPC, NVIDIA Riva ASR, Docker, NVIDIA Container Toolkit, Pydantic, SpeechBrain, Boto3
Depends on: AWS Account (EC2/S3), GPU Instance (g4dn.xlarge or better), NVIDIA Drivers, Docker Engine, NVIDIA Container Runtime
Integrates with: NVIDIA NIM Containers, Traditional Riva Server, AWS S3 Storage, Browser Clients (via HTTPS/WSS), Node.js Applications, Python Applications
Patterns: Client-Server Architecture, WebSocket Streaming, gRPC Communication, Containerized Deployment, Infrastructure as Code (Bash Scripts), Mock Object Pattern
Reuse tags: ASR, Speech-to-Text, NVIDIA, Riva, Parakeet, Real-time, WebSocket, GPU-Accelerated, AWS, Deployment-Pipeline
