Production-ready deployment pipeline for NVIDIA Parakeet RNNT speech recognition via Riva ASR with WebSocket streaming.
https://github.com/davidbmar/nvidia-parakeet-ver-5 · public · shipped
This infrastructure project deploys NVIDIA's Parakeet RNNT automatic speech recognition (ASR) model using either NVIDIA NIM containers or traditional Riva servers. It provides a Python-based WebSocket server and client wrapper for real-time, low-latency audio transcription with word-level timestamps and confidence scores.
git clone https://github.com/davidbmar/nvidia-parakeet.git nvidia-parakeet-3
cd nvidia-parakeet-3
./scripts/riva-010-run-complete-deployment-pipeline.sh
flowchart TD
Client["Client Apps<br/>(Browser/Node.js/Python)"] -->|"WebSocket Audio Stream<br/>Port 8443"| WS["WebSocket Server<br/>(FastAPI + RivaASRClient)"]
WS -->|"gRPC Call<br/>Port 50051"| Riva["NVIDIA Riva ASR<br/>(NIM or Traditional)<br/>Parakeet RNNT Model"]
Riva -->|"GPU Inference<br/>(Tesla T4/V100)"| GPU["GPU Worker"]
WS -->|"Structured Logs"| Logs["Logging & Monitoring<br/>(./logs/)"]
subgraph Deployment
WS
Riva
GPU
end
The system is built on Python 3.10+ and uses FastAPI for the WebSocket server interface. It integrates with NVIDIA Riva ASR via gRPC (port 50051) or HTTP (port 8000) using a custom `RivaASRClient` wrapper that supports both real GPU-accelerated inference and mock modes. Deployment is managed through Bash scripts targeting AWS EC2 GPU instances (g4dn.xlarge), handling driver installation, Docker/NVIDIA Container Runtime setup, and security group configuration.
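The mock mode described above can be pictured with a minimal sketch. It is not the repository's `RivaASRClient`: the names `MockASRClient`, `Result`, `Word`, `process_chunk`, `finalize`, and `run_session` are hypothetical, and the sketch assumes 16 kHz mono PCM16 input. It reveals one scripted word per audio chunk as an interim result, then spreads the words evenly over the received audio duration to produce word-level timestamps with a placeholder confidence of 1.0.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Word:
    text: str
    start: float       # seconds from the start of the stream
    end: float
    confidence: float


@dataclass
class Result:
    text: str
    is_final: bool
    words: List[Word] = field(default_factory=list)


class MockASRClient:
    """Hypothetical stand-in for a Riva-backed client; needs no GPU or network."""

    def __init__(self, script: Sequence[str], sample_rate: int = 16000):
        self.script = list(script)      # words to "recognize", one per chunk
        self.sample_rate = sample_rate  # assumed 16 kHz mono PCM16
        self._bytes_seen = 0
        self._chunks_seen = 0

    async def process_chunk(self, chunk: bytes) -> Result:
        # Each chunk reveals one more scripted word as an interim result.
        self._bytes_seen += len(chunk)
        self._chunks_seen += 1
        words = self.script[: self._chunks_seen]
        return Result(text=" ".join(words), is_final=False)

    async def finalize(self) -> Result:
        # PCM16 uses 2 bytes per sample, so duration = bytes / (2 * rate).
        duration = self._bytes_seen / (2 * self.sample_rate)
        spoken = self.script[: self._chunks_seen]
        per_word = duration / len(spoken) if spoken else 0.0
        words = [
            Word(w, i * per_word, (i + 1) * per_word, confidence=1.0)
            for i, w in enumerate(spoken)
        ]
        return Result(text=" ".join(spoken), is_final=True, words=words)


async def run_session(
    client: MockASRClient, chunks: Sequence[bytes]
) -> Tuple[List[Result], Result]:
    """Feed every chunk to the client, then request the final result."""
    partials = [await client.process_chunk(c) for c in chunks]
    final = await client.finalize()
    return partials, final
```

A session can be driven with `asyncio.run(run_session(MockASRClient(["hello", "world"]), chunks))`, where `chunks` is a list of PCM16 byte strings. Keeping the mock behind the same two coroutines a real client would expose lets the WebSocket layer be exercised without a GPU instance.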
sequenceDiagram
participant C as Client App
participant W as WebSocket Server
participant RC as RivaASRClient
participant R as NVIDIA Riva ASR
C->>W: Connect WebSocket (ws://.../ws/transcribe)
W->>C: Connection Established
loop Streaming Audio
C->>W: Send Audio Chunk (PCM16)
W->>RC: Process Audio Chunk
RC->>R: gRPC StreamingRecognize Request
R-->>RC: Partial Result (Interim)
RC-->>W: Return Partial Transcription
W-->>C: Send Partial JSON Result
end
C->>W: End Stream / Silence Detected
W->>RC: Finalize Request
RC->>R: Get Final Result
R-->>RC: Final Transcription + Timestamps
RC-->>W: Return Final Result
W-->>C: Send Final JSON Result
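The client side of the exchange above can be sketched as two small steps: slicing PCM16 audio into fixed-duration chunks, and building the partial and final JSON results. The 16 kHz sample rate, the 100 ms chunk size, the helper names, and the JSON field names (`type`, `text`, `words`) are all assumptions made for illustration; the actual message schema of the server is not given here.

```python
import json
import math
import struct
from typing import Iterator, List, Optional

SAMPLE_RATE = 16000   # assumed sample rate in Hz
BYTES_PER_SAMPLE = 2  # PCM16: one signed 16-bit integer per sample


def sine_pcm16(seconds: float, freq: float = 440.0, rate: int = SAMPLE_RATE) -> bytes:
    """Generate a little-endian PCM16 test tone (stand-in for microphone audio)."""
    n = int(seconds * rate)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate)) for i in range(n)
    )
    return struct.pack(f"<{n}h", *samples)


def chunk_pcm16(audio: bytes, chunk_ms: int = 100, rate: int = SAMPLE_RATE) -> Iterator[bytes]:
    """Yield consecutive chunks of about chunk_ms milliseconds; the last may be shorter."""
    step = rate * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for offset in range(0, len(audio), step):
        yield audio[offset : offset + step]


def result_message(text: str, is_final: bool, words: Optional[List[dict]] = None) -> str:
    """Serialize a transcription result; the field names are hypothetical."""
    message = {"type": "final" if is_final else "partial", "text": text}
    if is_final:
        # Only final results carry word-level timestamps and confidences.
        message["words"] = words or []
    return json.dumps(message)
```

With these assumptions, each 100 ms chunk is 3200 bytes (1600 samples at 2 bytes each), so the chunk size sets a lower bound on how soon the first interim result can arrive.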
Use this system to deploy a scalable, low-latency speech-to-text service for applications requiring real-time transcription, such as live captioning, voice assistants, or meeting recorders. It is suitable for environments where GPU acceleration (Tesla T4/V100) is available and sub-second latency is critical.