Production-ready deployment system for NVIDIA Parakeet RNNT ASR via Riva with WebSocket streaming.
https://github.com/davidbmar/nvidia-parakeet · public · shipped
An infrastructure wrapper that deploys NVIDIA Riva ASR (using the Parakeet RNNT model) on GPU instances. It provides a Python client wrapper (`RivaASRClient`) and a WebSocket server for real-time, low-latency speech-to-text transcription with word-level timestamps and confidence scores. The system supports both NIM containers and traditional Riva server setups, and includes structured logging and mock modes for development.
git clone https://github.com/davidbmar/nvidia-parakeet.git
cd nvidia-parakeet
./scripts/riva-010-run-complete-deployment-pipeline.sh
flowchart TD
Client["Client Apps<br/>(Browser/Node/Python)"] -->|"WebSocket Audio Stream"| WS["WebSocket Server<br/>(Port 8443)<br/>FastAPI + RivaASRClient"]
WS -->|"gRPC Client Wrapper"| Riva["NVIDIA Riva ASR<br/>(NIM or Traditional)<br/>Parakeet RNNT Model"]
Riva -->|"GPU Inference"| GPU["Tesla T4/V100<br/>GPU Worker"]
WS -->|"Structured Logs"| Logs["Logging &<br/>Monitoring System"]
subgraph Deployment
WS
Riva
GPU
Logs
end
The system uses shell scripts for infrastructure provisioning (AWS EC2, driver installation, Docker/NVIDIA container runtime setup) and Python for the application layer. The core ASR logic relies on NVIDIA Riva gRPC services, wrapped by a custom `RivaASRClient` in `src/asr/riva_client.py`. A FastAPI-based WebSocket server (`rnnt-https-server.py` / `docker/rnnt-server-websocket.py`) bridges client audio streams to the Riva backend. Configuration is managed via Pydantic settings and environment variables.
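To illustrate environment-driven configuration, here is a minimal sketch that uses a standard-library dataclass as a stand-in for the project's Pydantic settings. The class name `AsrSettings`, its field names, and the `RIVA_HOST`, `RIVA_PORT`, `WS_PORT`, and `MOCK_MODE` variable names are hypothetical and not taken from the repository. Port 50051 is an assumed gRPC default; 8443 comes from the architecture diagram.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AsrSettings:
    """Hypothetical settings object; field names are illustrative only."""
    riva_host: str = "localhost"
    riva_port: int = 50051    # assumed gRPC port for the Riva backend
    ws_port: int = 8443       # WebSocket port shown in the architecture diagram
    mock_mode: bool = False   # run without a GPU backend during development


def load_settings(env=None) -> AsrSettings:
    """Build settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    defaults = AsrSettings()
    return AsrSettings(
        riva_host=env.get("RIVA_HOST", defaults.riva_host),
        riva_port=int(env.get("RIVA_PORT", defaults.riva_port)),
        ws_port=int(env.get("WS_PORT", defaults.ws_port)),
        # Treat "1", "true", or "yes" (any case) as enabling mock mode.
        mock_mode=str(env.get("MOCK_MODE", "")).lower() in {"1", "true", "yes"},
    )
```

Calling `load_settings()` with no argument reads from the process environment; passing a plain dict makes the function easy to exercise in tests without touching real environment variables.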
sequenceDiagram
participant C as Client App
participant W as WebSocket Server
participant RC as RivaASRClient
participant R as NVIDIA Riva ASR
C->>W: Connect WebSocket (ws://.../transcribe)
W->>C: Connection Established
loop Streaming Audio
C->>W: Send Audio Chunk (Binary)
W->>RC: Forward Audio Data
RC->>R: gRPC StreamingRecognize Request
R-->>RC: Partial Result (Interim)
RC-->>W: Return Partial Transcription
W-->>C: Send Partial JSON Result
end
C->>W: End Stream / Silence
RC->>R: Finalize Request
R-->>RC: Final Result (Confidence/Timestamps)
RC-->>W: Return Final Transcription
W-->>C: Send Final JSON Result
W->>C: Close WebSocket
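On the client side, the partial and final results in the sequence above could be handled roughly as follows. This is a sketch under assumptions: the JSON shape (`type` set to `"partial"` or `"final"`, plus a `text` field) and the `Transcript` class are hypothetical, and the server's actual message schema may differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Accumulates finalized text and tracks the latest interim hypothesis."""
    final_segments: list = field(default_factory=list)
    interim: str = ""

    def handle(self, raw: str) -> None:
        """Apply one JSON message received from the WebSocket server."""
        msg = json.loads(raw)
        kind = msg.get("type")  # assumed field name
        if kind == "partial":
            # Each interim result supersedes the previous one.
            self.interim = msg.get("text", "")
        elif kind == "final":
            # A final result commits the segment and clears the interim text.
            self.final_segments.append(msg.get("text", ""))
            self.interim = ""

    @property
    def text(self) -> str:
        """Finalized segments followed by the current interim text, if any."""
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

Replacing the interim text on every partial message, and committing only on a final message, keeps the displayed transcript stable while the recognizer is still revising its hypothesis.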
Use this project to deploy a GPU-accelerated speech recognition service on AWS or local hardware. It is suited to applications that need real-time transcription (e.g., live captioning, voice assistants), where low latency (partial results in roughly 100-300 ms) and high throughput are critical. Developers can integrate the provided Node.js or Python WebSocket clients into their frontend or backend services.
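Partial-result latency depends in part on how much audio the client sends per WebSocket message. The sketch below works through that arithmetic, assuming 16 kHz, 16-bit mono PCM; the audio format and the helper names `chunk_size_bytes` and `split_pcm` are assumptions for illustration, not taken from the repository.

```python
def chunk_size_bytes(chunk_ms: int, sample_rate_hz: int = 16000,
                     bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes of raw PCM covering chunk_ms milliseconds of audio."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels


def split_pcm(pcm: bytes, chunk_ms: int = 100, **fmt):
    """Yield consecutive chunks of raw PCM, each at most chunk_ms long."""
    size = chunk_size_bytes(chunk_ms, **fmt)
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


# One second of silence at the assumed format: 16000 samples * 2 bytes.
one_second = bytes(32000)
chunks = list(split_pcm(one_second, chunk_ms=100))
print(len(chunks), len(chunks[0]))  # 10 chunks of 3200 bytes each
```

Smaller chunks let the server start work sooner but add per-message overhead, so the right chunk duration is a trade-off to measure rather than a fixed value.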