nvidia-parakeet · davidbmar.com

What it is

An infrastructure wrapper that deploys NVIDIA Riva ASR (using the Parakeet RNNT model) on GPU instances. It provides a Python client wrapper (`RivaASRClient`) and a WebSocket server for real-time, low-latency speech-to-text transcription with word-level timestamps and confidence scores. It supports both NIM containers and traditional Riva server setups, and includes structured logging and a mock mode for development without a GPU.

Features

Real-time NVIDIA Parakeet RNNT transcription via Riva ASR
Dual deployment support: NIM containers and traditional Riva servers
WebSocket streaming API for low-latency partial and final results
Word-level timestamps and confidence scores
Comprehensive structured logging and monitoring framework
Mock mode fallback for development without GPU hardware
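
To make the timestamp, confidence, and mock-mode features concrete, here is a minimal sketch of what a mock final result could look like. The field names (`type`, `text`, `words`, `start`, `end`, `confidence`) and the `mock_transcribe` helper are assumptions for illustration, not the project's actual schema or API.

```python
from dataclasses import asdict, dataclass


@dataclass
class Word:
    """One recognized word with timing in seconds (hypothetical shape)."""
    word: str
    start: float
    end: float
    confidence: float


def mock_transcribe(text: str, seconds_per_word: float = 0.3) -> dict:
    """Build a fake final result with evenly spaced word timings.

    A stand-in for GPU inference: no audio is processed, the words of
    `text` are simply laid out one after another on the time axis.
    """
    words = [
        Word(
            word=token,
            start=round(i * seconds_per_word, 3),
            end=round((i + 1) * seconds_per_word, 3),
            confidence=1.0,  # a mock has no real confidence to report
        )
        for i, token in enumerate(text.split())
    ]
    return {"type": "final", "text": text, "words": [asdict(w) for w in words]}


result = mock_transcribe("hello world")
```

A mock like this lets client code be developed against a stable result shape before a GPU instance is available.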

Quickstart

git clone https://github.com/davidbmar/nvidia-parakeet.git
cd nvidia-parakeet
./scripts/riva-010-run-complete-deployment-pipeline.sh

Architecture

flowchart TD
    Client["Client Apps<br/>(Browser/Node/Python)"] -->|"WebSocket Audio Stream"| WS["WebSocket Server<br/>(Port 8443)<br/>FastAPI + RivaASRClient"]
    WS -->|"gRPC Client Wrapper"| Riva["NVIDIA Riva ASR<br/>(NIM or Traditional)<br/>Parakeet RNNT Model"]
    Riva -->|"GPU Inference"| GPU["Tesla T4/V100<br/>GPU Worker"]
    WS -->|"Structured Logs"| Logs["Logging &<br/>Monitoring System"]
    subgraph Deployment
        WS
        Riva
        GPU
        Logs
    end

How it's built

The system is built using Shell scripts for infrastructure provisioning (AWS EC2, driver installation, Docker/NVIDIA container runtime setup) and Python for the application layer. The core ASR logic relies on NVIDIA Riva gRPC services, wrapped by a custom `RivaASRClient` in `src/asr/riva_client.py`. A FastAPI-based WebSocket server (`rnnt-https-server.py` / `docker/rnnt-server-websocket.py`) bridges client audio streams to the Riva backend. Configuration is managed via Pydantic settings and environment variables.
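
The environment-driven configuration described above can be sketched as follows. To keep the example dependency-free, a standard-library dataclass stands in for the Pydantic settings class. The variable names (`RIVA_HOST`, `RIVA_PORT`, `WS_PORT`, `MOCK_MODE`) and the `Settings` fields are assumptions, not names taken from `config/settings.py`. Port 8443 comes from the architecture diagram; 50051 is assumed as the Riva gRPC port.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names are illustrative only."""
    riva_host: str
    riva_port: int
    ws_port: int
    mock_mode: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    return Settings(
        riva_host=env.get("RIVA_HOST", "localhost"),
        riva_port=int(env.get("RIVA_PORT", "50051")),  # assumed gRPC port
        ws_port=int(env.get("WS_PORT", "8443")),       # port from the diagram
        mock_mode=env.get("MOCK_MODE", "false").lower() in {"1", "true", "yes"},
    )


# Passing a dict instead of os.environ keeps the example deterministic.
settings = load_settings({"RIVA_HOST": "10.0.0.5", "RIVA_PORT": "50052"})
```

Accepting the environment as a parameter makes the loader easy to test without mutating the process environment.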

How it runs

sequenceDiagram
    participant C as Client App
    participant W as WebSocket Server
    participant RC as RivaASRClient
    participant R as NVIDIA Riva ASR
    
    C->>W: Connect WebSocket (ws://.../transcribe)
    W->>C: Connection Established
    loop Streaming Audio
        C->>W: Send Audio Chunk (Binary)
        W->>RC: Forward Audio Data
        RC->>R: gRPC StreamingRecognize Request
        R-->>RC: Partial Result (Interim)
        RC-->>W: Return Partial Transcription
        W-->>C: Send Partial JSON Result
    end
    C->>W: End Stream / Silence
    RC->>R: Finalize Request
    R-->>RC: Final Result (Confidence/Timestamps)
    RC-->>W: Return Final Transcription
    W-->>C: Send Final JSON Result
    W->>C: Close WebSocket
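
The sequence above can be simulated with plain asyncio, which may help when reasoning about the ordering of partial and final messages. This is a sketch under stated assumptions: `fake_recognizer` stands in for the Riva stream, `audio_source` for the WebSocket receive loop, and the JSON fields (`type`, `bytes_received`) are invented for illustration. No real WebSocket or gRPC call is made.

```python
import asyncio
import json
from typing import AsyncIterator, Iterable


async def audio_source(chunks: Iterable[bytes]) -> AsyncIterator[bytes]:
    """Stand-in for the WebSocket receive loop: yields binary audio chunks."""
    for chunk in chunks:
        await asyncio.sleep(0)  # give up control, as a socket read would
        yield chunk


async def fake_recognizer(chunks: AsyncIterator[bytes]) -> AsyncIterator[dict]:
    """Stand-in for the Riva stream: one partial per chunk, one final at the end."""
    total = 0
    async for chunk in chunks:
        total += len(chunk)
        yield {"type": "partial", "bytes_received": total}
    # The input stream has ended, so emit the final result.
    yield {"type": "final", "bytes_received": total}


async def handle_session(chunks: Iterable[bytes]) -> list[str]:
    """Forward audio to the recognizer and collect the JSON messages to send."""
    sent = []
    async for result in fake_recognizer(audio_source(chunks)):
        sent.append(json.dumps(result))
    return sent


# Two 3200-byte chunks, i.e. 100 ms each if the audio is 16 kHz 16-bit mono.
messages = asyncio.run(handle_session([b"\x00" * 3200, b"\x00" * 3200]))
for message in messages:
    print(message)
```

Modeling both sides as async generators keeps the loop body of the diagram (chunk in, partial out) and the finalization step visibly separate.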

How to apply & reuse

Use this project to deploy a scalable, GPU-accelerated speech recognition service on AWS or local hardware. It suits applications that need real-time transcription (e.g., live captioning, voice assistants), where low latency (roughly 100-300 ms for partial results) and high throughput are critical. Developers can integrate the provided Node.js or Python WebSocket clients (`examples/nodejs-client.js`, `examples/python-client.py`) into their frontend or backend services.
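
For readers writing their own client, the following is a rough sketch of the client side, not a copy of `examples/python-client.py`. It assumes 16 kHz, 16-bit mono PCM input, JSON text messages from the server, and a server that closes the connection once the final result has been sent. The source does not specify how end of stream is signalled, so that step is omitted here. `chunk_pcm` and `stream_pcm` are hypothetical names, and `stream_pcm` requires the third-party `websockets` package.

```python
import json

SAMPLE_RATE = 16_000   # assumed: 16 kHz mono audio
BYTES_PER_SAMPLE = 2   # assumed: 16-bit PCM


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> list[bytes]:
    """Split raw PCM into fixed-duration chunks; the last one may be shorter."""
    size = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


async def stream_pcm(url: str, pcm: bytes) -> list[dict]:
    """Send audio as binary frames, then collect the server's JSON results."""
    import websockets  # third-party: pip install websockets

    results = []
    async with websockets.connect(url) as ws:
        for chunk in chunk_pcm(pcm):
            await ws.send(chunk)      # bytes are sent as a binary frame
        # Assumption: the server closes the socket after the final result,
        # which ends this loop. A real client would also read while sending.
        async for message in ws:
            results.append(json.loads(message))
    return results


# 250 ms of silence splits into two full 100 ms chunks plus a 50 ms remainder.
chunks = chunk_pcm(b"\x00" * 8000)
```

To run it against a deployment, pass the server's `/transcribe` WebSocket URL to `stream_pcm` via `asyncio.run`. Smaller chunks should lower the delay before the first partial result, at the cost of more frames per second.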

At a glance

Capabilities: Real-time Speech Recognition, GPU-Accelerated Inference, WebSocket Streaming, AWS Infrastructure Automation, Structured Logging, Mock Mode Testing

Components: src/asr/riva_client.py, docker/rnnt-server-websocket.py, docker/rnnt-server.py, config/settings.py, scripts/riva-010-run-complete-deployment-pipeline.sh, examples/python-client.py, examples/nodejs-client.js

Tech: Shell, Python, NVIDIA Riva, FastAPI, WebSockets, Docker, Pydantic, gRPC

Depends on: AWS EC2 (g4dn.xlarge+), NVIDIA GPU Drivers, Docker & NVIDIA Container Toolkit, Python 3.10+, NVIDIA Riva ServiceMaker

Integrates with: NVIDIA NIM, Traditional Riva Server, AWS S3 (Audio Storage), Browser Clients, Node.js Applications

Patterns: Client-Server, Streaming RPC, Wrapper Facade, Infrastructure as Code (Shell), Observer (Logging)

Reuse tags: asr, speech-to-text, nvidia-riva, websocket-server, gpu-inference, aws-deployment