nvidia-RNN-T-Parakeet · davidbmar.com

What it is

This project provides a production-ready FastAPI wrapper around NVIDIA's Parakeet RNN-T model, served through the NVIDIA Riva ASR framework. It offers ultra-low-latency (~100 ms) real-time transcription with streaming WebSocket support, word-level timestamps, and optional AWS integrations for audio ingestion and event handling.

Features

Ultra-low latency transcription (~100ms) using Parakeet RNN-T
Real-time streaming support via WebSockets
GPU-optimized inference using NVIDIA Riva and TensorRT
Word-level timestamp generation for precise alignment
Dockerized deployment with health checks and CORS support
Optional AWS integration for S3 audio processing and EventBridge events
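
To make the word-level timestamp feature concrete, here is a minimal sketch of what a transcription result with per-word timing could look like, and how a caller might use it for alignment. The field names (`text`, `is_final`, `words`, `start`, `end`) are assumptions chosen for illustration, not the project's confirmed response schema.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class Word:
    word: str
    start: float  # seconds from the start of the audio (assumed unit)
    end: float    # seconds from the start of the audio (assumed unit)


@dataclass
class Transcript:
    text: str
    is_final: bool
    words: List[Word] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the transcript, including nested words, to a JSON string."""
        return json.dumps(asdict(self))


def words_between(t: Transcript, start: float, end: float) -> List[str]:
    """Return the words that fall entirely inside [start, end] seconds."""
    return [w.word for w in t.words if w.start >= start and w.end <= end]
```

With timing attached to each word, a caption renderer or audio editor can select exactly the words spoken in a given time window instead of re-splitting the full text.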

Quickstart

./scripts/step-005-setup-parakeet-environment.sh
./scripts/step-010-install-parakeet-dependencies.sh
./scripts/step-015-download-parakeet-model.sh
./scripts/step-020-test-parakeet-inference.sh
./scripts/step-025-setup-gpu-environment.sh
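
The numbered prefixes suggest the scripts are meant to run in ascending step order. As a rough sketch, a small Python helper could discover and run them in sequence, stopping at the first failure. The helper names `ordered_steps` and `run_steps` are hypothetical and not part of the repository; the only thing taken from the source is the `step-NNN-name.sh` naming pattern.

```python
import re
import subprocess
from pathlib import Path

# Matches names such as "step-005-setup-parakeet-environment.sh".
STEP_RE = re.compile(r"^step-(\d+)-.+\.sh$")


def ordered_steps(names):
    """Sort script file names by numeric step prefix, ignoring non-step files."""
    steps = []
    for name in names:
        match = STEP_RE.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def run_steps(script_dir="scripts"):
    """Run each step script in order; check=True raises on the first failure."""
    names = [p.name for p in Path(script_dir).glob("step-*.sh")]
    for name in ordered_steps(names):
        print(f"running {name}")
        subprocess.run(["bash", str(Path(script_dir) / name)], check=True)
```

Calling `run_steps()` from the repository root would then execute the steps listed above one after another.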

Architecture

flowchart TD
    Client[Client App] -->|WebSocket/HTTP| API[FastAPI Server]
    API -->|gRPC| Riva[NVIDIA Riva Service]
    Riva -->|TensorRT| GPU[NVIDIA GPU]
    GPU -->|Inference Result| Riva
    Riva -->|Transcription| API
    API -->|Store/Event| AWS[AWS S3 / EventBridge]
    subgraph Docker Container
        API
        Riva
    end
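
The `Store/Event` edge in the diagram can be illustrated with a short sketch of how the API layer might build an EventBridge entry once a transcription finishes. The `Source` and `DetailType` values and the fields inside `Detail` are hypothetical; only the entry keys themselves (`Source`, `DetailType`, `Detail`, `EventBusName`) follow the shape that EventBridge's `PutEvents` call expects.

```python
import json
from datetime import datetime, timezone


def build_transcription_event(audio_key, text, bus_name="default"):
    """Build one PutEvents entry describing a finished transcription."""
    detail = {
        "audio_key": audio_key,  # hypothetical: S3 key of the source audio
        "text": text,
        "completed_at": datetime.now(timezone.utc).isoformat(),
    }
    return {
        "Source": "parakeet.transcription",      # hypothetical source name
        "DetailType": "TranscriptionCompleted",  # hypothetical detail type
        "Detail": json.dumps(detail),            # Detail must be a JSON string
        "EventBusName": bus_name,
    }
```

If boto3 is installed and AWS credentials are configured, the entry could be sent with `boto3.client("events").put_events(Entries=[entry])`. Keeping the payload builder separate from the network call makes the optional AWS path easy to test or disable.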

How it's built

The system is built as a Dockerized Python application. A series of numbered shell scripts handles environment setup, dependency installation, model download from NGC, and GPU configuration. The core service is a FastAPI server that uses the Riva client SDK to run inference on the Parakeet 1.1B English model.

How it runs

sequenceDiagram
    participant C as Client
    participant F as FastAPI Server
    participant R as Riva Service
    participant G as GPU
    C->>F: POST /transcribe or WS Connect
    F->>R: StreamingRecognize Request (audio chunks)
    R->>G: Execute Parakeet Model Inference
    G-->>R: Return Logits/Text
    R-->>F: Partial/Final Transcription Results
    F-->>C: JSON Response with Text & Timestamps
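
Two steps in this sequence lend themselves to a small illustration: splitting audio into chunks before sending it, and folding partial and final results into the text shown to the client. The sketch below assumes 16 kHz, 16-bit mono PCM and 100 ms chunks, and models each result as a hypothetical `(text, is_final)` tuple; the actual Riva result objects and the project's chunk size may differ.

```python
from typing import Iterable, Iterator, List, Tuple


def pcm_chunks(
    pcm: bytes,
    sample_rate: int = 16000,  # assumed sample rate
    chunk_ms: int = 100,       # assumed chunk duration
    sample_width: int = 2,     # bytes per sample (16-bit)
) -> Iterator[bytes]:
    """Yield consecutive chunk_ms slices of mono PCM; the last may be shorter."""
    step = sample_rate * chunk_ms // 1000 * sample_width
    for offset in range(0, len(pcm), step):
        yield pcm[offset:offset + step]


def fold_results(results: Iterable[Tuple[str, bool]]) -> Iterator[str]:
    """Yield the transcript to display after each (text, is_final) result.

    Final results are committed permanently; a partial result is shown
    after the committed text and replaced by whatever arrives next.
    """
    committed: List[str] = []
    for text, is_final in results:
        if is_final:
            committed.append(text)
            yield " ".join(committed)
        else:
            yield " ".join(committed + [text])
```

Treating partial results as provisional is what lets a live caption update in place while the speaker is still mid-sentence, without duplicating words once the final result arrives.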

How to apply & reuse

Deploy this system where low-latency transcription is critical, such as live captioning, voice assistants, or real-time meeting notes. It requires an NVIDIA GPU with CUDA support and an NGC API key. It can be integrated into larger pipelines via its REST API or WebSocket endpoints, with optional hooks for AWS S3 and EventBridge.
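
Before deploying, it can help to check the stated prerequisites up front. The following is a minimal preflight sketch; the environment variable name `NGC_API_KEY` and the choice of probing for `nvidia-smi` and `docker` on the PATH are assumptions, not something the project's scripts are confirmed to do.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable prerequisites that appear to be missing."""
    env = os.environ if env is None else env
    missing = []
    # Assumed variable name for the NGC API key.
    if not env.get("NGC_API_KEY"):
        missing.append("NGC_API_KEY environment variable")
    # Finding the tools on PATH is only a coarse check that they are installed.
    for tool in ("nvidia-smi", "docker"):
        if which(tool) is None:
            missing.append(f"{tool} on PATH")
    return missing
```

An empty list means the basic checks passed. It does not prove that the GPU is usable from inside a container, which still depends on nvidia-container-toolkit being configured.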

At a glance

Capabilities: Speech-to-Text Transcription, Real-time Streaming Audio Processing, Word-level Timing Analysis, REST API Serving, WebSocket Communication, Cloud Storage Integration

Components: FastAPI Application, NVIDIA Riva ASR Client, Parakeet RNN-T Model, Docker Runtime, Setup Scripts (Shell), AWS Boto3 Client

Tech: Python, Shell Script, FastAPI, NVIDIA Riva, Docker, gRPC, CUDA, TensorRT

Depends on: NVIDIA GPU with CUDA, NGC API Key, Docker with nvidia-container-toolkit, AWS CLI (optional)

Integrates with: AWS S3, AWS EventBridge, AWS SQS, Web Browsers (via WebSocket)

Patterns: Microservices, Producer-Consumer (Audio Stream), Wrapper Pattern (Riva Client), Infrastructure as Code (Scripts)

Reuse tags: speech-recognition, nvidia-riva, parakeet-model, fastapi, gpu-acceleration, real-time-transcription