Real-Time ASR Service with NVIDIA NeMo Parakeet

What it is

This project implements a low-latency Automatic Speech Recognition (ASR) service for real-time transcription. It uses the NVIDIA NeMo framework and the Parakeet-CTC-0.6B model to convert spoken audio into text. The system accepts audio streams over WebSockets from a browser client, normalizes them in an Audio Gateway, filters silence with Silero Voice Activity Detection (VAD), and runs streaming inference to deliver both partial and final transcripts. It is built with Python 3.12 and FastAPI and supports GPU acceleration to serve many concurrent sessions.

Features

Real-time streaming transcription over WebSockets with p95 latency under 500 ms
Integrated Silero VAD for efficient silence filtering and audio segmentation
Partial and final transcripts with stabilization for responsive UI updates
GPU-accelerated inference using NVIDIA NeMo Parakeet-CTC-0.6B model
Built-in browser demo UI for immediate testing and integration reference
Session management with configurable concurrency limits and timeout handling

Quickstart

pip install -r requirements.txt
python download_models.py
python main.py
open http://localhost:8000

Architecture

flowchart TD
    Client[Browser Client] -->|WebSocket Audio Stream| GW[Audio Gateway]
    GW -->|Preprocessed Chunks| VAD[Silero VAD]
    VAD -->|Active Speech Segments| ASR[NeMo Parakeet ASR]
    ASR -->|Partial/Final Text| SM[Session Manager]
    SM -->|Transcript JSON| Client
    subgraph Backend
        GW
        VAD
        ASR
        SM
    end

How it's built

The backend is constructed using FastAPI for asynchronous HTTP and WebSocket handling. Core ASR logic relies on NVIDIA NeMo's `EncDecCTCModel` for transcription and PyTorch for Silero VAD. Audio processing includes resampling and normalization within a custom Audio Gateway. Session management handles concurrent WebSocket connections, ensuring resource limits are respected. The frontend is a lightweight HTML/JS demo that captures microphone input and streams it to the backend. Deployment is containerized via Docker with GPU support (`--gpus all`).
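
The resampling and normalization step can be sketched with the standard library alone. This is an illustration, not the project's Audio Gateway: it assumes the browser sends 16-bit little-endian mono PCM and that the target is 16 kHz (a common rate for speech models; confirm against the model card). All function names here are hypothetical.

```python
import sys
from array import array


def pcm16_to_float(chunk: bytes) -> list[float]:
    """Decode little-endian 16-bit PCM into floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(chunk)
    if sys.byteorder == "big":
        samples.byteswap()  # the wire format is assumed to be little-endian
    return [s / 32768.0 for s in samples]


def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample by linear interpolation (no anti-aliasing filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    out = []
    for i in range(out_len):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def peak_normalize(samples: list[float], target_peak: float = 0.9) -> list[float]:
    """Scale so the largest absolute sample equals target_peak; silence is unchanged."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def preprocess(chunk: bytes, src_rate: int, dst_rate: int = 16000) -> list[float]:
    """Decode, resample, and normalize one incoming chunk."""
    return peak_normalize(resample_linear(pcm16_to_float(chunk), src_rate, dst_rate))
```

Linear interpolation keeps the sketch dependency-free, but it does not low-pass filter before downsampling, so a production gateway would more likely use a dedicated resampler. Per-chunk peak normalization is also a simplification, since it changes gain from chunk to chunk.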

How it runs

sequenceDiagram
    participant Browser
    participant WS as WebSocket Handler
    participant GW as Audio Gateway
    participant VAD as Silero VAD
    participant ASR as NeMo Engine
    
    Browser->>WS: Connect & Start Session
    WS->>WS: Register Session
    loop Audio Stream
        Browser->>WS: Send Audio Chunk
        WS->>GW: Preprocess (Resample/Normalize)
        GW->>VAD: Check Voice Activity
        alt Speech Detected
            VAD->>ASR: Forward Audio Segment
            ASR->>ASR: Streaming Inference
            ASR-->>WS: Return Partial/Final Transcript
            WS-->>Browser: Send Transcript Update
        else Silence
            VAD-->>WS: Ignore Chunk
        end
    end
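
The loop in the diagram above can be sketched as a generator that gates chunks on voice activity. This sketch makes several assumptions: `energy_vad` is a simple energy threshold standing in for Silero VAD, `transcribe` is any callable standing in for the NeMo engine, and a segment is finalized on the first silent chunk after speech. The source does not specify how finals are triggered, and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class TranscriptEvent:
    kind: str  # "partial" or "final"
    text: str


def energy_vad(chunk: list[float], threshold: float = 0.01) -> bool:
    """Stand-in for Silero VAD: mean-square energy above a threshold."""
    if not chunk:
        return False
    return sum(s * s for s in chunk) / len(chunk) > threshold


def run_pipeline(
    chunks: Iterable[list[float]],
    is_speech: Callable[[list[float]], bool],
    transcribe: Callable[[list[float]], str],
) -> Iterator[TranscriptEvent]:
    """Yield a partial per speech chunk and a final when the segment ends."""
    segment: list[float] = []
    for chunk in chunks:
        if is_speech(chunk):
            segment.extend(chunk)
            yield TranscriptEvent("partial", transcribe(segment))
        elif segment:
            # First silent chunk after speech closes the segment.
            yield TranscriptEvent("final", transcribe(segment))
            segment = []
        # Silence with no open segment is ignored.
    if segment:
        # The stream ended mid-speech: flush what was collected.
        yield TranscriptEvent("final", transcribe(segment))
```

Re-transcribing the whole segment on every chunk keeps the example short; a streaming engine would normally carry decoder state forward instead. Real systems also tend to wait for a short run of silence before finalizing, so that brief pauses do not split an utterance.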

How to apply & reuse

Use this service as a backend for voice-enabled web applications, meeting transcription tools, or live captioning systems. It suits developers who need a self-hosted, GPU-accelerated alternative to cloud ASR APIs. The modular architecture allows the ASR model or the VAD component to be swapped independently. It integrates into existing Python microservice ecosystems through its REST and WebSocket interfaces.

At a glance

Capabilities: Streaming ASR, Voice Activity Detection, WebSocket Communication, GPU Acceleration, Session Management, API Key Authentication

Components: FastAPI Server, NeMo ASR Engine, Silero VAD Module, Audio Gateway, Session Manager, Browser Demo UI

Tech: Python 3.12, FastAPI, NVIDIA NeMo, PyTorch, WebSockets, Docker

Depends on: nemo-toolkit, torch, fastapi, uvicorn, websockets, numpy

Integrates with: Web Browsers, GPU Hardware (CUDA), REST Clients

Patterns: Producer-Consumer, Singleton Model Loading, Async WebSocket Handling, Middleware Authentication

Reuse tags: asr, speech-to-text, nvidia-nemo, real-time, websocket, fastapi, vad