Real-Time ASR Service with NVIDIA NeMo Parakeet

A production-ready streaming speech-to-text API using NVIDIA's Parakeet-CTC-0.6B model, featuring WebSocket audio ingestion, Silero VAD, and FastAPI.

https://github.com/davidbmar/2026-jan-voice-speech-nemo-framework-model-Parakeet-CTC-0.6B-nvidia-asr-service  ·  public  ·  shipped

What it is

This project implements a low-latency Automatic Speech Recognition (ASR) service for real-time transcription. It uses the NVIDIA NeMo framework and the Parakeet-CTC-0.6B model to convert spoken audio into text. A browser client streams audio over WebSockets; the backend normalizes it in an Audio Gateway, filters out silence with Silero Voice Activity Detection (VAD), and runs streaming inference that returns both partial and final transcripts. The service is built with Python 3.12 and FastAPI and supports GPU acceleration for serving many concurrent sessions.
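
The partial/final distinction can be illustrated with a small message type. This is a minimal sketch, not the service's actual wire format: the `TranscriptMessage` class, its field names, and the `apply_update` helper are all hypothetical. It assumes a partial transcript replaces the line currently being spoken, while a final transcript commits it.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TranscriptMessage:
    """Hypothetical transcript payload; field names are assumptions."""
    session_id: str
    text: str
    is_final: bool

    def to_json(self) -> str:
        # Serialize to the JSON string that would be sent over the WebSocket.
        return json.dumps(asdict(self))


def apply_update(finalized: list[str], msg: TranscriptMessage) -> str:
    """Return the text a client would display after receiving msg.

    Partials are shown after the committed lines but not stored;
    finals are appended to the committed lines.
    """
    if msg.is_final:
        finalized.append(msg.text)
        return " ".join(finalized)
    return " ".join(finalized + [msg.text])
```

A client would keep one `finalized` list per session and redraw its display with the return value of `apply_update` on every incoming message.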

Quickstart

pip install -r requirements.txt
python download_models.py
python main.py
open http://localhost:8000

Architecture

flowchart TD
    Client[Browser Client] -->|WebSocket Audio Stream| GW[Audio Gateway]
    GW -->|Preprocessed Chunks| VAD[Silero VAD]
    VAD -->|Active Speech Segments| ASR[NeMo Parakeet ASR]
    ASR -->|Partial/Final Text| SM[Session Manager]
    SM -->|Transcript JSON| Client
    subgraph Backend
        GW
        VAD
        ASR
        SM
    end
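
The Session Manager's role in the diagram can be sketched as a registry that caps the number of live sessions. This is an illustrative sketch only: the class and method names and the `max_sessions` default are assumptions, not taken from the repository. It also assumes all calls happen on a single asyncio event loop, so no locking is shown.

```python
import time


class SessionLimitError(RuntimeError):
    """Raised when the service is already at its session cap."""


class SessionManager:
    """Hypothetical registry of active WebSocket sessions."""

    def __init__(self, max_sessions: int = 4) -> None:
        self._max_sessions = max_sessions
        self._started_at: dict[str, float] = {}  # session id -> start time

    def register(self, session_id: str) -> None:
        if session_id in self._started_at:
            raise ValueError(f"session {session_id!r} already registered")
        if len(self._started_at) >= self._max_sessions:
            raise SessionLimitError("too many concurrent sessions")
        self._started_at[session_id] = time.monotonic()

    def unregister(self, session_id: str) -> None:
        # Safe to call twice, e.g. from both an error path and a finally block.
        self._started_at.pop(session_id, None)

    @property
    def active(self) -> int:
        return len(self._started_at)
```

A WebSocket handler would typically call `register` right after accepting the connection and `unregister` in a `finally` block, so a dropped client frees its slot.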

How it's built

The backend uses FastAPI for asynchronous HTTP and WebSocket handling. Transcription relies on NVIDIA NeMo's `EncDecCTCModel`, and Silero VAD runs on PyTorch. A custom Audio Gateway resamples and normalizes incoming audio. The session manager tracks concurrent WebSocket connections and enforces resource limits. The frontend is a lightweight HTML/JS demo that captures microphone input and streams it to the backend. The service is containerized with Docker and runs with GPU support (`--gpus all`).
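
The Audio Gateway's resample and normalize steps might look roughly like the following. This is a standard-library sketch under stated assumptions: the input is 16-bit little-endian mono PCM, the target rate is 16 kHz, and resampling uses plain linear interpolation. All function names are hypothetical, and the repository's gateway may work differently.

```python
import sys
from array import array

TARGET_RATE = 16_000  # assumed model input rate, in Hz


def pcm16_to_float(chunk: bytes) -> list[float]:
    """Decode 16-bit little-endian PCM into floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(chunk[: len(chunk) - len(chunk) % 2])  # drop a stray odd byte
    if sys.byteorder == "big":
        samples.byteswap()  # array uses native order; input is little-endian
    return [s / 32768.0 for s in samples]


def resample_linear(samples: list[float], src_rate: int,
                    dst_rate: int = TARGET_RATE) -> list[float]:
    """Resample by linear interpolation between neighbouring samples."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = max(1, int(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def peak_normalize(samples: list[float], peak: float = 0.95) -> list[float]:
    """Scale so the loudest sample has magnitude peak; silence is unchanged."""
    loudest = max((abs(s) for s in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)
    return [s * peak / loudest for s in samples]
```

Linear interpolation applies no low-pass filter, so downsampling this way can alias. A production gateway would more likely use a filtered resampler from an audio library.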

How it runs

sequenceDiagram
    participant Browser
    participant WS as WebSocket Handler
    participant GW as Audio Gateway
    participant VAD as Silero VAD
    participant ASR as NeMo Engine
    
    Browser->>WS: Connect & Start Session
    WS->>WS: Register Session
    loop Audio Stream
        Browser->>WS: Send Audio Chunk
        WS->>GW: Preprocess (Resample/Normalize)
        GW->>VAD: Check Voice Activity
        alt Speech Detected
            VAD->>ASR: Forward Audio Segment
            ASR->>ASR: Streaming Inference
            ASR-->>WS: Return Partial/Final Transcript
            WS-->>Browser: Send Transcript Update
        else Silence
            VAD-->>WS: Ignore Chunk
        end
    end
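
The speech/silence branch in the diagram amounts to a small gating loop. In the sketch below, an RMS energy score stands in for Silero VAD so the example runs without PyTorch. The `threshold`, the `hangover_frames` parameter, and all names are assumptions for illustration, not the project's settings.

```python
from typing import Callable, Iterable, Iterator

Frame = list[float]


def energy_speech_prob(frame: Frame) -> float:
    """Stand-in for a VAD model: RMS energy mapped onto 0..1 (arbitrary scale)."""
    if not frame:
        return 0.0
    rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
    return min(1.0, rms * 10.0)


def gate_speech(
    frames: Iterable[Frame],
    speech_prob: Callable[[Frame], float] = energy_speech_prob,
    threshold: float = 0.5,
    hangover_frames: int = 2,
) -> Iterator[list[Frame]]:
    """Yield speech segments and drop silence.

    A segment stays open across up to hangover_frames consecutive silent
    frames, so a short pause does not split an utterance.
    """
    segment: list[Frame] = []
    silent_run = 0
    for frame in frames:
        if speech_prob(frame) >= threshold:
            segment.append(frame)
            silent_run = 0
        elif segment:
            silent_run += 1
            if silent_run > hangover_frames:
                yield segment  # pause was long enough: close the segment
                segment, silent_run = [], 0
            else:
                segment.append(frame)  # keep the short pause inside the segment
        # Silent frames outside any segment are simply ignored.
    if segment:
        yield segment  # flush whatever is open when the stream ends
```

Each yielded segment would then go to the ASR engine. Swapping in a real VAD only requires passing a different `speech_prob` callable.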

How to apply & reuse

Use this service as a backend for voice-enabled web applications, meeting transcription tools, or live captioning systems. It suits developers who need a self-hosted, GPU-accelerated alternative to cloud ASR APIs. The modular architecture allows the ASR model or the VAD component to be swapped. Other services, including existing Python microservices, can integrate through its REST and WebSocket interfaces.
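
One way to keep the ASR model swappable is to have the pipeline depend on a narrow interface instead of a concrete engine. The sketch below shows the idea with `typing.Protocol`. `AsrEngine`, `EchoEngine`, and `Transcriber` are hypothetical names, and the repository may structure this differently.

```python
from typing import Protocol, Sequence


class AsrEngine(Protocol):
    """Anything with this method can serve as the ASR backend."""

    def transcribe(self, samples: Sequence[float], sample_rate: int) -> str: ...


class EchoEngine:
    """Placeholder engine that describes its input instead of decoding it."""

    def transcribe(self, samples: Sequence[float], sample_rate: int) -> str:
        return f"<{len(samples)} samples @ {sample_rate} Hz>"


class Transcriber:
    """Pipeline stage that only knows about the AsrEngine interface."""

    def __init__(self, engine: AsrEngine) -> None:
        self._engine = engine

    def handle_segment(self, samples: Sequence[float],
                       sample_rate: int = 16_000) -> dict:
        text = self._engine.transcribe(samples, sample_rate)
        return {"text": text, "is_final": True}
```

A NeMo-backed engine would implement the same `transcribe` method, so replacing the model changes only the object passed to `Transcriber`.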

At a glance

Capabilities: Streaming ASR, Voice Activity Detection, WebSocket Communication, GPU Acceleration, Session Management, API Key Authentication
Components: FastAPI Server, NeMo ASR Engine, Silero VAD Module, Audio Gateway, Session Manager, Browser Demo UI
Tech: Python 3.12, FastAPI, NVIDIA NeMo, PyTorch, WebSockets, Docker
Depends on: nemo-toolkit, torch, fastapi, uvicorn, websockets, numpy
Integrates with: Web Browsers, GPU Hardware (CUDA), REST Clients
Patterns: Producer-Consumer, Singleton Model Loading, Async WebSocket Handling, Middleware Authentication
Reuse tags: asr, speech-to-text, nvidia-nemo, real-time, websocket, fastapi, vad
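
The "Singleton Model Loading" pattern listed above can be sketched with a cached factory, so the expensive model load happens once per process and every session shares the result. `FakeModel` and `get_model` are stand-ins invented for this example. The real service would load the Parakeet checkpoint through NeMo instead.

```python
from functools import lru_cache


class FakeModel:
    """Stand-in for a large ASR model; counts how often it is constructed."""

    instances = 0

    def __init__(self, name: str) -> None:
        FakeModel.instances += 1
        self.name = name


@lru_cache(maxsize=1)
def get_model(name: str = "parakeet-ctc-0.6b") -> FakeModel:
    # Assumption: a real implementation would load the NeMo model here.
    # FakeModel keeps the sketch runnable without a GPU or a download.
    return FakeModel(name)
```

Repeated calls to `get_model()` return the same object, so request handlers can call it freely without reloading weights.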

Repo hygiene

✓ all on main — nothing unmerged.