A production-ready streaming speech-to-text API using NVIDIA's Parakeet-CTC-0.6B model, featuring WebSocket audio ingestion, Silero VAD, and FastAPI.
https://github.com/davidbmar/2026-jan-voice-speech-nemo-framework-model-Parakeet-CTC-0.6B-nvidia-asr-service · public · shipped
This project implements a low-latency Automatic Speech Recognition (ASR) service for real-time transcription. It uses the NVIDIA NeMo framework and the Parakeet-CTC-0.6B model to convert spoken audio into text. The system accepts audio streams over WebSockets from a browser client, normalizes them in an Audio Gateway, filters out silence with Silero Voice Activity Detection (VAD), and runs streaming inference to deliver both partial and final transcripts. It is built with Python 3.12 and FastAPI and supports GPU acceleration for serving concurrent sessions.
pip install -r requirements.txt
python download_models.py
python main.py
open http://localhost:8000
flowchart TD
Client[Browser Client] -->|WebSocket Audio Stream| GW[Audio Gateway]
GW -->|Preprocessed Chunks| VAD[Silero VAD]
VAD -->|Active Speech Segments| ASR[NeMo Parakeet ASR]
ASR -->|Partial/Final Text| SM[Session Manager]
SM -->|Transcript JSON| Client
subgraph Backend
GW
VAD
ASR
SM
end
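The Audio Gateway stage in the flowchart turns raw client audio into the preprocessed chunks that the VAD consumes. The following is a minimal, standard-library sketch of that idea, not the project's actual gateway. It assumes the browser sends 16-bit little-endian mono PCM, that the target rate is 16 kHz, and that peak normalization is acceptable; the function names are hypothetical, and the linear-interpolation resampler has no anti-aliasing filter, so it is for illustration only.

```python
import sys
from array import array
from typing import List

TARGET_RATE = 16_000  # assumed target sample rate; not stated in the source


def pcm16_to_float(data: bytes) -> List[float]:
    """Decode little-endian 16-bit mono PCM into floats in [-1.0, 1.0)."""
    samples = array("h")
    samples.frombytes(data[: len(data) - len(data) % 2])  # drop a trailing odd byte
    if sys.byteorder == "big":
        samples.byteswap()
    return [s / 32768.0 for s in samples]


def resample_linear(samples: List[float], src_rate: int,
                    dst_rate: int = TARGET_RATE) -> List[float]:
    """Resample by linear interpolation (no low-pass filter; illustration only)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate
    last = len(samples) - 1
    out = []
    for i in range(out_len):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def peak_normalize(samples: List[float], target_peak: float = 0.9) -> List[float]:
    """Scale the chunk so its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


def preprocess(data: bytes, src_rate: int) -> List[float]:
    """Hypothetical gateway entry point: decode, resample, then normalize."""
    return peak_normalize(resample_linear(pcm16_to_float(data), src_rate))
```

A real gateway would more likely normalize over a longer window than a single chunk, since per-chunk peak scaling amplifies background noise in quiet chunks.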
The backend is constructed using FastAPI for asynchronous HTTP and WebSocket handling. Core ASR logic relies on NVIDIA NeMo's `EncDecCTCModel` for transcription and PyTorch for Silero VAD. Audio processing includes resampling and normalization within a custom Audio Gateway. Session management handles concurrent WebSocket connections, ensuring resource limits are respected. The frontend is a lightweight HTML/JS demo that captures microphone input and streams it to the backend. Deployment is containerized via Docker with GPU support (`--gpus all`).
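To illustrate how session management might enforce a resource limit on concurrent connections, here is a small sketch. It is an assumption-laden illustration rather than the project's code: `SessionManager`, `SessionLimitReached`, and the `max_sessions` parameter are hypothetical names, and the registry is kept in memory. Because `register` and `release` contain no `await` points, they are safe to call from handlers running on a single asyncio event loop without extra locking.

```python
import uuid
from dataclasses import dataclass
from typing import Dict


class SessionLimitReached(Exception):
    """Raised when a new session would exceed the configured limit."""


@dataclass
class Session:
    session_id: str
    chunks_received: int = 0  # example per-session counter


class SessionManager:
    """In-memory registry that caps the number of concurrent sessions."""

    def __init__(self, max_sessions: int = 4) -> None:
        self._max_sessions = max_sessions
        self._sessions: Dict[str, Session] = {}

    @property
    def active(self) -> int:
        return len(self._sessions)

    def register(self) -> Session:
        # Reject the connection up front instead of overloading the GPU.
        if len(self._sessions) >= self._max_sessions:
            raise SessionLimitReached(
                f"limit of {self._max_sessions} concurrent sessions reached"
            )
        session = Session(session_id=uuid.uuid4().hex)
        self._sessions[session.session_id] = session
        return session

    def release(self, session_id: str) -> None:
        # Idempotent: releasing an unknown or already released id is a no-op.
        self._sessions.pop(session_id, None)
```

In a WebSocket handler, `register` would be called when the connection opens and `release` in a `finally` block, so that a dropped connection always frees its slot.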
sequenceDiagram
participant Browser
participant WS as WebSocket Handler
participant GW as Audio Gateway
participant VAD as Silero VAD
participant ASR as NeMo Engine
Browser->>WS: Connect & Start Session
WS->>WS: Register Session
loop Audio Stream
Browser->>WS: Send Audio Chunk
WS->>GW: Preprocess (Resample/Normalize)
GW->>VAD: Check Voice Activity
alt Speech Detected
VAD->>ASR: Forward Audio Segment
ASR->>ASR: Streaming Inference
ASR-->>WS: Return Partial/Final Transcript
WS-->>Browser: Send Transcript Update
else Silence
VAD-->>WS: Ignore Chunk
end
end
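The loop in the sequence diagram can be sketched as a small gate that buffers speech, drops silence, and emits partial and final transcript messages. This is a hedged illustration only: the energy threshold below is a stand-in for the VAD decision and is NOT Silero VAD, the `transcribe` callable stands in for the NeMo engine, and the `SpeechGate` class, the `hangover` parameter, and the `type`/`text` JSON fields are hypothetical.

```python
import json
import math
from typing import Callable, List, Optional

Chunk = List[float]


def rms(chunk: Chunk) -> float:
    """Root-mean-square energy of a chunk (0.0 for an empty chunk)."""
    if not chunk:
        return 0.0
    return math.sqrt(sum(s * s for s in chunk) / len(chunk))


class SpeechGate:
    """Buffers speech chunks and finalizes a segment after trailing silence."""

    def __init__(self, transcribe: Callable[[Chunk], str],
                 threshold: float = 0.02, hangover: int = 2) -> None:
        self.transcribe = transcribe  # stand-in for the ASR engine
        self.threshold = threshold    # energy threshold, a stand-in for a real VAD
        self.hangover = hangover      # silent chunks required to close a segment
        self._buffer: Chunk = []
        self._silent_chunks = 0

    def _message(self, kind: str) -> str:
        return json.dumps({"type": kind, "text": self.transcribe(self._buffer)})

    def feed(self, chunk: Chunk) -> Optional[str]:
        """Return a JSON message for the client, or None if the chunk is ignored."""
        if rms(chunk) >= self.threshold:
            # Speech: extend the current segment and report a partial result.
            self._buffer.extend(chunk)
            self._silent_chunks = 0
            return self._message("partial")
        if not self._buffer:
            return None  # silence outside any segment is dropped
        self._silent_chunks += 1
        if self._silent_chunks < self.hangover:
            return None  # brief pause: keep the segment open
        message = self._message("final")
        self._buffer = []
        self._silent_chunks = 0
        return message
```

Re-transcribing the whole buffer for every partial keeps the sketch short; a streaming engine would instead carry decoder state between chunks.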
Use this service as a backend for voice-enabled web applications, meeting transcription tools, or live captioning systems. It suits developers who need a self-hosted, GPU-accelerated alternative to cloud ASR APIs. The modular architecture allows the ASR model or the VAD component to be swapped. It integrates with existing Python microservices through its REST/WebSocket interface.
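One way to get the kind of swappable components described above is structural typing: the pipeline depends only on small interfaces, so any VAD or ASR backend that provides the right method can be dropped in. The sketch below shows the pattern with `typing.Protocol`; the interface names, method names, and stub classes are all hypothetical and are not taken from the project.

```python
from typing import Optional, Protocol, Sequence


class VoiceActivityDetector(Protocol):
    def is_speech(self, samples: Sequence[float]) -> bool: ...


class Transcriber(Protocol):
    def transcribe(self, samples: Sequence[float]) -> str: ...


class Pipeline:
    """Depends only on the two interfaces, not on concrete VAD/ASR classes."""

    def __init__(self, vad: VoiceActivityDetector, asr: Transcriber) -> None:
        self.vad = vad
        self.asr = asr

    def process(self, samples: Sequence[float]) -> Optional[str]:
        if not self.vad.is_speech(samples):
            return None  # silence never reaches the ASR backend
        return self.asr.transcribe(samples)


# Stub implementations for illustration; real ones would wrap actual models.
class AmplitudeVAD:
    def is_speech(self, samples: Sequence[float]) -> bool:
        return any(abs(s) > 0.1 for s in samples)


class SampleCountASR:
    def transcribe(self, samples: Sequence[float]) -> str:
        return f"{len(samples)} samples"


pipeline = Pipeline(vad=AmplitudeVAD(), asr=SampleCountASR())
```

Swapping a component then means passing a different object to `Pipeline`; no inheritance from a shared base class is required.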