NVIDIA ASR Implementations on Modal

This repository demonstrates several ways to deploy NVIDIA's Nemotron-Speech ASR and Parakeet ASR models on Modal for batch and streaming transcription.

Install and set up Modal

Create a Modal account if you don't already have one, then install the client:

pip install modal

Authenticate your Modal account:

modal setup

Nemotron-Speech ASR

Nemotron-Speech ASR is an open-weights model that can serve many concurrent streaming clients. It outputs partial and final transcripts, includes capitalization and punctuation, and supports word boosting for domain-specific vocabulary.

NeMo does not currently provide an implementation for asynchronous, concurrent clients. This codebase includes extensions to NeMo, built on NeMo's inference Pipeline, that enable batched, cache-aware inference for asynchronous streaming clients.
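
To make the idea of batched, cache-aware inference over asynchronous clients concrete, here is a minimal sketch using only `asyncio`. It is not the code in this repository: `fake_model_step`, `MAX_BATCH`, `MAX_WAIT_S`, and the integer "cache" are hypothetical stand-ins for a real batched forward pass and its per-stream state. It also assumes each client waits for a result before sending its next chunk, so a batch never holds two chunks from the same stream.

```python
import asyncio
from typing import Dict, List, Tuple

MAX_BATCH = 4       # hypothetical cap on chunks per model step
MAX_WAIT_S = 0.01   # hypothetical time to wait for a batch to fill


def fake_model_step(chunks: List[bytes], caches: List[int]) -> Tuple[List[str], List[int]]:
    """Stand-in for a batched forward pass; the 'cache' is a running byte count."""
    new_caches = [cache + len(chunk) for chunk, cache in zip(chunks, caches)]
    return [f"{n} bytes seen" for n in new_caches], new_caches


async def batch_worker(queue: asyncio.Queue) -> None:
    """Collect chunks from many clients, run one batched step, route results back."""
    caches: Dict[str, int] = {}  # per-client streaming state
    loop = asyncio.get_running_loop()
    while True:
        items = [await queue.get()]          # block until at least one chunk arrives
        deadline = loop.time() + MAX_WAIT_S
        while len(items) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        chunks = [chunk for _, chunk, _ in items]
        batch_caches = [caches.get(cid, 0) for cid, _, _ in items]
        texts, new_caches = fake_model_step(chunks, batch_caches)
        for (cid, _, fut), text, cache in zip(items, texts, new_caches):
            caches[cid] = cache              # carry state to the client's next chunk
            fut.set_result(text)


async def transcribe_chunk(queue: asyncio.Queue, client_id: str, chunk: bytes) -> str:
    """Submit one chunk and wait for the result of the batch it lands in."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((client_id, chunk, fut))
    return await fut


async def client(queue: asyncio.Queue, client_id: str, chunks: List[bytes]) -> List[str]:
    return [await transcribe_chunk(queue, client_id, chunk) for chunk in chunks]


async def main() -> List[List[str]]:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    results = await asyncio.gather(
        client(queue, "a", [b"12", b"345"]),
        client(queue, "b", [b"1", b"1", b"1"]),
    )
    worker.cancel()
    return results
```

Running `asyncio.run(main())` drives two concurrent clients through the same worker. The point of the design is that the per-client cache lives with the worker, so chunks from different streams can share a batch without mixing state.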

Deployment

modal deploy -m nemotron_asr.nemotron_asr

Batch and Streaming Parakeet

Implementation

1. Batch Transcription (`parakeet/parakeet.py`)

The core Parakeet transcriber runs on GPU and handles both single audio files and batches:

- Accepts audio as `bytes` or `list[bytes]`
- Processes batches of up to `BATCH_SIZE = 128` for efficient GPU utilization
- Exposes a Modal method that can be called from anywhere
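
The input handling described above can be sketched in plain Python. This is an illustration rather than the code in `parakeet/parakeet.py`: `normalize_input`, `iter_batches`, and `transcribe_all` are hypothetical helper names, and `transcribe_batch` stands in for whatever callable actually runs the model on a list of clips.

```python
from typing import Callable, Iterator, List, Union

BATCH_SIZE = 128  # batch size quoted above


def normalize_input(audio: Union[bytes, List[bytes]]) -> List[bytes]:
    """Wrap a single clip in a list so both input shapes share one code path."""
    if isinstance(audio, (bytes, bytearray)):
        return [bytes(audio)]
    return list(audio)


def iter_batches(clips: List[bytes], batch_size: int = BATCH_SIZE) -> Iterator[List[bytes]]:
    """Yield consecutive slices of at most batch_size clips."""
    for start in range(0, len(clips), batch_size):
        yield clips[start:start + batch_size]


def transcribe_all(
    audio: Union[bytes, List[bytes]],
    transcribe_batch: Callable[[List[bytes]], List[str]],
) -> List[str]:
    """Run a (hypothetical) batch transcriber over the input, one batch at a time."""
    results: List[str] = []
    for batch in iter_batches(normalize_input(audio)):
        results.extend(transcribe_batch(batch))
    return results
```

With 300 clips, for example, the model would be invoked three times, on batches of 128, 128, and 44 clips.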

2. Streaming with VAD Segmentation (`parakeet/vad_segmenter.py`)

For Parakeet models that don't natively support streaming, we use Voice Activity Detection (VAD) to segment the stream:

Audio Stream → VAD Segmenter (CPU) → Parakeet Transcriber (GPU)

The VAD segmenter:

- Runs as a separate Modal function (CPU-only, no GPU)
- Uses Silero VAD (via Pipecat's wrapper) to detect speech start/stop in the audio stream (see Pipecat's docs for settings)
- Buffers audio during speech and segments it when speech ends
- Calls the Parakeet transcriber endpoint with `batch_size = 1` for each segment

Why separate the VAD from transcription? This architecture enables independent autoscaling and better GPU utilization. Multiple VAD segmenters (cheap CPU) can feed a smaller pool of GPU transcribers, so GPUs only run when there's actual speech to transcribe.
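
The buffer-then-segment behaviour can be sketched as a small state machine. This is a simplified illustration, not the logic in `parakeet/vad_segmenter.py`: `is_speech` is a hypothetical per-frame detector standing in for Silero VAD, and `hangover_frames` is an assumed parameter for how many consecutive non-speech frames end a segment.

```python
from typing import Callable, Iterable, Iterator, List


def segment_speech(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    hangover_frames: int = 3,
) -> Iterator[bytes]:
    """Yield one bytes object per utterance found in a stream of audio frames."""
    buffer: List[bytes] = []
    silence_run = 0      # consecutive non-speech frames since the last speech frame
    in_speech = False
    for frame in frames:
        if is_speech(frame):
            in_speech = True
            silence_run = 0
            buffer.append(frame)
        elif in_speech:
            silence_run += 1
            buffer.append(frame)  # keep short pauses inside the segment
            if silence_run >= hangover_frames:
                # Speech has ended: emit the segment without the trailing silence.
                yield b"".join(buffer[:len(buffer) - silence_run])
                buffer, silence_run, in_speech = [], 0, False
    if in_speech:
        # The stream ended mid-utterance: flush what was buffered.
        yield b"".join(buffer[:len(buffer) - silence_run])
```

Each yielded segment is what would be sent to the GPU transcriber as a single-item batch, so silence never reaches the GPU.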

3. Native Streaming Transcription (`parakeet/parakeet_streaming.py`)

The newest approach uses NVIDIA's Parakeet Realtime model (nvidia/parakeet_realtime_eou_120m-v1) with native streaming support:

Audio Stream → Parakeet Realtime (GPU) → Transcription

Key features:

- No VAD required: the model processes audio chunks directly as they arrive
- Uses `NemoStreamingASRService` with built-in end-of-utterance (EOU) detection
- Processes audio in 80ms chunks for low-latency transcription
- Single GPU-based service handles both audio ingestion and transcription
- WebSocket-based streaming interface

This approach offers the lowest latency and the simplest architecture, since everything runs in one place, but it requires a GPU for the entire audio stream, not just during speech.
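
To show what 80ms chunking means in bytes, here is a small sketch that re-slices arbitrarily sized incoming messages into fixed chunks. The sample rate (16 kHz) and sample format (16-bit mono PCM) are assumptions made for the arithmetic, not values taken from this repository, and `Chunker` is a hypothetical helper.

```python
from typing import List

SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz audio
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM
CHUNK_MS = 80            # chunk duration quoted above

# 16000 samples/s * 0.080 s = 1280 samples, i.e. 2560 bytes per chunk
CHUNK_BYTES = SAMPLE_RATE_HZ * CHUNK_MS // 1000 * BYTES_PER_SAMPLE


class Chunker:
    """Accumulate incoming bytes and release them in fixed-size chunks."""

    def __init__(self, chunk_bytes: int = CHUNK_BYTES) -> None:
        self._chunk_bytes = chunk_bytes
        self._pending = bytearray()

    def feed(self, data: bytes) -> List[bytes]:
        """Add data and return every complete chunk now available."""
        self._pending.extend(data)
        chunks: List[bytes] = []
        while len(self._pending) >= self._chunk_bytes:
            chunks.append(bytes(self._pending[:self._chunk_bytes]))
            del self._pending[:self._chunk_bytes]
        return chunks
```

Leftover bytes stay in the chunker until the next message arrives, so chunk boundaries do not depend on how the client splits its WebSocket messages.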

4. Multi-Speaker Native Streaming (`parakeet/parakeet_multitalker.py`)

The most advanced approach combines real-time speaker diarization with multi-talker ASR for streaming transcription with speaker labels:

Audio Stream → Sortformer Diarization + Multi-talker Parakeet (GPU) → Speaker-tagged Transcription

Key features:

- Multi-speaker support: automatically separates and transcribes up to 4 concurrent speakers
- Uses NVIDIA's `multitalker-parakeet-streaming-0.6b-v1` model with `diar_streaming_sortformer_4spk-v2.1` diarization
- Cache-aware buffering: audio buffering that aligns with the model's cache requirements
- Processes audio in 80ms frames with a 13-frame buffer
- WebSocket-based streaming interface with speaker-tagged output
- Single GPU-based service handles diarization and multi-speaker transcription simultaneously

This approach is ideal for scenarios with multiple speakers (meetings, conversations, interviews) where you need to know "who said what" in real time.
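
As an illustration of consuming speaker-tagged output, the sketch below merges a stream of tagged words into speaker turns. The `(speaker_id, word)` tuple format is a hypothetical shape chosen for the example; the actual message format sent over the WebSocket may differ.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def group_turns(tagged_words: Iterable[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Merge consecutive words from the same speaker into one turn."""
    return [
        (speaker, " ".join(word for _, word in group))
        for speaker, group in groupby(tagged_words, key=lambda item: item[0])
    ]


words = [(0, "hello"), (0, "there"), (1, "hi"), (0, "how"), (0, "are"), (0, "you")]
turns = group_turns(words)
```

Here `turns` holds three entries: speaker 0 saying "hello there", speaker 1 saying "hi", and speaker 0 again saying "how are you".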

Deploy

# 1. Batch transcription
modal deploy -m parakeet.parakeet

# 2. Streaming with VAD segmentation
modal deploy -m parakeet.vad_segmenter

# 3. Native streaming transcription (Parakeet Realtime)
modal deploy -m parakeet.parakeet_streaming

# 4. Multi-speaker native streaming transcription
modal deploy -m parakeet.parakeet_multitalker

Frontend

The parakeet-frontend/ directory contains a simple web interface for testing streaming transcription via WebSocket.

When you deploy any streaming version:

VAD segmentation (`vad_segmenter`): the frontend URL is printed to the console in this format:
```
https://{workspace}-{environment}--silero-vad-segmenter-webserver-web.modal.run
```

Native streaming (`parakeet_streaming`): the frontend URL is printed to the console in this format:

https://{workspace}-{environment}--parakeet-streaming-transcription-{shorten-id}.modal.run

Multi-speaker streaming (`parakeet_multitalker`): the frontend URL is printed to the console in this format:
```
https://{workspace}-{environment}--parakeet-multitalker-webserver-web.modal.run
```
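
If you want to construct these URLs in a script instead of copying them from the console, a small helper following the patterns above might look like this. `frontend_url` is a hypothetical function, and the workspace and environment values in the usage line are placeholders; the URL printed at deploy time remains the authoritative value.

```python
def frontend_url(workspace: str, environment: str, label: str) -> str:
    """Fill in the URL pattern shown above for a given deployment label."""
    return f"https://{workspace}-{environment}--{label}.modal.run"


# Placeholder workspace and environment names, for illustration only.
vad_url = frontend_url("my-workspace", "dev", "silero-vad-segmenter-webserver-web")
```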
