NVIDIA Releases Open-Source ASR for Low-Latency Speech

The demand for truly interactive and natural conversational AI has intensified the search for speech recognition technology that can operate in real time without the frustrating delays that break the flow of human-computer interaction. NVIDIA has now entered this critical arena with Nemotron Speech, a new open-source Automatic Speech Recognition (ASR) model meticulously engineered to address the core challenges of low-latency applications. This model is positioned as a foundational component for next-generation systems like responsive voice agents, live captioning services, and other interactive platforms where immediate and precise transcription is non-negotiable. Available to the developer community on Hugging Face, this tool is optimized for high performance on modern NVIDIA GPUs, adeptly handling both streaming and batch processing. Its primary value lies in its remarkable ability to sustain high-accuracy transcriptions while maintaining exceptionally low and, most importantly, stable latency, even when subjected to the strain of high concurrent user loads.

A New Architecture for Real-Time Speed

At the heart of Nemotron Speech lies a sophisticated 600-million-parameter architecture, which integrates a cache-aware FastConformer encoder with a Recurrent Neural Network Transducer (RNNT) decoder. The encoder, composed of 24 distinct layers, provides a deep and powerful capacity for extracting complex audio features. A pivotal design element enabling its impressive speed is the encoder’s implementation of aggressive 8x convolutional downsampling. This technique drastically reduces the number of time steps the model must process for any given audio segment, which directly translates into significantly lower computational overhead and more efficient memory utilization. These efficiencies are fundamental to achieving the performance required for real-time streaming applications. Standardized to process 16 kHz mono audio, the model operates on a chunk-based system, requiring a minimum of just 80 milliseconds of audio to begin its transcription process, ensuring a rapid response from the very first utterance.
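
To make the frame arithmetic concrete, the short sketch below works through the numbers, assuming the 10 ms mel-spectrogram hop that FastConformer-style front ends typically use (an assumption, since the exact preprocessing settings are not spelled out here): 80 ms of 16 kHz audio is 1,280 samples and, after 8x downsampling, collapses into a single encoder time step.

```python
# Frame arithmetic for the streaming front end. The 10 ms feature hop is an
# assumption based on typical FastConformer preprocessing, not a confirmed
# detail of the Nemotron Speech release.
SAMPLE_RATE = 16_000   # Hz, 16 kHz mono input
FEATURE_HOP_MS = 10    # assumed mel-spectrogram hop
DOWNSAMPLING = 8       # 8x convolutional downsampling in the encoder

def encoder_steps(chunk_ms: int) -> float:
    """Encoder time steps produced by one audio chunk."""
    return (chunk_ms / FEATURE_HOP_MS) / DOWNSAMPLING

for chunk_ms in (80, 160, 560, 1120):
    samples = chunk_ms * SAMPLE_RATE // 1000
    print(f"{chunk_ms:>4} ms chunk = {samples:>6} samples "
          f"-> {encoder_steps(chunk_ms):g} encoder step(s)")
```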

This advanced architecture was purposefully developed to overcome the well-documented shortcomings of conventional streaming ASR systems. Many existing models employ a buffered, sliding-window approach, a method where each new audio chunk is processed alongside an overlapping segment of the preceding chunk to preserve conversational context. While this technique can be effective for context preservation, it is inherently inefficient due to the redundant computation performed on the overlapping audio frames. This inefficiency leads to wasted GPU cycles and, more critically for user experience, causes latency to drift upwards in an unpredictable manner as the number of concurrent users increases. Such latency drift degrades the perceived quality of the interaction, making it difficult to achieve the seamless, natural dialogue that applications like real-time voice agents demand. Nemotron Speech was designed from the ground up to eliminate this foundational bottleneck.
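
A toy calculation illustrates the scale of that redundancy. The 1.6 s of re-encoded left context in the sketch below is an illustrative assumption rather than a measured figure for any particular system, but it shows how a buffered approach can end up encoding several times more feature frames than the audio actually contains.

```python
# Toy cost comparison: audio re-encoded by a buffered sliding-window system
# versus a system that encodes each frame exactly once. The 1.6 s overlap is
# an illustrative assumption, not a benchmark of any specific model.
CHUNK_MS = 560       # new audio consumed per streaming step
OVERLAP_MS = 1_600   # assumed left context re-encoded on every step

audio_s = 60
steps = audio_s * 1000 / CHUNK_MS

sliding_window_s = steps * (CHUNK_MS + OVERLAP_MS) / 1000  # overlap recomputed each step
cache_aware_s = steps * CHUNK_MS / 1000                    # each frame encoded once

print(f"{audio_s} s of audio -> {sliding_window_s:.0f} s of features encoded (buffered) "
      f"vs {cache_aware_s:.0f} s (single pass), "
      f"a {sliding_window_s / cache_aware_s:.1f}x overhead")
```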

The Cache-Aware Streaming Advantage

In a significant departure from traditional methods, Nemotron Speech introduces a more elegant and efficient “cache-aware” streaming mechanism that systematically eliminates redundant processing. Instead of repeatedly analyzing overlapping audio segments, the model maintains a persistent cache of the encoder’s internal states across all of its self-attention and convolution layers. As a new chunk of audio arrives, it is processed only a single time. The model then intelligently reuses the cached activations from all previous chunks to construct and maintain a complete and accurate contextual understanding of the entire audio stream. This approach ensures that the computational cost remains directly proportional to the length of the audio, without the compounding inefficiencies that plague sliding-window systems. The result is a system that is not only faster but also far more predictable and reliable under real-world conditions where user loads can fluctuate dramatically.
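
Conceptually, the streaming loop looks something like the sketch below. The `encoder_step` and `decoder_step` callables and the cache object are purely illustrative stand-ins for the model's internal cache of self-attention and convolution states; they are not the actual NeMo interface.

```python
# Conceptual sketch of cache-aware streaming; encoder_step, decoder_step and
# EncoderCache are hypothetical stand-ins, not real NeMo APIs.
from dataclasses import dataclass, field

@dataclass
class EncoderCache:
    # Cached self-attention and convolution states, one entry per layer.
    attention_states: list = field(default_factory=list)
    conv_states: list = field(default_factory=list)

def stream_transcribe(audio_chunks, encoder_step, decoder_step) -> str:
    """Feed each chunk through the encoder exactly once, carrying cached
    layer states forward instead of re-encoding overlapping audio."""
    cache = EncoderCache()
    hypothesis = []
    for chunk in audio_chunks:
        # The encoder consumes only the new frames plus the cache, and
        # returns updated states for the next call.
        encoded, cache = encoder_step(chunk, cache)
        # The RNNT decoder emits tokens incrementally for the new frames.
        hypothesis.extend(decoder_step(encoded))
    return "".join(hypothesis)
```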

This innovative design yields three principal advantages that directly address the limitations of older methods and are essential for building high-quality voice applications. First, since every audio frame is processed only once, the computational workload scales in a perfectly linear fashion with the audio’s length, avoiding any unpredictable spikes in processing cost. Second, the memory footprint grows predictably with the sequence length of a single stream, rather than being duplicated and compounded by the number of concurrent streams, making resource management far more reliable. Most importantly, this computational efficiency ensures that latency remains remarkably stable and low, even as hundreds of users interact with the system simultaneously. This stability is the key enabler for natural turn-taking, barge-in capabilities, and seamless interruption handling in modern conversational AI, creating a much more fluid and human-like user experience.

Performance, Flexibility, and Open Access

A key feature of Nemotron Speech is the fine-grained control it offers developers over the model’s performance profile at inference time, without any need for retraining. This flexibility is managed through the att_context_size parameter, which adjusts the left and right attention context in multiples of 80 ms frames. The model exposes four standard configurations, allowing teams to strike the right balance between latency and accuracy for their specific use case. These options range from an ultra-responsive 80 ms chunk size to a highly accurate 1.12 s chunk size. Averaged across the standard test sets of the Hugging Face OpenASR leaderboard, the model demonstrates this trade-off clearly. At a 160 ms chunk size, it achieves a Word Error Rate (WER) of approximately 7.84%, while increasing the chunk size to 560 ms improves the WER to 7.22%. At its most accurate setting of 1.12 s, the WER reaches about 7.16%, showing how larger chunks provide more phonetic context for slightly better accuracy, while smaller chunks deliver lower latency.
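
In NeMo terms, selecting an operating point might look roughly like the sketch below. The model identifier is a placeholder, and the set_default_att_context_size call and the [70, 1] context values are assumptions carried over from NeMo's existing cache-aware FastConformer models rather than confirmed details of this release; real-time use would go through NeMo's cache-aware streaming inference utilities rather than offline transcription.

```python
# Hedged sketch using the NeMo toolkit (pip install "nemo_toolkit[asr]").
# The model ID is a placeholder, and the attention-context values are assumed
# from NeMo's existing cache-aware FastConformer models.
import nemo.collections.asr as nemo_asr

MODEL_ID = "nvidia/..."  # placeholder: substitute the actual Hugging Face model ID
model = nemo_asr.models.ASRModel.from_pretrained(MODEL_ID)

# Cache-aware encoders expose a [left, right] attention context measured in
# 80 ms frames; a right context of 1 frame would correspond to the 160 ms
# operating point described above (assumed values).
model.encoder.set_default_att_context_size([70, 1])

# Offline sanity check on a 16 kHz mono file.
print(model.transcribe(["sample_16khz_mono.wav"]))
```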

The impact of the cache-aware design on throughput and concurrency is substantial, enabling unprecedented scalability. On an NVIDIA #00 GPU, Nemotron Speech can process approximately 560 concurrent streams at a 320 ms chunk size, which represents a threefold increase in capacity compared to a baseline sliding-window system operating at the same latency. The performance advantages are even more pronounced on other hardware, with tests showing over a fivefold increase in concurrency on an RTX A5000 and up to a twofold increase on a DGX B200. This raw performance has been validated in real-world scenarios. A test involving 127 concurrent WebSocket clients demonstrated that the system maintained a median end-to-end delay of around 182 ms in its 560 ms mode, with no observable latency drift over extended sessions. This result confirms the model’s suitability for long-running, demanding conversational applications where consistent performance is paramount.
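
For teams reproducing this kind of measurement, a minimal probe might look like the sketch below. The endpoint URL, the raw PCM framing, and the JSON reply format are all assumptions for illustration; they do not describe NVIDIA's test harness or any published serving API.

```python
# Minimal latency probe against a hypothetical streaming ASR WebSocket
# endpoint. The URL, audio framing, and response format are assumptions.
import asyncio
import json
import time

import websockets  # third-party package: pip install websockets

CHUNK_MS = 560
SAMPLE_RATE = 16_000
BYTES_PER_CHUNK = SAMPLE_RATE * 2 * CHUNK_MS // 1000  # 16-bit mono PCM

async def probe(url: str, pcm: bytes) -> list[float]:
    """Send audio in real-time-sized chunks and record per-chunk round trips."""
    delays = []
    async with websockets.connect(url) as ws:
        for i in range(0, len(pcm), BYTES_PER_CHUNK):
            sent_at = time.monotonic()
            await ws.send(pcm[i:i + BYTES_PER_CHUNK])
            json.loads(await ws.recv())                  # assumed JSON partial result
            delays.append((time.monotonic() - sent_at) * 1000)
            await asyncio.sleep(CHUNK_MS / 1000)         # pace like a live microphone
    return delays

# delays = asyncio.run(probe("ws://localhost:8080/asr", pcm_bytes))
# print(sorted(delays)[len(delays) // 2], "ms median round trip")
```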

Fostering a New Generation of Conversational AI

The robustness of Nemotron Speech is rooted in its extensive training and its open-source availability, which together encourage broad community adoption and further development. The model was trained on a massive dataset comprising approximately 285,000 hours of audio. The primary source was the English portion of NVIDIA’s Granary dataset, supplemented with a diverse mix of public speech corpora, including data from YouTube Commons, Fisher, Switchboard, and multiple releases of Mozilla Common Voice. The training labels were a combination of human-generated and ASR-generated transcripts, allowing the model to learn from a vast and varied set of speech patterns, accents, and acoustic environments. Because the model is released under the permissive NVIDIA Open Model License, with open access to the weights and training details, development teams can self-host it, fine-tune it on their own data, and profile it in depth to optimize it for their specific low-latency voice applications.

Ultimately, the release of Nemotron Speech marks a pivotal moment for the development of advanced voice interfaces. The ASR model is positioned as a single, highly optimized component within a complete voice-to-voice agent pipeline. In a reference stack that combines Nemotron Speech with the Nemotron 3 Nano 30B language model and Magpie Text-to-Speech (TTS), the median time to final transcription was a mere 24 milliseconds, and total server-side voice-to-voice latency was around 500 milliseconds on an RTX 5090, figures that highlight how the ASR component accounts for only a small fraction of the total latency budget. That level of optimization establishes the model not just as a tool but as a foundational element for building the next generation of truly interactive, responsive conversational AI systems.
