Together AI Claims Fastest Speech-to-Text Stack with Parakeet v3
Together AI has announced what it claims is the world's fastest automatic speech recognition (ASR), or speech-to-text, stack, capable of transcribing 20 hours of speech in under 10 seconds. The stack serves NVIDIA’s Parakeet-TDT 0.6B v3 and OpenAI’s Whisper Large v3, both optimized for low-latency, high-throughput applications. The development could significantly advance real-time voice AI systems, a key area of focus for the company as it scales its infrastructure.
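The headline figure can be sanity-checked with simple arithmetic. The sketch below converts the claimed numbers into a speed-up over real time, a quantity ASR benchmarks often report as an inverse real-time factor. The function name is illustrative, and because the claim is "under 10 seconds", the result is a lower bound rather than a measured value.

```python
def speedup_over_real_time(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Figures taken from the announcement: 20 hours of audio, 10 seconds of work.
audio_seconds = 20 * 3600          # 72,000 seconds of speech
claimed_wall_seconds = 10          # upper bound quoted in the claim

print(speedup_over_real_time(audio_seconds, claimed_wall_seconds))  # 7200.0
```

In other words, the claim implies processing audio at least 7,200 times faster than it plays back. The announcement as summarized here does not say how many GPUs or how much batching that figure assumes.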
At the heart of Together AI's claim is its decision to treat ASR as a full-path systems problem rather than focusing solely on GPU inference. This approach targets bottlenecks across preprocessing, GPU execution, memory management, and networking. The company credits optimizations such as TensorRT profile tuning, conditional CUDA graphs, and zero-copy data paths with sharply reducing latency across the stack.
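To make "zero-copy data paths" concrete, here is a minimal sketch of the general idea using only the Python standard library. It is not Together AI's implementation. The helper name `frame_views`, the 16-bit PCM format, and the frame size are all assumptions chosen for illustration. The point is that each audio frame is a view into the original buffer, so handing frames to the next stage moves no bytes.

```python
from typing import Iterator


def frame_views(pcm: bytearray, frame_bytes: int) -> Iterator[memoryview]:
    """Yield fixed-size frames of 16-bit samples as views into `pcm`.

    No audio bytes are copied: every frame shares memory with `pcm`.
    """
    if frame_bytes <= 0 or frame_bytes % 2:
        raise ValueError("frame_bytes must be a positive even number")
    view = memoryview(pcm)
    # A trailing partial frame, if any, is skipped.
    for start in range(0, len(view) - frame_bytes + 1, frame_bytes):
        # Slicing a memoryview creates another view, not a copy;
        # cast("h") reinterprets the bytes as native signed 16-bit samples.
        yield view[start:start + frame_bytes].cast("h")


audio = bytearray(8)                    # two 4-byte frames of silence
frames = list(frame_views(audio, 4))

audio[0] = 1
audio[1] = 1                            # write to the underlying buffer
print(frames[0][0])                     # 257: the view sees the new bytes
```

A copying version would slice the `bytearray` directly, which allocates a new object per frame. At high request rates, avoiding those per-frame allocations and copies is the kind of saving this class of optimization is after.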
One standout optimization targets the decoder loop in Parakeet v3. By moving conditional logic from the CPU to the GPU, Together AI says it eliminated costly CPU-GPU synchronization delays, yielding a 2-3x decoding speedup. Similarly, shared memory and evented I/O for streaming transcription reduce overhead, which helps keep throughput high and jitter low for real-time applications.
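The decoder-loop change is easier to see in a toy model. The sketch below is pure Python and calls no real CUDA API: `ToyDevice`, `BLANK`, and both decode functions are hypothetical stand-ins, and the "model" simply reads a token per frame. It only counts how often the host must wait on the device when the loop's branch runs on the CPU versus when the whole loop runs device-side, which is the effect conditional CUDA graphs are meant to achieve.

```python
from dataclasses import dataclass
from typing import Callable, List

BLANK = 0  # hypothetical blank-token id


@dataclass
class ToyDevice:
    """Stand-in for a GPU that counts host-side waits."""
    syncs: int = 0

    def run_and_sync(self, kernel: Callable[[], object]) -> object:
        # The host launches work, then blocks until the result is readable.
        result = kernel()
        self.syncs += 1
        return result


def decode_host_driven(dev: ToyDevice, frames: List[int]) -> List[int]:
    """The CPU inspects every step's result before deciding what to do next."""
    tokens: List[int] = []
    t = 0
    while t < len(frames):
        tok = dev.run_and_sync(lambda: frames[t])   # one sync per step
        if tok != BLANK:                            # branch evaluated on the host
            tokens.append(tok)
        t += 1
    return tokens


def decode_device_driven(dev: ToyDevice, frames: List[int]) -> List[int]:
    """The loop and its branch both live inside a single launched unit of work."""
    def whole_loop() -> List[int]:
        return [tok for tok in frames if tok != BLANK]
    return dev.run_and_sync(whole_loop)             # one sync in total


frames = [0, 5, 0, 7, 7, 0]
host, device = ToyDevice(), ToyDevice()
print(decode_host_driven(host, frames), host.syncs)        # [5, 7, 7] 6
print(decode_device_driven(device, frames), device.syncs)  # [5, 7, 7] 1
```

Both paths produce the same tokens, but the host-driven loop waits once per step while the device-driven one waits once overall. On real hardware each wait costs a round trip between CPU and GPU, so for a small model with a long decode loop those waits can dominate runtime.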
Parakeet v3, a multilingual ASR model trained on 1.7 million hours of audio, represents a major leap from its predecessor. It now supports 25 European languages, includes automatic language detection, and retains its industry-leading performance for English transcription. Together AI’s platform also integrates Whisper Large v3 for production-scale workloads, creating a robust ecosystem for developers building voice-driven applications.
Addressing Market Needs
This announcement positions Together AI as a serious contender in the ASR market, particularly for real-time and streaming use cases. Unlike traditional ASR systems that rely on siloed pipelines, Together AI offers a modular stack where speech-to-text (STT), natural language understanding (NLU), and text-to-speech (TTS) can operate cohesively on the same infrastructure. This reduces latency and allows developers to inspect and manipulate intermediate outputs, a key differentiator for real-time voice agents.
Recent partnerships highlight the company’s strategy of building an open, composable ecosystem. In April 2026, Deepgram brought its ASR models directly onto Together AI’s platform, letting developers mix and match specialized speech models on Together AI’s infrastructure. This flexibility is increasingly valuable as AI workloads move toward unified architectures that combine speech, language, and multimodal capabilities.
Industry and Investor Impact
Together AI’s advancements come as the company reportedly seeks to raise capital at a $7.5 billion valuation, according to March 2026 reports. Investor interest reflects the growing demand for high-performance inference infrastructure, especially for voice and multimodal AI systems. With over 450,000 developers and 200 open-source models already supported on its platform, Together AI is well-positioned to capitalize on this momentum.
Competitors like Deepgram and Google still dominate segments of the ASR market, but Together AI’s focus on open-model hosting and real-time performance could carve out significant market share. The integration of NVIDIA’s ASR technology further cements its technical credibility, particularly given NVIDIA’s leadership in AI hardware and software optimization.
As voice interfaces become more integral to consumer and enterprise applications, low-latency and scalable ASR solutions like Together AI’s could redefine user expectations. Developers, investors, and enterprises alike should watch closely as the company continues to refine its stack and expand its ecosystem.