Mistral AI Launches Voxtral Transcribe 2 With Sub-200ms Latency

Caroline Bishop   Feb 04, 2026 16:53


Mistral AI dropped its second-generation speech-to-text models on February 4, 2026, with Voxtral Transcribe 2 delivering what the company claims is state-of-the-art transcription at a fraction of competitor pricing. The headline number: $0.003 per minute for batch processing, roughly one-fifth the cost of ElevenLabs' Scribe v2.

The release includes two models serving different use cases. Voxtral Mini Transcribe V2 handles batch jobs with speaker diarization and word-level timestamps. Voxtral Realtime targets live applications with latency configurable down to sub-200 milliseconds—fast enough for voice agents that don't feel sluggish.

Performance Claims Stack Up Against Big Names

Mistral's benchmarks show approximately 4% word error rate on FLEURS, outperforming GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova on accuracy. The company also claims 3x faster processing than ElevenLabs' Scribe v2 while matching quality.

At a 2.4-second delay, suitable for live subtitling, Realtime matches the batch model's accuracy. Drop to 480 ms and you're looking at an additional 1-2% word error rate, which Mistral positions as acceptable for conversational AI applications.

Open Weights Change the Deployment Math

Voxtral Realtime ships under Apache 2.0, meaning enterprises can deploy on-premise without API calls. With a 4B parameter footprint, the model runs on edge devices—a significant consideration for healthcare, finance, and other sectors where audio data can't leave internal infrastructure.

Both models support GDPR and HIPAA-compliant deployments, addressing the compliance headaches that have slowed enterprise AI adoption.

What's Actually New Since July 2025

When Mistral launched the original Voxtral in July 2025, speaker diarization was notably absent—the company was actively seeking design partners for the feature. This release delivers on that promise, adding precise speaker attribution with start/end times. Context biasing is another addition, letting users feed up to 100 domain-specific terms to improve accuracy on proper nouns and technical vocabulary.
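To make those features concrete, here is a minimal sketch of what a batch transcription request with diarization and context biasing might look like against Mistral's REST API. The endpoint path, model identifier, and the "diarize", "timestamp_granularities", and "context_bias" field names are illustrative assumptions, not documented parameters; check Mistral's API reference for the actual names.

```python
# Hypothetical sketch: batch transcription with speaker diarization and
# context biasing. Endpoint path, model id, and field names are assumptions.
import os
import requests

API_KEY = os.environ["MISTRAL_API_KEY"]

# Domain-specific terms to bias recognition toward (the release supports up to 100).
BIAS_TERMS = ["Voxtral", "diarization", "FLEURS", "myocardial infarction"]

with open("earnings_call.mp3", "rb") as audio:
    resp = requests.post(
        "https://api.mistral.ai/v1/audio/transcriptions",  # assumed endpoint
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": ("earnings_call.mp3", audio, "audio/mpeg")},
        data={
            "model": "voxtral-mini-transcribe-v2",  # assumed model id
            "diarize": "true",                      # assumed speaker-attribution flag
            "timestamp_granularities": "word",      # assumed word-level timestamps
            "context_bias": ",".join(BIAS_TERMS),   # assumed biasing parameter
        },
        timeout=600,
    )
resp.raise_for_status()
result = resp.json()

# Assumed response shape: segments carrying speaker labels and start/end times.
for seg in result.get("segments", []):
    print(f'{seg.get("speaker", "?")} [{seg.get("start")}-{seg.get("end")}]: {seg.get("text")}')
```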

Language support expanded to 13 languages including Chinese, Hindi, Arabic, Japanese, and Korean. Audio file limits jumped to 3 hours per request, up from the original 30-40 minute caps.

Enterprise Implications

The pricing structure creates interesting dynamics for contact centers and meeting intelligence platforms currently paying premium rates for transcription APIs. At $0.003/min, transcribing a million minutes of audio runs $3,000—a number that makes previously cost-prohibitive use cases suddenly viable.

Voxtral Realtime's $0.006/min pricing for streaming applications still undercuts most competitors while enabling real-time sentiment analysis and live agent assist features.
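The volume math is simple enough to sanity-check in a few lines; the sketch below uses only the per-minute rates quoted above.

```python
# Back-of-envelope transcription spend at the published per-minute rates.
BATCH_RATE = 0.003     # $ per audio minute, Voxtral Mini Transcribe V2 (batch)
REALTIME_RATE = 0.006  # $ per audio minute, Voxtral Realtime (streaming)

def monthly_cost(minutes: int, rate: float) -> float:
    """Transcription cost for a given monthly audio volume."""
    return minutes * rate

# A contact center processing 1,000,000 minutes of audio per month:
print(f"Batch:    ${monthly_cost(1_000_000, BATCH_RATE):,.0f}")    # $3,000
print(f"Realtime: ${monthly_cost(1_000_000, REALTIME_RATE):,.0f}")  # $6,000
```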

Developers can test both models immediately through Mistral Studio's new audio playground or grab Realtime weights directly from Hugging Face.
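For teams planning an on-premise deployment under the Apache 2.0 license, pulling the open checkpoint is a one-liner with huggingface_hub. The repository id below is a placeholder, since the exact repo name isn't given here; look it up under Mistral's Hugging Face organization.

```python
# Minimal sketch: fetching the open Voxtral Realtime weights for self-hosting.
# The repo_id is a placeholder, not a confirmed repository name.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="mistralai/Voxtral-Realtime")  # placeholder id
print(f"Weights downloaded to: {local_dir}")
```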
