NVIDIA Blackwell Dominates MLPerf Training v6.0 Benchmarks

NVIDIA has once again raised the bar in AI performance, sweeping MLPerf Training v6.0, the latest round of the industry-standard benchmark suite for AI model training. The company reported the fastest time-to-train on every benchmark, showcasing how its Blackwell GPUs and Grace CPUs scale to hyperscale workloads.

One standout result was DeepSeek-V3, a 671-billion-parameter mixture-of-experts (MoE) model, trained in just 2.02 minutes on an 8,192-GPU cluster. The result underscores NVIDIA's dominance in AI training, particularly for large-scale generative AI models that demand immense computational power.

Key Results and Metrics

MLPerf Training v6.0 introduced new benchmarks, including DeepSeek-V3 and GPT-OSS-20B, reflecting evolving trends in AI. NVIDIA was the only platform to submit results for every test, further solidifying its leadership position. Highlights include:

DeepSeek-V3 (671B): Trained in 2.02 minutes with 8,192 GPUs using the GB300 NVL72 platform.
GPT-OSS-20B: Trained in 7.43 minutes with 512 GPUs.
Llama 3.1 (405B): Trained in 7.07 minutes with 8,192 GPUs.

These results demonstrate the scalability of NVIDIA's hardware and software stack, which includes advanced networking solutions like NVLink and Spectrum-X Ethernet to ensure high-speed communication across thousands of processors.
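To put the headline times in context, it can help to multiply each time-to-train by its GPU count. The short Python sketch below does that arithmetic for the three results listed above. GPU-minutes is a derived figure used here for illustration, not a metric MLPerf reports, and the function names are hypothetical.

```python
# Figures copied from the results listed above: (minutes to train, GPU count).
results = {
    "DeepSeek-V3 (671B)": (2.02, 8192),
    "GPT-OSS-20B": (7.43, 512),
    "Llama 3.1 (405B)": (7.07, 8192),
}


def gpu_minutes(minutes, gpus):
    """Aggregate accelerator time: wall-clock minutes times GPU count."""
    return minutes * gpus


def summarize(results):
    """Map each benchmark name to its total GPU-minutes, rounded to 2 places."""
    return {
        name: round(gpu_minutes(minutes, gpus), 2)
        for name, (minutes, gpus) in results.items()
    }


if __name__ == "__main__":
    for name, total in summarize(results).items():
        print(f"{name}: {total:,.2f} GPU-minutes ({total / 60:,.1f} GPU-hours)")
```

By this measure the Llama 3.1 run consumed about 3.5 times the GPU-minutes of the DeepSeek-V3 run on the same number of GPUs. The benchmarks differ in model, dataset, and quality target, so the totals describe resource use per run and are not a like-for-like speed comparison.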

Fueling Performance with Full-Stack Co-Design

NVIDIA's success isn't just about hardware. The company's software stack played a critical role in achieving these record-breaking results. Innovations included:

Full-iteration CUDA graphs for token-dropless MoEs, eliminating CPU-GPU synchronization delays.
Kernel fusions enabled by the CuTe DSL, which reduced memory bottlenecks and improved efficiency.
MXFP8 attention blocks, which lower the cost of attention computation by running it at reduced precision while maintaining model quality.

These optimizations delivered not just higher speeds but also better utilization of GPU resources, reducing overall training costs for enterprises.
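The article does not describe how MXFP8 is applied inside attention, but the format itself is publicly specified. The Python sketch below illustrates the general idea of microscaling as defined in the OCP Microscaling specification: a block of 32 values shares one power-of-two scale, and each value is stored as an 8-bit float. It assumes the E4M3 element format, all function names are hypothetical, and it is a numerical illustration only, not NVIDIA's kernel or the Transformer Engine API.

```python
import math

BLOCK_SIZE = 32      # MX formats share one scale per block of 32 elements
E4M3_MAX = 448.0     # largest finite E4M3 value (1.75 * 2**8)
E4M3_EMAX = 8        # exponent of the largest E4M3 binade
E4M3_EMIN = -6       # exponent of the smallest normal E4M3 value
MANTISSA_BITS = 3


def quantize_e4m3(x):
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp returns a = m * 2**ex with 0.5 <= m < 1, so floor(log2(a)) = ex - 1.
    exponent = max(math.frexp(a)[1] - 1, E4M3_EMIN)
    step = 2.0 ** (exponent - MANTISSA_BITS)   # spacing of values in this binade
    return sign * round(a / step) * step       # round() is round-half-to-even


def quantize_block(values):
    """Return (shared_exponent, elements) for one block of up to 32 floats."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Pick the scale so the largest magnitude lands in the top E4M3 binade.
    shared_exp = (math.frexp(amax)[1] - 1) - E4M3_EMAX
    shared_exp = max(-127, min(127, shared_exp))   # E8M0 scale range
    scale = 2.0 ** shared_exp
    return shared_exp, [quantize_e4m3(v / scale) for v in values]


def dequantize_block(shared_exp, elements):
    """Recover approximate floats from the shared exponent and the elements."""
    scale = 2.0 ** shared_exp
    return [e * scale for e in elements]


if __name__ == "__main__":
    block = [0.1 * i for i in range(BLOCK_SIZE)]
    shared_exp, elems = quantize_block(block)
    restored = dequantize_block(shared_exp, elems)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"shared exponent: {shared_exp}, worst absolute error: {worst:.4f}")
```

Because the scale is a power of two chosen per block, small blocks of values with similar magnitude keep most of their relative precision, which is the property that makes 8-bit storage workable for training. Values whose scaled magnitude exceeds 448 saturate in this sketch.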

Why MLPerf Results Matter

MLPerf benchmarks, developed by the MLCommons consortium, have become the gold standard for measuring AI training performance. Each result records the time a system takes to train a model to a defined quality target. For enterprises, these results can inform procurement decisions and infrastructure strategies. As generative AI models grow in size and complexity, the ability to train them quickly and efficiently has become a competitive advantage.

NVIDIA's achievements in MLPerf Training v6.0 come against the backdrop of fierce competition with other AI chipmakers and cloud providers. While CoreWeave claimed the fastest closed-division result for available-cloud configurations, NVIDIA's hardware and software demonstrated unmatched consistency across every test, making it the go-to choice for hyperscalers and AI startups alike.

Looking Ahead

NVIDIA's focus on full-stack innovation ensures a continuous trajectory of performance improvements. The latest advancements in its Megatron Core and Transformer Engine libraries highlight the company's ability to deliver significant gains through software updates, even on existing hardware. This positions NVIDIA well as enterprises scale up their AI ambitions.

For developers, hyperscalers, and enterprises, the MLPerf Training v6.0 results reaffirm NVIDIA's dominance in the AI training space. With proven scalability on clusters as large as 8,192 GPUs, the platform is well equipped to handle the next generation of AI workloads, cutting benchmark training runs to mere minutes.
