NVIDIA's NVFP4 Boosts JAX Model Training on Blackwell GPUs
NVIDIA has unveiled NVFP4, a mixed-precision format designed to accelerate large-scale model training on its Grace Blackwell GPUs. By using 4-bit precision, NVFP4 delivers up to 73% higher throughput than the FP8 baseline when pretraining large language models (LLMs), according to data released on June 8, 2026. That lets AI teams train larger models in less time, with training loss that closely tracks the FP8 baseline.
JAX, a high-performance library popular for machine learning workflows, plays a central role in this breakthrough. NVIDIA integrated NVFP4 into its TransformerEngine and MaxText frameworks, enabling scalable LLM pretraining on Blackwell hardware. Max Xu, the author of the announcement, highlighted that NVFP4 can handle the trillions of tokens and thousands of accelerators involved in modern AI training with unprecedented efficiency.
How NVFP4 Speeds Up Training
The NVFP4 format employs innovative techniques to preserve accuracy while pushing precision boundaries. Key features include:
- Micro block scaling: Each small block of 16 elements gets its own scale factor, so an outlier value distorts only its own block rather than the whole tensor.
- Random Hadamard Transform: Gaussianizes weight gradients to minimize noise during quantization.
- 2D weight scaling: Ensures consistent values across transposed gradients and forward propagation.
- Stochastic rounding: Prevents small updates from being lost due to rounding errors.
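Two of these ideas, per-block scaling and stochastic rounding, can be illustrated with a minimal JAX sketch. It is not NVIDIA's implementation: it "fake-quantizes" values in float32 onto the eight magnitudes a 4-bit E2M1 float can represent, and it keeps the per-block scales in float32 for simplicity. The function name `quantize_blockwise` and the toy input are hypothetical.

```python
import jax
import jax.numpy as jnp

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
FP4_GRID = jnp.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 16  # block size named in the article


def quantize_blockwise(x, key=None):
    """Fake-quantize a 1-D array (length divisible by 16) onto the FP4 grid.

    Each 16-element block gets its own scale. With key=None values are
    rounded to the nearest grid point; with a PRNG key they are rounded
    stochastically. Scales stay in float32 here (a simplification).
    """
    blocks = x.reshape(-1, BLOCK)
    amax = jnp.max(jnp.abs(blocks), axis=1, keepdims=True)
    # Map the largest magnitude in each block to the largest grid value.
    scale = jnp.where(amax > 0, amax / FP4_GRID[-1], 1.0)
    mag = jnp.abs(blocks) / scale

    # Locate the two grid points that bracket each scaled magnitude.
    hi_idx = jnp.clip(jnp.searchsorted(FP4_GRID, mag, side="left"),
                      1, FP4_GRID.size - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    frac = (mag - lo) / (hi - lo)  # position between lo (0) and hi (1)

    if key is None:
        round_up = frac >= 0.5
    else:
        # Round up with probability equal to the fractional position,
        # so the result is correct on average.
        round_up = jax.random.uniform(key, mag.shape) < frac

    q = jnp.where(round_up, hi, lo)
    return (jnp.sign(blocks) * q * scale).reshape(x.shape)


# Toy input: every block holds one large value (6.0) and fifteen small ones (0.1).
x = jnp.tile(jnp.array([6.0] + [0.1] * 15), 4096)
nearest = quantize_blockwise(x)
stochastic = quantize_blockwise(x, jax.random.PRNGKey(0))
small = stochastic.reshape(-1, BLOCK)[:, 1:]
print("nearest rounding of 0.1:", float(nearest[1]))
print("stochastic rounding, mean of the 0.1 entries:", float(small.mean()))
```

In this toy case round-to-nearest maps every 0.1 to zero, which is the "lost small update" problem, while stochastic rounding yields 0 or 0.5 per element and stays close to 0.1 on average.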
These techniques matter most in the feed-forward layers of transformers, the computational bottleneck in most LLMs, where NVFP4 replaces FP8. The GEMMs (general matrix multiplications) in these layers are quantized to NVFP4, which significantly reduces their compute cost, while attention is kept in higher precision to limit quantization noise.
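Why a Random Hadamard Transform can be wrapped around a GEMM without changing its result can be shown in a few lines of JAX. The sketch below is illustrative only and makes several assumptions: a single 16-wide contraction dimension, a Sylvester-constructed Hadamard matrix with random row signs, and hypothetical names (`hadamard`, `random_hadamard`). Exactly where NVIDIA's recipe applies the transform is not described here beyond the weight gradients.

```python
import jax
import jax.numpy as jnp


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = jnp.ones((1, 1))
    h2 = jnp.array([[1.0, 1.0], [1.0, -1.0]])
    while h.shape[0] < n:
        h = jnp.kron(h2, h)
    return h


def random_hadamard(key, n=16):
    """Orthogonal matrix: random +/-1 row signs times a normalized Hadamard matrix."""
    signs = jax.random.rademacher(key, (n,)).astype(jnp.float32)
    return signs[:, None] * hadamard(n) / jnp.sqrt(n)


k_r, k_a, k_b = jax.random.split(jax.random.PRNGKey(0), 3)
R = random_hadamard(k_r)

# A GEMM that contracts over a 16-wide dimension.
A = jax.random.normal(k_a, (4, 16))
B = jax.random.normal(k_b, (16, 8))

# Rotate both operands along the contraction dimension. Because R @ R.T = I,
# (A @ R) @ (R.T @ B) equals A @ B in exact arithmetic.
A_rot, B_rot = A @ R, R.T @ B
print("max deviation from A @ B:", float(jnp.max(jnp.abs(A_rot @ B_rot - A @ B))))

# A single outlier is spread evenly across all 16 positions after rotation.
g = jnp.zeros(16).at[3].set(8.0)
print("largest magnitude before:", float(jnp.max(jnp.abs(g))))
print("largest magnitude after: ", float(jnp.max(jnp.abs(g @ R))))
```

The rotated operands have a flatter, more Gaussian-like value distribution, so quantizing them to a coarse 4-bit grid loses less information, while the product of the two rotated operands is mathematically the same as the original GEMM.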
Performance Gains
Benchmarks using the Llama 3 series models illustrate NVFP4’s efficiency. For Llama 3.1 (405 billion parameters), training on NVIDIA's GB300 Grace Blackwell Ultra Superchip achieved a 1.73x speedup versus FP8. Per-GPU throughput rose from 2,103 TFLOP/s (FP8) to 3,633 TFLOP/s (NVFP4), showing how much more of the hardware's capability the format puts to use.
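The headline figures follow directly from the two throughput numbers. A short Python check, using only the values quoted above, also shows what the speedup means for wall-clock time on a fixed amount of work, assuming training time scales inversely with throughput:

```python
fp8_tflops = 2103.0    # per-GPU throughput with FP8, from the article
nvfp4_tflops = 3633.0  # per-GPU throughput with NVFP4, from the article

speedup = nvfp4_tflops / fp8_tflops
time_saved = 1.0 - 1.0 / speedup  # fraction of wall-clock time saved for fixed work

print(f"speedup: {speedup:.2f}x")                 # speedup: 1.73x
print(f"throughput gain: {speedup - 1.0:.0%}")    # throughput gain: 73%
print(f"time saved: {time_saved:.0%}")            # time saved: 42%
```

Note that 73% higher throughput corresponds to roughly 42% less training time, not 73% less.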
NVIDIA also demonstrated that these gains come with negligible accuracy loss. Training loss for Llama 3 8B models followed nearly identical curves across 10,000 steps, with a difference of just 0.026 nats in the converged loss. This stability makes NVFP4 a compelling option for production-scale AI systems, where cost and time savings are critical.
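To put 0.026 nats in perspective, the gap can be converted into other common units. The snippet below assumes the reported loss is a mean per-token cross-entropy in nats, in which case perplexity is the exponential of the loss; that interpretation is an assumption, not something the announcement spells out.

```python
import math

loss_gap_nats = 0.026  # difference in converged loss, from the article

# Perplexity = exp(loss), so a loss gap becomes a multiplicative perplexity ratio.
perplexity_ratio = math.exp(loss_gap_nats)
# One nat equals 1 / ln(2) bits.
loss_gap_bits = loss_gap_nats / math.log(2)

print(f"perplexity ratio: {perplexity_ratio:.4f}")  # perplexity ratio: 1.0263
print(f"gap in bits per token: {loss_gap_bits:.4f}")
```

Under that assumption, the two runs differ by about 2.6% in perplexity.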
Why It Matters for the AI Ecosystem
JAX, already favored for its scalability and just-in-time (JIT) compilation, benefits significantly from NVFP4 integration. NVIDIA’s release aligns with broader trends in the AI training ecosystem, where efficiency per GPU hour is increasingly prioritized. For example, earlier in 2026, NVIDIA reported long-context training speedups for JAX workloads using NVSHMEM inside XLA, and new JAX-based libraries like jNO are expanding its applications in neural operators and foundation model training.
The NVFP4 update positions NVIDIA and JAX to remain competitive against alternatives like PyTorch or custom solutions, such as xAI’s proprietary C-based stack, which recently claimed higher GPU efficiency. As AI research budgets grow but remain finite, innovations like NVFP4 will likely drive adoption of frameworks and hardware that maximize return on compute investment.
Getting Started
The NVFP4 training recipe is available through the MaxText framework on the NVIDIA JAX Toolbox GitHub repository. Developers can experiment with two NVFP4 modes—one with Random Hadamard Transform (RHT) for improved convergence and another without RHT for minimal overhead. The public NVIDIA MaxText container ghcr.io/nvidia/jax:maxtext includes all necessary libraries to begin training on Blackwell GPUs.
For teams exploring cost-efficient large model training, NVFP4 is a strong option. It raises throughput while keeping training loss close to the FP8 baseline, strengthening the position of both NVIDIA and JAX in AI infrastructure.