NVIDIA Megatron Boosts LLM Training With Muon Optimizer
NVIDIA is pushing the boundaries of large language model (LLM) training by integrating advanced optimizers such as Muon into the Megatron Core framework. According to NVIDIA’s April 22, 2026 blog post, the Muon optimizer, built on higher-order mathematical methods, achieves near-parity training throughput with the widely used AdamW optimizer while improving model quality on large-scale systems such as the NVIDIA GB300 NVL72.
Muon, short for MomentUm Orthogonalized by Newton-Schulz, is a higher-order optimization algorithm that has been instrumental in training leading open-source models such as Kimi K2 and GLM-5. By applying advanced preconditioning to the momentum matrix, the optimizer sustains high FLOPS utilization (the fraction of peak floating-point operations per second actually achieved), a critical metric for computational efficiency in LLM training.
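As a rough illustration (not NVIDIA's implementation), the core Muon step, accumulating momentum and then orthogonalizing it with a few Newton-Schulz iterations before applying the update, can be sketched in NumPy. The quintic iteration coefficients below follow the publicly available Muon reference implementation; the function names and hyperparameter defaults are hypothetical choices for this sketch:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Drive the singular values of G toward 1 with a quintic
    Newton-Schulz iteration (coefficients from the public Muon
    reference implementation), approximating the nearest
    semi-orthogonal matrix without an explicit SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Frobenius-norm scaling guarantees all singular values are <= 1.
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T              # symmetric product: the SYRK-shaped kernel
        B = b * A + c * (A @ A)
        X = a * X + B @ X        # quintic polynomial update in the singular values
    return X.T if transposed else X

def muon_update(param, grad, momentum, lr=0.02, beta=0.95):
    """One Muon step for a 2-D weight matrix: momentum accumulation,
    then an orthogonalized (rather than raw) update direction."""
    momentum = beta * momentum + grad
    param = param - lr * newton_schulz_orthogonalize(momentum)
    return param, momentum
```

Because the iteration uses only matrix multiplies, it runs entirely in GEMM-friendly kernels on the GPU, which is what makes the high FLOPS utilization reported above achievable in the first place.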
Performance Metrics: Muon vs. AdamW
Table 1 from NVIDIA’s report shows that Muon delivers throughput comparable to AdamW on the GB300 NVL72 system. For instance, the Kimi K2 model achieved 1,080 TFLOP/s per GPU with Muon, slightly surpassing AdamW’s 1,051 TFLOP/s per GPU. Similarly, the Qwen3 30B model reached 721 TFLOP/s per GPU with Muon versus 713 TFLOP/s per GPU with AdamW.
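Quick arithmetic on the reported figures makes the near-parity claim concrete:

```python
# Reported training throughput (TFLOP/s per GPU) on the GB300 NVL72,
# taken from the numbers quoted above.
reported = {
    "Kimi K2":   {"muon": 1080, "adamw": 1051},
    "Qwen3 30B": {"muon": 721,  "adamw": 713},
}

for model, r in reported.items():
    delta_pct = (r["muon"] - r["adamw"]) / r["adamw"] * 100
    print(f"{model}: Muon vs. AdamW throughput delta = {delta_pct:+.1f}%")
# Kimi K2: +2.8%; Qwen3 30B: +1.1% -- within a few percent, i.e. near parity.
```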
These results were obtained using NVIDIA NeMo Megatron Bridge 26.02, a PyTorch-native library for pretraining and fine-tuning LLMs. The benchmarks highlight Muon’s ability to handle the computational demands of modern AI workloads without sacrificing efficiency.
Technological Innovations
Scaling Muon to thousands of GPUs presents challenges, including increased computational and memory costs during preconditioning, as well as communication bottlenecks in distributed systems. NVIDIA addresses these hurdles through several innovations:
- Layer-Wise Distributed Optimizer: Full layers of model parameters are distributed across GPUs, enabling efficient preconditioning without excessive communication overhead.
- Distributed Newton-Schulz: Two modes—duplicated and distributed—give flexibility in how the momentum orthogonalization is computed. The duplicated mode minimizes latency by having every GPU redundantly run the full iteration, while the distributed mode shards the computation across GPUs to improve per-GPU computational efficiency.
- Communication Hiding and SYRK Fusion: Overlapping parameter updates with computation and fusing SYRK (symmetric rank-k update) operations with communication significantly reduce latency, boosting overall throughput.
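The layer-wise idea in the first bullet can be illustrated with a toy scheduler. This is a hypothetical sketch, not Megatron Core's actual assignment logic: whole layers, never shards, are placed on ranks, so each rank can run Newton-Schulz preconditioning for its layers entirely locally.

```python
def assign_layers(layer_param_counts, world_size):
    """Greedily assign each whole layer to the least-loaded rank.
    Keeping layers intact means the expensive Newton-Schulz
    preconditioning needs no cross-GPU matmuls; only the finished
    parameter updates are communicated afterward."""
    owners = {rank: [] for rank in range(world_size)}
    load = [0] * world_size
    # Placing the largest layers first improves the greedy balance.
    order = sorted(range(len(layer_param_counts)),
                   key=lambda i: -layer_param_counts[i])
    for layer in order:
        rank = min(range(world_size), key=load.__getitem__)
        owners[rank].append(layer)
        load[rank] += layer_param_counts[layer]
    return owners, load
```

For example, with `layer_param_counts = [100, 90, 50, 40, 20]` and `world_size = 2`, the greedy pass yields per-rank loads of 160 and 140: a near-even split achieved without splitting any layer.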
Implications and Future Developments
By integrating Muon into Megatron Core, NVIDIA is equipping researchers and developers with tools to improve LLM training at scale. The near-parity performance with AdamW makes Muon an attractive choice, especially as upcoming updates promise further efficiency gains, including enhanced load balancing, better communication strategies, and advanced kernel optimizations for SYRK operations.
For those eager to explore these technologies, NVIDIA has made tools and performance recipes available through its Megatron Bridge GitHub repository. With these resources, researchers can implement and benchmark emerging optimizers like Muon in their own LLM projects.