ParallelKernelBench Exposes LLM Weakness in Multi-GPU Kernels

A recent benchmark, ParallelKernelBench (PKB), has revealed significant limitations in large language models (LLMs) when tasked with generating multi-GPU CUDA kernels. Despite advances in AI-driven code generation, including models like GPT-5.5 and Gemini 3 Pro, even the best model produced a correct kernel that beat the baseline on only about 31% of the benchmark's 87 problems, and that was with three attempts per problem.

PKB is noteworthy because it shifts focus from single-GPU tasks to the more complex domain of multi-GPU workloads, which dominate production AI systems today. These workloads require efficient inter-GPU communication, which is often limited by the bandwidth of interconnects such as NVLink. In contrast to single-GPU kernel generation, where performance hinges on compute and memory optimization, multi-GPU tasks introduce intricate challenges around data movement and synchronization across GPUs.

Benchmark Findings

PKB evaluates LLMs on their ability to replace standard PyTorch + NCCL (NVIDIA Collective Communications Library) implementations with optimized CUDA kernels. Models were tested across 87 real-world tasks, including workloads from systems like NVIDIA's Megatron-LM and NeMo-RL. The results were underwhelming:

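In a zero-shot setting, the best-performing model (GPT-5.5) solved only 28 of the 87 tasks, and just 22 of those solutions outperformed the baseline.
Allowing three attempts per task raised success rates, but the best model still achieved a "fast₁@3" score of just 31% (the share of tasks for which at least one of three attempts was both correct and faster than the baseline).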
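
The "fast₁@3" figure can be made concrete with a small sketch. It assumes the KernelBench-style reading of the metric: a task counts toward fast_p@k if at least one of its first k attempts is correct and more than p times faster than the baseline. The function name, data layout, and task names below are hypothetical and the numbers are invented; this is not PKB's scoring code.

```python
from typing import Dict, List, Tuple

# One attempt at one task: (is_correct, speedup over the baseline).
Attempt = Tuple[bool, float]


def fast_p_at_k(results: Dict[str, List[Attempt]], p: float = 1.0, k: int = 3) -> float:
    """Fraction of tasks where at least one of the first k attempts is
    correct AND more than p times faster than the baseline (assumed definition)."""
    if not results:
        return 0.0
    solved = 0
    for attempts in results.values():
        if any(ok and speedup > p for ok, speedup in attempts[:k]):
            solved += 1
    return solved / len(results)


# Hypothetical results for four made-up tasks (not PKB data).
example = {
    "allreduce_fused": [(False, 0.0), (True, 1.4), (True, 1.1)],
    "moe_dispatch": [(True, 0.8), (True, 0.9), (False, 0.0)],
    "ring_attention": [(False, 0.0), (False, 0.0), (False, 0.0)],
    "dist_fft": [(True, 1.2)],
}

print(fast_p_at_k(example, p=1.0, k=1))  # 0.25: only dist_fft wins on the first try
print(fast_p_at_k(example, p=1.0, k=3))  # 0.5: allreduce_fused wins on a retry
```

Note that "moe_dispatch" is correct on its first two attempts but never counts, because a correct kernel that is slower than the baseline does not satisfy fast₁.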

Failures were attributed to both low-level errors (e.g., invalid CUDA code) and deeper reasoning gaps, such as coordinating ranks and choosing the best GPU-to-GPU communication mechanism. Even the stronger models consistently struggled with tasks requiring advanced hardware features like TMA (Tensor Memory Accelerator) or NVLS (NVLink SHARP).

Why Multi-GPU Is a Harder Problem

The transition from single- to multi-GPU kernel generation radically expands the problem's complexity:

Combinatorial design space: Multi-GPU workloads mix tensor, data, expert, and sequence parallelism, each creating unique communication patterns.
Performance bottlenecks: Unlike single-GPU kernels, where compute and memory dominate, multi-GPU kernels are often limited by interconnect bandwidth.
New design choices: Efficient data movement between GPUs—whether via copy engines, SM load/store, or NVLink paths—requires careful optimization.
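
To illustrate why interconnect bandwidth matters so much, here is a minimal sketch of the standard alpha-beta cost estimate for a ring all-reduce. It is not taken from PKB; the function name, the 8 GB message size, and the link speeds are illustrative assumptions, and real interconnects add contention and protocol overhead that the model ignores.

```python
def ring_allreduce_seconds(message_bytes: float, num_gpus: int,
                           link_bytes_per_s: float,
                           step_latency_s: float = 0.0) -> float:
    """Alpha-beta estimate for a ring all-reduce.

    The ring algorithm takes 2 * (N - 1) steps (a reduce-scatter followed
    by an all-gather), and each step sends message_bytes / N over a link.
    """
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to communicate
    steps = 2 * (num_gpus - 1)
    bytes_per_step = message_bytes / num_gpus
    return steps * (step_latency_s + bytes_per_step / link_bytes_per_s)


# Illustrative numbers only: 8 GB reduced across 8 GPUs.
t_fast = ring_allreduce_seconds(8e9, 8, 100e9)  # assumed 100 GB/s per link
t_slow = ring_allreduce_seconds(8e9, 8, 50e9)   # assumed 50 GB/s per link
print(f"{t_fast:.3f} s vs {t_slow:.3f} s")  # 0.140 s vs 0.280 s
```

With per-step latency set to zero, halving the link bandwidth doubles the estimated time, regardless of how well the on-GPU compute is tuned.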

PKB's methodology reflects these challenges. Each task starts with a PyTorch + NCCL baseline, and models are asked to generate CUDA kernels that use direct GPU-to-GPU communication. The benchmark spans diverse workloads, from LLM training to graph neural network (GNN) routing and distributed FFTs.

Glimmers of Success

While the results were mixed, there were notable successes. In rare cases, models generated kernels that outperformed any publicly available implementation. For example, Gemini 3 Pro produced a custom kernel for NVIDIA NeMo-RL's GRPO training loop that fused compute and communication operations, significantly reducing latency compared to the PyTorch + NCCL reference.

Such wins highlight the potential of AI-driven kernel optimization, especially in niche areas where no optimized public references exist. However, these successes remain exceptions rather than the norm.
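
Why fusing compute and communication can cut latency is easy to see in an idealized model. The sketch below is not a description of the generated NeMo-RL kernel; it assumes the work splits into equal chunks, that communication of one chunk overlaps perfectly with computation of the next, and that the two do not contend for resources. All names and timings are hypothetical.

```python
def sequential_ms(compute_ms: float, comm_ms: float) -> float:
    """Unfused: finish all compute, then run the communication step."""
    return compute_ms + comm_ms


def pipelined_ms(compute_ms: float, comm_ms: float, chunks: int) -> float:
    """Overlapped: while chunk i is being communicated, chunk i + 1 is
    being computed (an idealized two-stage pipeline)."""
    if chunks < 1:
        raise ValueError("chunks must be >= 1")
    c = compute_ms / chunks
    m = comm_ms / chunks
    # The first chunk's compute and the last chunk's communication cannot
    # be hidden; in between, the slower stage sets the pace.
    return c + m + (chunks - 1) * max(c, m)


# Made-up timings: 4 ms of compute and 2 ms of communication.
print(sequential_ms(4.0, 2.0))    # 6.0
print(pipelined_ms(4.0, 2.0, 4))  # 4.5
```

With a single chunk the two functions agree, and as the chunk count grows the overlapped time approaches the longer of the two phases.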

What Comes Next

PKB's findings underscore the need for further research into multi-GPU kernel generation. Enhancing LLM performance will likely require two major shifts:

Feedback loops: Integrating iterative feedback (e.g., debugging, performance profiling) into the generation process could help LLMs refine their outputs.
Training data: Expanding datasets to include more examples of multi-GPU workloads—especially those involving advanced communication primitives—may help models develop stronger priors.
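
The feedback-loop idea can be sketched as a simple generate, evaluate, refine cycle. This is a hypothetical outline and not PKB's harness: `generate` stands in for a model call, `evaluate` stands in for compiling, checking correctness against the baseline, and profiling, and the feedback strings in the usage example are invented.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Evaluation:
    correct: bool
    speedup: float   # baseline time / candidate time
    feedback: str    # compiler errors, mismatch report, or profiler notes


def refine(generate: Callable[[str], str],
           evaluate: Callable[[str], Evaluation],
           task_prompt: str,
           max_rounds: int = 3) -> Optional[str]:
    """Return the fastest correct candidate that beats the baseline, or None."""
    prompt = task_prompt
    best_code: Optional[str] = None
    best_speedup = 1.0  # a candidate must beat the baseline to be kept
    for _ in range(max_rounds):
        code = generate(prompt)
        result = evaluate(code)
        if result.correct and result.speedup > best_speedup:
            best_code, best_speedup = code, result.speedup
        # Feed the evaluator's report back into the next prompt.
        prompt = (f"{task_prompt}\n\nPrevious attempt:\n{code}"
                  f"\n\nFeedback:\n{result.feedback}")
    return best_code


# Stand-ins for a model and a test harness so the sketch runs on its own.
attempts = iter(["kernel_v1", "kernel_v2", "kernel_v3"])
scores = {
    "kernel_v1": Evaluation(False, 0.0, "compile error"),
    "kernel_v2": Evaluation(True, 0.9, "correct but slower than baseline"),
    "kernel_v3": Evaluation(True, 1.3, "correct and faster than baseline"),
}
best = refine(lambda prompt: next(attempts), lambda code: scores[code],
              "fuse all-gather with GEMM")
print(best)  # kernel_v3
```

The loop keeps a candidate only if it is both correct and faster than the baseline, which mirrors the distinction the benchmark draws between working kernels and optimized ones.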

PKB also suggests future benchmarks should extend beyond intra-node NVLink to inter-node fabrics like InfiniBand or RoCE, where communication complexity increases further.

Why It Matters

As AI systems scale, the efficiency of multi-GPU workloads will directly impact the cost and speed of model training and inference. PKB highlights how far LLMs still have to go before they can autonomously optimize large-scale distributed infrastructure. For developers and researchers, the benchmark sets a clear target: closing the gap between "working" distributed kernels and truly optimized ones.

PKB is open-source, inviting contributions and collaboration to tackle these challenges. Those interested in accessing the benchmark or submitting new tasks can reach out via npaek@together.ai.