NVIDIA Scales AlphaFold-Multimer for Proteome-Wide Protein Complex Prediction
NVIDIA researchers have published a technical blueprint for scaling protein complex structure prediction across entire proteomes. The work extends the AlphaFold Database with homomeric and heteromeric protein complexes generated through GPU-accelerated pipelines running on H100 SuperPOD clusters.
The work addresses a critical gap in structural biology. While AlphaFold2 revolutionized single-protein structure prediction and the AlphaFold Protein Structure Database now covers monomeric proteins extensively, structural information for protein complexes—how proteins actually interact in biological processes—has remained largely unavailable at scale.
The Computational Challenge
Predicting protein complexes presents scaling problems that don't exist for monomers. The combinatorial explosion of possible protein pairings, the computational cost of multiple sequence alignment (MSA) generation, and the need to validate interface accuracy rather than just overall structure all compound the difficulty.
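To make the combinatorial explosion concrete, here is a minimal sketch that counts candidate dimers for a single proteome. The function name and the proteome size of 20,000 are illustrative assumptions, not figures from NVIDIA's work.

```python
def count_pairings(n_proteins: int) -> dict:
    """Count candidate dimers for a proteome of n_proteins distinct proteins."""
    # Every protein can pair with itself once.
    homodimers = n_proteins
    # Unordered pairs of two distinct proteins: n * (n - 1) / 2.
    heterodimers = n_proteins * (n_proteins - 1) // 2
    return {"homodimers": homodimers, "heterodimers": heterodimers}


# Illustrative round number for a proteome size (an assumption).
example = count_pairings(20_000)
print(example)
```

Homodimer candidates grow linearly with proteome size, while all-against-all heterodimer candidates grow quadratically, which is why some form of pre-filtering is needed before any prediction is run.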
NVIDIA's approach decouples the two most compute-intensive steps: MSA generation and structure inference. For MSA generation, the team used MMseqs2-GPU with ColabFold, running one server process per GPU and stacking up to three staggered search processes on it to reduce GPU idle time. This deliberate oversubscription of CPU resources yielded a throughput improvement of up to 25% on DGX H100 nodes.
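The staggering idea can be sketched as a launch plan: each GPU gets several search slots whose start times are offset, so one search can keep the GPU busy while another is in a CPU-bound phase. This is a hypothetical illustration only. The names `SearchSlot` and `plan_staggered_searches` and the 30-second offset are assumptions, and the sketch does not invoke MMseqs2 or ColabFold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchSlot:
    gpu_id: int           # GPU whose server process this search talks to
    slot: int             # index of the search process stacked on that GPU
    start_delay_s: float  # offset from job start, in seconds


def plan_staggered_searches(n_gpus: int, searches_per_gpu: int = 3,
                            stagger_s: float = 30.0) -> list:
    """Build a launch plan with staggered start times for each GPU."""
    # The article describes stacking "up to three" searches per GPU.
    if not 1 <= searches_per_gpu <= 3:
        raise ValueError("searches_per_gpu must be between 1 and 3")
    return [
        SearchSlot(gpu_id=gpu, slot=slot, start_delay_s=slot * stagger_s)
        for gpu in range(n_gpus)
        for slot in range(searches_per_gpu)
    ]


# A DGX H100 node has 8 GPUs; the stagger interval is an assumed value.
plan = plan_staggered_searches(n_gpus=8)
print(len(plan), plan[:3])
```

In practice the offset would likely be tuned to the length of the CPU-bound phases of a search, which the article does not specify.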
Structure prediction used both ColabFold's JAX-based folding and an OpenFold implementation accelerated through TensorRT and cuEquivariance. Benchmark testing on 125 X-ray-resolved homodimers showed the accelerated OpenFold pipeline matching or slightly exceeding ColabFold's interface accuracy: 75.41% of predictions reached "usable" quality versus 72.95% for ColabFold, with mean DockQ scores of 0.647 versus 0.637.
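The two reported statistics, the share of "usable" predictions and the mean DockQ, could be computed from per-target scores roughly as follows. The article does not define "usable", so the 0.23 cutoff used here (the commonly cited DockQ boundary for "acceptable" quality) is an assumption, and the scores are made-up toy values rather than benchmark data.

```python
from statistics import mean


def summarize_dockq(scores, usable_threshold: float = 0.23) -> dict:
    """Return the mean DockQ and the fraction of scores at or above a cutoff."""
    if not scores:
        raise ValueError("scores must not be empty")
    # Count predictions whose DockQ reaches the assumed "usable" cutoff.
    usable = sum(1 for s in scores if s >= usable_threshold)
    return {
        "mean_dockq": mean(scores),
        "usable_fraction": usable / len(scores),
    }


# Toy scores for illustration only (not from the benchmark).
demo = summarize_dockq([0.10, 0.30, 0.55, 0.85])
print(demo)
```

Running the same summary on two pipelines' scores for an identical target set gives the kind of side-by-side comparison quoted above.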
Dataset Selection Strategy
Rather than attempting computationally intractable all-against-all predictions, the team prioritized proteomes by perceived importance (human-relevant organisms and WHO priority pathogens) and filtered heteromeric predictions using STRING database interaction evidence. Literature suggests that keeping only pairs with a STRING combined score above 700, on STRING's 0 to 1000 scale, can further reduce inputs while improving prediction quality.
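A score-based filter of this kind might look like the sketch below. It assumes interaction rows of the form (protein A, protein B, combined score) and treats "above 700" as a strict inequality. The function name and the toy identifiers are hypothetical, and the team's actual filtering code is not shown in the article.

```python
def filter_string_pairs(rows, min_score: int = 700) -> list:
    """Keep unordered heteromeric pairs whose combined score exceeds min_score."""
    kept = set()
    for protein_a, protein_b, score in rows:
        if protein_a == protein_b:
            continue  # self-pairs belong to the homodimer workflow
        if score > min_score:
            # Sort the pair so A-B and B-A collapse into one entry.
            kept.add(tuple(sorted((protein_a, protein_b))))
    return sorted(kept)


# Toy rows with placeholder identifiers (not real STRING data).
toy_rows = [
    ("P1", "P2", 950),
    ("P2", "P1", 950),  # same pair listed in the other direction
    ("P1", "P3", 400),
    ("P3", "P4", 701),
    ("P2", "P4", 700),  # exactly 700 is not "above 700"
]
pairs = filter_string_pairs(toy_rows)
print(pairs)
```

Deduplicating on the sorted pair matters because an interaction table may list each pair in both directions, which would otherwise double the prediction workload.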
For homodimers, a sequence packing strategy grouped proteins of equal length and sorted by MSA depth to minimize JAX recompilations. Heterodimers required different handling since chain lengths vary, with longer sequences reserved for individual jobs to accommodate SLURM runtime limits.
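The homodimer packing step can be sketched as a group-then-sort operation. JAX recompiles a jitted function when input shapes change, so keeping same-length sequences together and ordering them by MSA depth should keep consecutive inputs similarly shaped. The `Query` type and `pack_homodimer_batches` are hypothetical names, and this is an interpretation of the description above rather than the team's code.

```python
from collections import defaultdict
from typing import NamedTuple


class Query(NamedTuple):
    name: str
    length: int     # number of residues in the chain
    msa_depth: int  # number of sequences in the MSA


def pack_homodimer_batches(queries) -> list:
    """Group queries by sequence length, then sort each group by MSA depth."""
    by_length = defaultdict(list)
    for q in queries:
        by_length[q.length].append(q)
    # One batch per distinct length, shortest first; within a batch,
    # shallow MSAs come before deep ones.
    return [
        sorted(by_length[length], key=lambda q: q.msa_depth)
        for length in sorted(by_length)
    ]


# Toy queries for illustration only.
toy_queries = [
    Query("a", 120, 500),
    Query("b", 80, 40),
    Query("c", 120, 30),
    Query("d", 80, 900),
]
batches = pack_homodimer_batches(toy_queries)
batch_names = [[q.name for q in batch] for batch in batches]
print(batch_names)
```

Heterodimers do not fit this scheme directly because the two chains differ in length, which is consistent with the separate handling described above.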
Implications for Drug Discovery and Biotech
The expanded AlphaFold Database with complex structures enables several downstream applications: variant interpretation at protein interfaces, systems-level structural biology, drug target validation, and benchmarking for generative protein design models.
The partnership with EMBL-EBI, the Steinegger Lab at Seoul National University, and Google DeepMind aims to make high-confidence complex structures publicly accessible, though the team acknowledges that interface prediction remains substantially harder than monomer prediction. Assessing whether a predicted interface is biologically plausible, and whether it sits in the correct binding pocket, requires confidence metrics beyond the pLDDT scores used for monomers.
NVIDIA plans to refine the approach further and expand the universe of available protein complexes in the database, potentially accelerating computational drug discovery workflows that depend on accurate protein interaction structures.