

AI Model Geneformer Unlocks Gene Networks Using Limited Data

Alvin Lang   Jul 15, 2024 14:38


Geneformer, a powerful artificial intelligence (AI) model, has emerged as a significant tool for understanding gene network dynamics and interactions from limited data. Developed by researchers at the Broad Institute of MIT and Harvard, the model leverages transfer learning from extensive single-cell transcriptome data to make accurate predictions about gene behavior and disease mechanisms, according to the NVIDIA Technical Blog. This facilitates faster drug target discovery and advances the comprehension of complex genetic networks.

A BERT-like Reference Model for Single-Cell Data

Geneformer employs a BERT-like transformer architecture, pre-trained on data from approximately 30 million single-cell transcriptomes across various human tissues. Its attention mechanism focuses on the most relevant parts of the input data, enabling the model to consider relationships and dependencies between genes. During its pretraining phase, Geneformer uses a masked language modeling technique, where a portion of the gene expression data is masked, and the model learns to predict the masked genes based on the surrounding context. This approach allows the model to understand complex gene interactions without the need for labeled data.
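The article does not include code, but the masked-gene pretraining idea can be illustrated with a small sketch. Geneformer's published approach ranks each cell's genes by expression so that position in the token sequence is meaningful, then hides a fraction of the gene tokens for the model to predict. The function names, the mask fraction, and the `-100` ignore-label convention below are illustrative assumptions, not Geneformer's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def rank_value_encode(expression, gene_ids):
    # Order genes by descending expression so sequence position carries meaning
    # (this mirrors Geneformer's published rank-value encoding idea).
    order = np.argsort(-expression)
    return gene_ids[order]

def mask_tokens(tokens, mask_id, mask_frac=0.15, rng=rng):
    # Hide a random subset of gene tokens; a model would be trained to
    # recover them from the surrounding (unmasked) gene context.
    tokens = tokens.copy()
    n_mask = max(1, int(len(tokens) * mask_frac))
    idx = rng.choice(len(tokens), size=n_mask, replace=False)
    labels = np.full(len(tokens), -100)  # -100 = position ignored by the loss
    labels[idx] = tokens[idx]            # remember the true tokens at masked spots
    tokens[idx] = mask_id                # overwrite them with the mask token
    return tokens, labels

# Toy cell: 10 genes with random expression counts (synthetic data).
gene_ids = np.arange(10)
expression = rng.poisson(5.0, size=10).astype(float)
encoded = rank_value_encode(expression, gene_ids)
masked, labels = mask_tokens(encoded, mask_id=-1)
```

Because the training signal comes from the cell's own expression profile, no manual labels are required, which is what lets the model pretrain on tens of millions of unannotated transcriptomes.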

This architecture and training method enhance predictive accuracy across various tasks related to chromatin and gene network dynamics, even with limited data. For instance, Geneformer can reconstruct crucial gene networks in heart endothelial cells using only 5,000 cells of data, a task that previously required over 30,000 cells with state-of-the-art methods.

Enhanced Predictive Capabilities

Geneformer also demonstrates impressive accuracy in specific cell type classification tasks. Using a Crohn’s Disease small intestine dataset for evaluation, the NVIDIA BioNeMo model showed performance improvements over baseline models in accuracy and F1 score. The comparisons used a baseline Logp1 PCA+RF model trained on normalized and log-transformed expression counts. Geneformer models with 10M and 106M parameters showed improved cell annotation accuracy and F1 scores over these baseline models.
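For context on what the "Logp1 PCA+RF" baseline refers to, a minimal sketch of such a pipeline is shown below, using synthetic stand-in data rather than the Crohn's Disease dataset. The counts, cell-type structure, and hyperparameters here are assumptions for illustration only; the actual baseline's configuration is not specified in the article:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in for normalized expression counts:
# 300 cells x 50 genes drawn from 3 hypothetical cell types.
n_cells, n_genes, n_types = 300, 50, 3
labels = rng.integers(0, n_types, size=n_cells)
means = rng.uniform(1, 20, size=(n_types, n_genes))
counts = rng.poisson(means[labels]).astype(float)

# "Logp1 PCA+RF": log(1 + x) transform, PCA compression, random forest.
baseline = make_pipeline(
    FunctionTransformer(np.log1p),
    PCA(n_components=20),
    RandomForestClassifier(n_estimators=200, random_state=0),
)

X_tr, X_te, y_tr, y_te = train_test_split(
    counts, labels, test_size=0.25, random_state=0, stratify=labels
)
baseline.fit(X_tr, y_tr)
pred = baseline.predict(X_te)
acc = accuracy_score(y_te, pred)
macro_f1 = f1_score(y_te, pred, average="macro")
```

Accuracy and F1 computed this way are the same metrics the article cites when comparing the baseline against the 10M- and 106M-parameter Geneformer models.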

Scalability and Advanced Features

To support the next generation of Geneformer-based models, the BioNeMo Framework has introduced two new features. First, a data loader that loads data four times faster than the published method while maintaining compatibility with the original data types. Second, Geneformer now supports tensor and pipeline parallelism, which eases memory constraints and reduces training time, making it feasible to train models with billions of parameters across multiple GPUs.
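The article does not describe how the faster loader works internally. One standard technique for fast single-cell data loading, shown purely as an assumption-laden sketch, is to convert the dataset once into a flat binary file of pre-tokenized cells and then memory-map it, so opening the file is near-instant and only the bytes of each requested mini-batch are read:

```python
import numpy as np
import os
import tempfile

# Hypothetical layout: 10,000 cells, each already tokenized to 64 gene tokens.
n_cells, seq_len = 10_000, 64
tokens = np.random.default_rng(0).integers(
    0, 25_000, size=(n_cells, seq_len), dtype=np.int32
)

# One-time conversion step: persist the tokenized cells to disk.
path = os.path.join(tempfile.mkdtemp(), "cells.npy")
np.save(path, tokens)

# Memory-mapped open: no parsing, pages are faulted in on demand.
mm = np.load(path, mmap_mode="r")

# Fetch one mini-batch of 32 cells; only these rows are actually read.
batch = np.asarray(mm[128:160])
```

Whether BioNeMo's loader uses this exact mechanism is not stated; the sketch only illustrates why avoiding repeated parsing can yield the multi-fold speedups the article reports.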

Geneformer is part of a growing suite of accelerated single-cell and spatial omics analysis tools within the NVIDIA Clara suite. These tools can be integrated into complementary research workflows for drug discovery, exemplified by research at The Translational Genomics Research Institute (TGen). The RAPIDS suite of programming libraries, including the RAPIDS-SINGLECELL toolkit and ScanPy library, accelerates preprocessing, visualization, clustering, trajectory inference, and differential expression testing of omics data.

A Foundation AI Model for Disease Modeling

Geneformer’s applications span molecular to organismal-scale problems, making it a versatile tool for biological research. The model is now open-source and available for research. It supports zero-shot learning, enabling it to predict data classes it has not been explicitly trained on. In gene regulation research, for instance, Geneformer can be fine-tuned on datasets measuring gene expression changes in response to varying levels of transcription factors, aiding in understanding gene regulation and potential therapeutic interventions.

Fine-tuning Geneformer on datasets capturing cell state transitions during differentiation can enable precise classification of cell states, assisting in understanding differentiation processes and development. The model can also identify cooperative interactions between transcription factors, enhancing the understanding of complex regulatory mechanisms.

Get Started

The 6-layer (30M parameter) and 12-layer (106M parameter) models, along with fully accelerated example code for training and deployment, are available through the NVIDIA BioNeMo Framework on NVIDIA NGC.
