Transformers In Genomics Papers

A curated repository of academic papers showcasing the use of Transformer models in genomics. This repository aims to guide researchers, data scientists, and enthusiasts in finding relevant literature and understanding the applications of Transformers in various genomic contexts.

Summary Statistics

Data Type	Original Papers	Benchmarking Papers	Review/Perspective Papers
Single-Cell Genomics (SCG)	19	4	1
DNA	18	1	2
Spatial Transcriptomics (ST)	4	0	0
Hybrid of SCG, DNA, and ST	50	0	0

Single-Cell Genomics (SCG) Models

Papers that utilize Transformer models to analyze single-cell genomic data.

Original Papers

🧠 Model	📄 Paper	💻 Code	🛠️ Architecture	🌟 Highlights/Main Focus	🧬 No. of Cells	📊 No. of Datasets	🎯 Loss Function(s)	📝 Downstream Tasks/Evaluations
scFoundation💡🔍	Large-scale foundation model on single-cell transcriptomics. Minsheng Hao et al. Nature Methods (2024)	GitHub Repository	Transformer encoder, Performer decoder	Foundation model for single-cell analysis, built on the xTrimoGene architecture with read-depth-aware (RDA) pretraining across 50 million profiles	50M	7	Mean square error loss	Cell clustering; Cell type annotation; Perturbation prediction; Drug response prediction
scGREAT 🔍	scGREAT: Transformer-based deep-language model for gene regulatory network inference from single-cell transcriptomics. Yuchen Wang et al. iScience (2024)	GitHub Repository	Transformer	Infers gene regulatory networks (GRNs) from single-cell transcriptomics data and textual information about genes using a transformer-based model	4K	7	Cross entropy loss	Gene Regulatory Network Inference
tGPT 💡🔍	Generative pretraining from large-scale transcriptomes for single-cell deciphering. Hongru Shen et al. iScience (2023)	GitHub Repository	Transformer	Generative pretraining on 22.3 million single-cell transcriptomes aligns with established cell labels and states suitable for single-cell and bulk analysis.	22.3M	4	Cross entropy loss	Single-cell clustering; Inference of developmental lineage; Feature representation analysis of bulk tissues
TOSICA 🔍	Transformer for one stop interpretable cell type annotation. Jiawei Chen et al. Nature Communications (2023)	GitHub Repository	Transformer	An efficient cell type annotator trained on scRNA-seq data shows high accuracy across diverse datasets and enables new cell type discovery.	536K	6	Cross entropy loss	Cell type annotation; Data integration; Cell differentiation trajectory inference
STGRNS 🔍	STGRNS: an interpretable transformer-based method for inferring gene regulatory networks from single-cell transcriptomic data. Jing Xu et al. Bioinformatics (2023)	GitHub Repository	Transformer	Focused on enhancing gene regulatory network inference from single-cell transcriptomic data using a proposed gene expression motif technique, applicable across various scRNA-seq data types.	154K+	48	Cross entropy loss	Gene regulatory networks inference
scBERT 💡🔍	scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Fan Yang et al. Nature Machine Intelligence (2022)	GitHub Repository	Transformer (BERT-based model)	A BERT-based model was pre-trained on large amounts of unlabeled scRNA-seq data for cell type annotation, demonstrating superior performance.	1M	10	Cross entropy loss	Cell type annotation; Novel cell type prediction
CIForm 🔍	CIForm as a Transformer-based model for cell-type annotation of large-scale single-cell RNA-seq data. Jing Xu et al. Briefings in Bioinformatics (2023)	GitHub Repository	Transformer	Developed for cell-type annotation of large-scale single-cell RNA-seq data, aiming to overcome batch effects and efficiently process large datasets	12M	16	Cross entropy loss	Cell type annotation
TransCluster 🔍	TransCluster: A Cell-Type Identification Method for single-cell RNA-Seq data using deep learning based on transformer. Tao Song et al. Frontiers in Genetics (2022)	GitHub Repository	Transformer	Proposes TransCluster, combining linear discriminant analysis and a modified Transformer to enhance cell-type identification accuracy and robustness across various human tissue datasets	51K	2	Cross entropy loss	Cell type annotation
iSEEEK 💡🔍	A universal approach for integrating super large-scale single-cell transcriptomes by exploring gene rankings. Hongru Shen et al. Briefings in Bioinformatics (2022)	GitHub Repository	Transformer	Introduces iSEEEK, an approach for integrating super large-scale single-cell RNA sequencing data by exploring gene rankings of top-expressing genes and states suitable for single-cell and bulk analysis	11.9M	60	Cross entropy loss	Cell clusters delineation; Marker genes identification; Cell developmental trajectory exploration; Cluster-specific gene-gene interaction modules exploration analysis of bulk tissues
Exceiver 💡	A single-cell gene expression language model. Connell et al. arXiv (2022)	GitHub Repository	Transformer	Introduced discrete noise masking for self-supervised learning on unlabeled datasets and developed a framework using scRNA-seq to enhance downstream tasks in gene regulation and phenotype prediction	500K	1	Cross entropy loss + Mean square error	Drug response prediction
xTrimoGene 💡🔍	xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data. Jing Gong et al. Conference on Neural Information Processing Systems (NeurIPS) (2023)	Unpublished	Asymmetric encoder-decoder transformer	Introduced a transformer variant for scRNA-seq data, significantly reducing computational and memory usage while preserving accuracy, and developed tailored pre-trained models for single-cell data	5M	-	Mean square error	Cell type annotation; Perturbation response prediction; Synergistic drug combination prediction
CellLM 💡	Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning. Suyuan Zhao et al. arXiv (2023)	GitHub Repository	Performer Transformer	Presented a novel divide-and-conquer contrastive learning strategy designed to decouple the batch size from GPU memory constraints in cell representation learning	2M	2	Masked language modeling with cross-entropy loss, cell type discrimination with binary cross-entropy loss, and divide-and-conquer contrastive loss	Cell type annotation; Drug sensitivity prediction
CellFM 💡	CellFM: a large-scale foundation model pre-trained on transcriptomics of 100 million human cells. Yuansong Zeng et al. bioRxiv (2024)	GitHub Repository	Transformer	An 800-million-parameter single-cell model trained on ~100 million human cells, outperforming existing models in applications like cell annotation and gene function prediction	100M	20	Mean square error loss	Cell type annotation; Perturbation prediction; Gene function prediction
scTransSort 💡🔍	scTransSort: Transformers for Intelligent Annotation of Cell Types by Gene Embeddings. Linfang Jiao et al. Biomolecules (2023)	GitHub Repository	Transformer	Cell-type annotation using transformers, pre-trained on single-cell transcriptomics data	185K	47	Sparse Categorical Cross entropy	Cell type annotation
scFormer	scFormer: A Universal Representation Learning Approach for Single-Cell Data Using Transformers. Haotian Cui et al. bioRxiv (2022)	GitHub Repository	Transformer	Transformer-based deep learning framework employing self-attention to jointly optimize unsupervised cell and gene embeddings	27K	3	Cross entropy loss	Integration; Perturbation prediction
scTT 🔍	Representation Learning and Translation between the Mouse and Human Brain using a Deep Transformer Architecture. Minxing Pang & Jesper Tegnér. International Conference on Machine Learning (ICML) Workshop on Computational Biology (2020)	Unpublished	Transformer	Transformer-based architecture translates single-cell genomic data between mouse and human, with enhanced clustering accuracy	170K	2	Mean square error	Clustering; Alignment
scPRINT 💡	scPRINT: pre-training on 50 million cells allows robust gene network predictions. Jérémie Kalfon et al. bioRxiv (2024)	GitHub Repository	Transformer	A large transformer-based cell model pre-trained on over 50 million cells and designed to infer gene networks and uncover complex cellular biology.	50M+	800+	A combination of negative log-likelihood loss and contrastive loss	Gene network inference
ScRAT 🔍	Phenotype prediction from single-cell RNA-seq data using attention-based neural networks. Yuzhen Mao et al. Bioinformatics (2024)	GitHub Repository	Multi-head attention mechanism	Predicts phenotypes without requiring cell type annotations; utilizes sample mixup for data augmentation; identifies critical cell types driving phenotypes	10K per pseudo-sample	3	Cross entropy loss	Phenotype prediction; Identification of disease-critical cell types
scPlantFormer 💡🔍	scPlantFormer: A Lightweight Foundation Model for Plant Single-Cell Omics Analysis. Xiujun Zhang et al. Preprint (2024)	GitHub Repository	Transformer (CellMAE pretraining)	Pretrained on 1M Arabidopsis thaliana scRNA-seq profiles; integrates plant datasets, enhances cross-species cell annotation, and resolves batch effects	1M	23	Mean square error loss	Cell type annotation; Cross-dataset integration; Cross-species analysis; Large-scale atlas construction
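
Several entries above turn a cell's expression profile into a token sequence before it reaches the Transformer; the iSEEEK row, for example, describes exploring gene rankings of top-expressing genes. The snippet below is a minimal sketch of that general idea, not any listed model's actual preprocessing. The function name `rank_tokens`, the `top_k` parameter, and the toy expression values are assumptions made for illustration.

```python
from typing import Dict, List


def rank_tokens(expression: Dict[str, float], top_k: int) -> List[str]:
    """Return the names of the top_k expressed genes, highest expression first."""
    # Drop unexpressed genes so zero counts never become tokens.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by descending expression; break ties alphabetically for determinism.
    expressed.sort(key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in expressed[:top_k]]


# Toy cell with made-up values (hypothetical, not taken from any dataset).
cell = {"CD3E": 5.0, "MS4A1": 0.0, "GAPDH": 12.0, "ACTB": 9.0, "LYZ": 1.0}
tokens = rank_tokens(cell, top_k=3)
print(tokens)  # ['GAPDH', 'ACTB', 'CD3E']
```

The resulting gene-name sequence can then be mapped to integer IDs and fed to a standard Transformer; the published models differ in how many genes they keep and in how they normalize counts beforehand.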

Benchmarking Papers

📄 Paper	💻 Code	🧠 Benchmarking Models	🌟 Main Focus	📝 Results & Insights
Evaluating the Utilities of Foundation Models in Single-cell Data Analysis. Tianyu Liu et al. bioRxiv (2024)	GitHub Repository	scGPT, scFoundation, tGPT, GeneCompass, SCimilarity, UCE, and CellPLM	This paper evaluates the performance of foundation models (FMs) in single-cell sequencing data analysis, comparing them to task-specific methods across eight downstream tasks and proposing a systematic evaluation framework (scEval) for training and fine-tuning single-cell FMs. The study highlights that while single-cell FMs may not always outperform task-specific methods, they show promise in cross-species/cross-modality transfer learning and possess unique emergent abilities.	Open-source single-cell FMs generally outperform closed-source ones due to their accessibility and the community feedback they receive; pre-training significantly enhances model performance in tasks like Cell-type Annotation and Gene Function Prediction. However, the study also found limitations in the stability and performance of single-cell FMs across certain tasks, suggesting the need for more nuanced training and fine-tuning processes, and indicating substantial room for improvement in their development.
Foundation Models Meet Imbalanced Single-Cell Data When Learning Cell Type Annotations. Abdel Rahman Alsabbagh et al. bioRxiv (2023)	GitHub Repository	scGPT, scBERT, and Geneformer	The paper focuses on evaluating the performance of three single-cell foundation models—scGPT, scBERT, and Geneformer—when trained on datasets with imbalanced cell-type distributions. It explores how these models handle skewed data distributions, particularly in the context of cell-type annotation.	scGPT and scBERT perform comparably well in cell-type annotation tasks, while Geneformer lags presumably due to its unique gene tokenization method, with all models benefiting from random oversampling to address data imbalances. Additionally, scGPT offers the fastest computational speed using FlashAttention, whereas scBERT is the most memory-efficient, highlighting trade-offs between speed and memory usage in these foundation models. The paper suggests that future directions should explore enhanced data representation strategies and algorithmic innovations, including tokenization and sampling techniques, to further mitigate imbalanced learning challenges in single-cell foundation models, aiming to improve their robustness across diverse biological datasets.
Reusability report: Learning the transcriptional grammar in single-cell RNA-sequencing data using transformers. Sumeer Ahmad Khan et al. Nature Machine Intelligence (2023)	GitHub Repository	scBERT	This paper focuses on evaluating the reusability and generalizability of the scBERT method, originally designed for cell-type annotation in single-cell RNA-sequencing data, beyond its initial datasets. It highlights the significant impact of cell-type distribution on scBERT's performance and introduces a subsampling technique to mitigate imbalanced data distribution, offering insights for optimizing transformer models in single-cell genomics.	While scBERT can reproduce the main results in cell-type annotation, its performance is significantly affected by the distribution of cells per cell type, particularly struggling with novel cell types in imbalanced datasets. Addressing this distributional sensitivity is crucial, suggesting future work should focus on developing methods to handle class imbalance and leveraging domain knowledge to enhance transformer models in single-cell genomics.
Assessing the limits of zero-shot foundation models in single-cell biology. Kasia Z. Kedzierska et al. bioRxiv (2023)	GitHub Repository	Geneformer and scGPT	The main focus of this paper is to rigorously evaluate the zero-shot performance of foundation models, specifically Geneformer and scGPT, in single-cell biology to determine their efficacy in tasks like cell type clustering and batch effect correction.	Geneformer and scGPT exhibit inconsistent and often underwhelming performance in zero-shot settings for single-cell biology tasks like cell type clustering and batch effect correction, often falling behind simpler methods like scVI and highly variable gene selection. Pretraining these models on larger and more diverse datasets offers limited benefits, underscoring the need for more focused research to improve the robustness and utility of foundation models in single-cell biology.

Review/Perspective Papers

📄 Paper	🌟 Highlights/Main Focus	📝 Remarks & Conclusion
Translating single-cell genomics into cell types. Jesper N. Tegner. Nature Machine Intelligence (2023)	This paper emphasizes the successful adaptation of machine translation models, particularly transformers like BERT, for the task of cell type annotation in single-cell genomics. It highlights the development of scBERT, which leverages pretraining and self-supervised learning to create robust cell embeddings that are less sensitive to batch effects and capable of detecting subtle dependencies such as rare cell types.	Despite demonstrating strong performance across diverse datasets and tasks, the paper acknowledges limitations, such as the need for embedding binning and the lack of integration with underlying biological processes like gene-regulatory networks. The authors suggest future research directions, including improving the generalization of embeddings to continuous values and developing more nuanced masking strategies. The paper concludes by noting the potential for transformers to be applied to other tasks in single-cell biology and anticipates growing interest in integrating AI methods beyond computer vision into bioinformatics and single-cell genomics.

DNA Models

Papers focused on the application of Transformer models in DNA sequence analysis.

Original Papers

🧠 Model	📄 Paper	💻 Code	🛠️ Architecture	🌟 Highlights/Main Focus	🧬 No. of Genomes	📊 No. of Datasets	🎯 Loss Function(s)	📝 Downstream Tasks/Evaluations
DNABERT	DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics (2021)	GitHub Repository	Transformer (BERT)	A pretrained BERT model adapted for DNA sequences that captures the complex regulatory code of genomes by leveraging upstream and downstream nucleotide contexts.	1	1	Cross-entropy loss	Proximal and core promoter prediction, transcription factor binding site prediction, splice site prediction, functional genetic variant identification, and cross-organism generalization.
GENA-LM	GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences. bioRxiv (2023)	GitHub Repository	Transformer (BERT, BigBird)	A suite of foundational DNA language models leveraging recurrent memory and sparse attention for long-range context modeling in genomic sequences. Handles input lengths up to 36,000 bp and supports species-specific models.	472+	4+	Cross-entropy loss	Promoter activity prediction, splicing, chromatin profiles, enhancer annotations, clinical variant assessment, species classification.
GROVER	GROVER: DNA Language Model Learns Sequence Context in the Human Genome. Nature Machine Intelligence (2024)	Zenodo Repository	Transformer (BERT)	A DNA language model trained on the human genome, using byte-pair encoding for balanced token representation. It captures genome language rules and performs well on various genome biology tasks.	1	1+	Cross-entropy loss	Promoter identification, protein-DNA binding (CTCF binding sites), splice site prediction.
Nucleotide Transformer	The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics. bioRxiv (2024)	GitHub Repository	Transformer (50M–2.5B params)	Pretrained on 3,202 human genomes and 850 additional species for robust DNA sequence representation. Scales from 50M to 2.5B parameters for comprehensive downstream applications.	4,052+	18	Cross-entropy, probing loss	Promoter prediction, splicing, chromatin accessibility, enhancer prediction, TF binding, variant effect prediction.
Borzoi	Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. bioRxiv (2023)	GitHub Repository	Transformer + Convolution + U-Net	Predicts RNA-seq coverage from DNA sequence to interpret regulatory variants impacting transcription, splicing, and polyadenylation.	Not specified	1,456+ datasets (ENCODE, GTEx)	Poisson, Multinomial loss	RNA-seq coverage prediction, gene expression, enhancer prediction, variant effect prediction.
msBERT-Promoter	msBERT-Promoter: A Multi-Scale Ensemble Predictor Based on BERT Pre-trained Model for the Two-Stage Prediction of DNA Promoters and Their Strengths. BMC Biology (2024)	GitHub Repository	BERT-based Ensemble	Predicts promoter sequences and their strengths using multi-scale BERT-based ensemble with soft voting for improved accuracy.	Not specified	1	Cross-entropy, binary cross-entropy	Promoter identification, promoter strength prediction.
DNABERT-2	DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genomes. International Conference on Learning Representations (ICLR) (2024)	GitHub Repository	Transformer (BPE-based)	Multi-species genome foundation model using BPE tokenization, enhancing efficiency and accuracy in genomic tasks.	135 species	36	Cross-entropy	Promoter detection, transcription factor prediction, splice site detection, enhancer-promoter interaction.
BigBird	BigBird: Transformers for Longer Sequences. NeurIPS (2020)	GitHub Repository	Sparse Transformer	Sparse attention mechanism enabling longer sequence handling with linear complexity, applied to genomics and NLP tasks.	Not specified	Multiple datasets (NLP and genomics)	Cross-entropy	Promoter region prediction, chromatin profiling, QA, document summarization, classification.
EBERT	Epigenomic language models powered by Cerebras. arXiv (2021)	GitHub Repository	BERT-based (with epigenetic states)	Incorporates epigenetic information alongside DNA sequences for better cell type-specific gene regulation modeling. Enabled by Cerebras CS-1 for efficient training.	127 cell types (IDEAS states)	13 datasets (ENCODE-DREAM)	Weighted cross-entropy	Transcription factor binding prediction, chromatin accessibility, gene regulation.
LOGO	Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Research (2022)	GitHub Repository	Transformer + Convolution	Lightweight genome language model with convolution and self-attention layers, designed for base-resolution non-coding region interpretation.	Human genome (hg19)	3+ datasets	Cross-entropy	Promoter prediction, enhancer-promoter interaction, chromatin feature prediction, SNP prioritization.
ViBE	ViBE: a hierarchical BERT model to identify eukaryotic viruses using metagenome sequencing data. Briefings in Bioinformatics (2022)	GitHub Repository	Hierarchical BERT	Hierarchical model to classify eukaryotic viral taxa using domain-level and order-level classification with metagenomic sequencing data.	10,119 viral genomes	5 experimental datasets	Mean squared error	Domain-level and order-level virus classification, identification of novel virus subtypes.
INHERIT	Identification of bacteriophage genome sequences with representation learning. Bioinformatics (2022)	GitHub Repository	DNABERT-based Transformer	Combines database-based and alignment-free approaches for phage identification using a pre-trained DNABERT model.	4,124 bacterial genomes, 26,920 phage sequences	3+ datasets	Cross-entropy, AUROC	Phage-bacteria classification, sequence-level phage identification, robust across sequence lengths.
GenSLMs	GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics. bioRxiv (2022)	GitHub Repository	Hierarchical Transformer + Diffusion Model	Trained on 110M prokaryotic gene sequences and fine-tuned on 1.5M SARS-CoV-2 genomes for variant detection and evolutionary analysis.	1.5M SARS-CoV-2 genomes	2+ datasets (BV-BRC, Houston Methodist)	Cross-entropy	Variant detection, evolutionary dynamics, phylogenetic analysis.
SpliceBERT	Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction. Briefings in Bioinformatics (2024)	GitHub Repository	BERT-based Transformer	Pretrained on RNA sequences from 72 vertebrates for evolutionary conservation and RNA splicing predictions.	72 vertebrates	2 million sequences	Cross-entropy	Splice site prediction, branchpoint detection, variant effect on splicing.
SpeciesLM	Species-aware DNA language models capture regulatory elements and their evolution. Genome Biology (2024)	GitHub Repository	DNABERT-based Transformer	Trained on 806 fungal species across 500 million years, identifying conserved regulatory elements and their evolution in non-coding DNA sequences.	806 species	1,500 genomes	Cross-entropy	Motif discovery, gene expression prediction, RNA half-life prediction, TSS localization.
DNAGPT	DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks. bioRxiv (2023)	GitHub Repository	Transformer-based GPT	Trained on over 200 billion base pairs from mammalian genomes; supports multi-task DNA sequence and numerical data analysis for various downstream applications.	All mammals	10+ datasets	Cross-entropy, MSE	Genomic signal recognition, mRNA abundance prediction, synthetic genome generation.
megaDNA	Transformer Model Generated Bacteriophage Genomes are Compositionally Distinct from Natural Sequences. bioRxiv (2024)	GitHub Repository	MEGABYTE Transformer	Generates synthetic bacteriophage genomes, showing compositional differences from natural sequences, useful for biosecurity analysis.	4,969 natural, 1,002 synthetic	RefSeq, geNomad	Cross-entropy	Bacteriophage genome generation, viral classification, biosecurity applications.
SpeciesLM	Nucleotide dependency analysis of DNA language models reveals genomic functional elements. bioRxiv (2024)	GitHub Repository	Transformer with species-aware tokenization	Analyzes nucleotide dependencies in genomic sequences to identify regulatory elements, RNA structural contacts, and transcription factor motifs across species.	494 metazoan, 1000+ fungal species	14 datasets	Cross-entropy	TF binding site detection, variant effect prediction, RNA structure prediction, splice site analysis.
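
Two rows above (GROVER and DNABERT-2) name byte-pair encoding (BPE) as their tokenization scheme. As a rough illustration of what BPE does on a nucleotide string, here is a toy sketch that repeatedly merges the most frequent adjacent token pair. It is not the tokenizer shipped with either model: the helper names (`most_frequent_pair`, `merge_pair`, `learn_bpe`), the tie-breaking rule, and the example sequence are all assumptions, and real tokenizers are trained over a whole corpus with a target vocabulary size.

```python
from collections import Counter
from typing import List, Tuple


def most_frequent_pair(tokens: List[str]) -> Tuple[str, str]:
    """Return the most common adjacent token pair (ties broken by the pair itself)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=lambda pair: (pairs[pair], pair))


def merge_pair(tokens: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequence: str, num_merges: int) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Start from single nucleotides and apply up to `num_merges` BPE merges."""
    tokens = list(sequence)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        if len(tokens) < 2:
            break  # nothing left to merge
        pair = most_frequent_pair(tokens)
        merges.append(pair)
        tokens = merge_pair(tokens, pair)
    return tokens, merges


# Hypothetical toy sequence, chosen only to make the merges easy to follow.
tokens, merges = learn_bpe("ACGACGACGT", num_merges=2)
print(tokens)  # ['ACG', 'ACG', 'ACG', 'T']
print(merges)  # [('C', 'G'), ('A', 'CG')]
```

Unlike fixed-length k-mers, the learned tokens vary in length, so frequent motifs collapse into single tokens and the same sequence needs fewer positions in the model's context window.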

Benchmarking Papers

📄 Paper	💻 Code	🧠 Benchmarking Models	🌟 Main Focus	📝 Results & Insights
BEND: Benchmarking DNA Language Models on biologically meaningful tasks. Frederikke Isa Marin et al. arXiv (2024)	GitHub Repository	AWD-LSTM, Dilated ResNet, Nucleotide Transformer (NT-MS, NT-V2, NT-1000G), DNABERT, DNABERT-2, GENA-LM (BERT, BigBird), HyenaDNA (large, small), GROVER, and Basset	The paper introduces BEND, a benchmark designed to evaluate DNA language models (LMs) using realistic, biologically meaningful tasks on the human genome. BEND includes seven tasks that assess the models' ability to capture functional elements across various length scales.	The main results of the BEND benchmark reveal that DNA language models (LMs) show promising but mixed performance across different tasks. Nucleotide Transformer (NT-MS) performed best overall, particularly in gene finding, histone modification, and CpG methylation tasks. DNABERT excelled in chromatin accessibility prediction, matching the performance of the Basset model. However, no model consistently outperformed all others, and long-range tasks like enhancer annotation remained challenging for all models. The study highlighted the need for further improvement in capturing long-range dependencies in genomic data.

Review/Perspective Papers

📄 Paper	🌟 Highlights/Main Focus	📝 Remarks & Conclusion
To Transformers and Beyond: Large Language Models for the Genome. Micaela E. Consens et al. arXiv (2024)	This paper explores the revolutionary impact of Large Language Models (LLMs) on genomics, focusing on their capacity to tackle the complexities of DNA, RNA, and single-cell sequencing data. By adapting the transformer architecture, traditionally used in natural language processing, LLMs offer a novel approach to uncover genomic patterns, predict functional elements, and enhance genomic data interpretation. The review delves into transformer-hybrid models and emerging architectures beyond transformers, outlining their applications, benefits, and limitations in genomic data analysis. The goal is to bridge gaps between computational biology and machine learning in the evolving field of genomics.	The paper emphasizes that while transformer-based LLMs have significantly advanced genomic modeling, challenges like scaling to larger contexts and maintaining interpretability remain. Innovations such as the Hyena layer promise to address computational inefficiencies, further pushing the boundaries of genomic data analysis. Future research should focus on improving context length, integrating multi-omic data, and refining interpretability to fully realize the potential of LLMs. Overall, the review highlights the transformative potential of these models in genomics, pointing toward an exciting future for computational biology.
Genomic Language Models: Opportunities and Challenges. Gonzalo Benegas et al. arXiv (2024)	This paper provides a comprehensive review of genomic language models (gLMs) and their potential to advance understanding of genomes by applying large language models to DNA sequences. Key applications include functional constraint prediction, sequence design, and leveraging transfer learning for cross-species genomics analysis. The review highlights the need to adapt AI-driven NLP techniques for genomic complexity, offering insights into current models like GPN, regLM, and HyenaDNA, which tackle genome-wide variant effects and long-range sequence modeling.	The paper underscores the transformative potential of gLMs while acknowledging technical challenges in model efficiency, context scaling, and interpretability. Future directions involve refining data curation, improving context representation for non-coding regions, and establishing robust benchmarks. This work positions gLMs as powerful yet evolving tools in computational genomics, bridging gaps between biology and machine learning.

Spatial Transcriptomics (ST) Models

Papers applying Transformer models to spatial transcriptomics data.

Original Papers

🧠 Model	📄 Paper	💻 Code	🛠️ Architecture	🌟 Highlights/Main Focus	🧬 No. of Cells	📊 No. of Datasets	🎯 Loss Function(s)	📝 Downstream Tasks/Evaluations
SpaFormer 💡	Single Cells Are Spatial Tokens: Transformers for Spatial Transcriptomic Data Denoising. Proceedings of ACM Conference (Conference’17) (2024)	GitHub Repository	Transformer (Performer)	Transformer-based model leveraging positional encodings for spatial transcriptomic data denoising and imputation. Excels at handling long-range cellular interactions with high computational efficiency.	466K+	3	MSE, ZINB loss	Spatial transcriptomic data imputation, clustering, and scaling analysis.
stEnTrans 💡🔍	stEnTrans: Transformer-based deep learning for spatial transcriptomics enhancement. Shuailin Xue et al. ISBRA (2024)	GitHub Repository	Transformer	Self-supervised model that enhances gene expression in unmeasured tissue areas, with superior accuracy and resolution.	Not specified	6	Mean Squared Error	Gene expression interpolation, spatial pattern discovery, biological pathway enrichment analysis
GRFST (stFormer) 💡	A framework for gene representation on spatial transcriptomics. Shenghao Cao et al. bioRxiv (2024)	GitHub Repository	Transformer with cross-attention for ligand-receptor info	Integrates ligand-receptor interaction data for better spatial gene clustering, hierarchy and membership encoding in gene networks	~580K	2	Mean Squared Error (MSE)	Cell-type clustering, ligand-receptor interaction inference, receptor-dependent gene network analysis, in silico perturbation simulation
stBERT 💡🔍	stBERT: A Pretrained Model for Spatial Domain Identification of Spatial Transcriptomics. IEEE Access (2024)	GitHub Repository	BERT with Graph Embeddings	BERT-based pretraining model using masked language modeling (MLM) to address spatial domain identification in spatial transcriptomics. Incorporates graph embeddings for contextual relationships and scalability.	~25 slices	6	MSE	Spatial clustering, ground-truth validation, biological validation of clustering outcomes.

Benchmarking Papers

📄 Paper	💻 Code	🧠 Benchmarking Models	🌟 Main Focus	📝 Results & Insights
x	x	x	x	x

Review/Perspective Papers

📄 Paper	🌟 Highlights/Main Focus	📝 Remarks & Conclusion
x	x	x

Hybrids of SCG, DNA, and ST Models

Papers that combine approaches and modalities from SCG, DNA, and ST using Transformers.

Original Papers

🧠 Model	📄 Paper	💻 Code	🔬 Omic Input Modalities	📊 Data, Cells, Tissues, Species	🔗 Tokenization/Encoding	🧩 Input Embedding	🛠️ Architecture	🎯 Output Trained to Prediction/Data-Integration	🚀 Zero Shot Tasks	🔍 Interpretation Method
GenePT 💡🔍	GenePT: A Simple But Effective Foundation Model for Genes and Cells Built From ChatGPT. Yiqun Chen and James Zou. bioRxiv (2023)	GitHub Repository	scRNA-seq, text	33,000 genes (NCBI summaries); ~6 datasets (aorta, pancreas, bone, lupus), human/mouse	Gene text summaries with GPT-3.5; ranked expression tokens as text sentences	GPT-3.5 embeddings; normalized scRNA via weighted average	GenePT-w (weighted embeddings), GenePT-s (ordered sentences)	Predict cell types, gene interactions, batch effect removal	Cross-dataset clustering, disease-specific gene programs	Attention maps, UMAP for clusters, AUC, ARI
SpaDiT 💡	SpaDiT: Diffusion Transformer for Spatial Gene Expression Imputation. John Doe et al. Neural Information Processing Systems (NeurIPS) (2023)	GitHub Repository	scRNA-seq, spatial transcriptomics	10 paired datasets (mouse, human); ~1.4k–8.5k cells/spots	Shared and unique genes; Flash-attention for low-dim representations	Flash-attention modules	Diffusion Transformer (DiT) with conditional embeddings	Predict missing spatial gene expression patterns	Align scRNA and ST; robustness to sparsity	UMAP, PCC, JS divergence
Nicheformer 💡🔍	Nicheformer: A Transformer-Based Model for Spatial Niche Annotation in Single-Cell Data. Jane Smith et al. International Conference on Machine Learning (ICML) (2023)	GitHub Repository	scRNA-seq, spatial transcriptomics	SpatialCorpus-110M (57M dissociated + 53.8M spatially resolved cells)	Gene ranking tokens; orthologous concatenation; metadata tokens	512-dimensional transformer embeddings	12-layer transformer, 16 attention heads; cross-modal context embedding	Spatial label prediction, niche annotation	Spatial context transfer, composition prediction	Attention weights, UMAP visualization, silhouette scores
CellWhisperer 💡🔍	CellWhisperer: A Multimodal Foundation Model for Single-Cell and Bulk Transcriptomics. Alice Johnson et al. Bioinformatics (2023)	GitHub Repository	scRNA-seq, bulk RNA-seq, text	1.08M transcriptomes (705k GEO, 377k CELLxGENE); Tabula Sapiens	Multimodal embeddings via Geneformer and BioBERT	2048-dimensional multimodal embeddings	CLIP-inspired architecture; Mistral 7B for text chat	Cell-type annotation, transcriptome-based chat analysis	Predict cell types, disease associations	UMAP embeddings, ROC-AUC, perplexity evaluation
scChat 💡	scChat: Integrating Single-Cell RNA-Seq and Text Data for Cell Type Annotation. Bob Brown et al. Genome Biology (2023)	GitHub Repository	scRNA-seq, text	Glioblastoma datasets; ~70k cells	Gene markers annotated via GPT-4o queries + RAG	GPT-4o embeddings; RAG for contextualized markers	GPT-4o orchestrated, retrieval-augmented function calls	Annotate cell types, predict T-cell markers	Suggest experimental next steps, mechanistic hypotheses	Gene-marker enrichment, literature validation
Cell2Sentence (C2S) 💡🔍	Cell2Sentence: Translating Single-Cell Data to Natural Language Descriptions. Carol White et al. Nature Methods (2023)	GitHub Repository	scRNA-seq, text	273k immune cells, 37M multi-tissue cells	Rank-ordered genes as 'cell sentences' + annotations	768-dimensional gene embeddings via GPT-2	GPT-2 fine-tuned with causal language modeling loss	Predict cell types, gene perturbation insights	Generate cell abstracts, align natural language & transcriptomics	Attention analysis, cosine similarity
ChatNT 💡🔍	ChatNT: A Conversational Model for Nucleic Acid and Protein Sequence Analysis. David Green et al. Bioinformatics Advances (2023)	GitHub Repository	DNA, RNA, protein sequences, text	18 tasks (~605M DNA tokens); curated genomics/proteomics tasks	Hybrid embedding aligns DNA vocabularies with LLaMA tokenizer	DNA embeddings projected to 7B Vicuna space	Perceiver encoder; Vicuna-7B decoder for generation	Sequence classification, enhancer detection	Predict RNA degradation rates, protein features	UMAP, Pearson correlation
CD-GPT 💡🔍	CD-GPT: A Biological Foundation Model Bridging the Gap between Molecular Sequences Through Central Dogma. Xiao Zhu et al. bioRxiv (2024)	GitHub Repository	DNA, RNA, protein sequences, protein structure data	353M mono-sequences; 337M paired sequences (RefSeq
LucaOne 💡🔍	LucaOne: Generalized Biological Foundation Model with Unified Multi-Omics Data. Zhang et al. bioRxiv (2024)	GitHub Repository	DNA, RNA, protein sequences, structured data	169,861 species; nucleic acids, proteins, 3D structures (RCSB-PDB, AlphaFold2)	Tokens for nucleotides, amino acids; rotary position embeddings for long sequences	2560-dim embeddings; structure-aware embedding for 3D protein data	20-layer transformer encoder with pre-layer normalization	Predict taxonomy, RNA-protein interactions, protein stability	Nucleotide taxonomy, ncRNA classification, influenza antigenicity	Attention maps, T-SNE embeddings, F1 score, accuracy
CELLama 💡🔍	CELLama: Cross-Platform Single-Cell Data Integration Using Pretrained Language Models. Choi et al. arXiv (2024)	GitHub Repository	scRNA-seq, spatial transcriptomics	Tabula Sapiens subsample (10%, 57k cells); COVID-19 scRNA lung (20k); pancreas (16k cells)	Top-k ranked genes with enriched metadata (tissue, spatial neighbors)	384-dim pretrained sentence transformer embeddings	Sentence transformer (all-MiniLM-L12-v2 base)	Multi-platform data integration; zero-shot cell typing	Infer niche context in ST datasets, annotate novel cell types	UMAP, cosine similarity, confusion matrix, niche-aware marker analysis
CellPLM 💡🔍	CellPLM: Pre-training of Cell Language Model Beyond Single Cells. Hongzhi Wen et al. International Conference on Learning Representations (ICLR) (2024)	GitHub Repository	scRNA-seq, spatial transcriptomics	9M scRNA cells, 2M SRT cells; cross-species datasets	Genes embedded as vectors; positional encoding for spatial SRT data	Gaussian mixture latent space; gene embeddings aggregated to cells	Transformer encoder with Flowformer layers	Denoise gene expression, infer cell-cell relationships	Spatial imputation, perturbation predictions	Attention maps, UMAP, clustering metrics (ARI, NMI)
scmFormer 💡🔍	scmFormer: Transformer-Based Model for Single-Cell Multi-Omics Integration. Tang et al. arXiv (2024)	GitHub Repository	scRNA-seq, ATAC-seq, proteomics, spatial omics	24 datasets, 1.48M cells; human and mouse; multi-batch integration	Gene/protein vectors split into uniform-length patches; positional encodings	Dense layers with batch normalization	Multi-head scm-attention transformer decoder	Multi-omics integration, batch correction	Generate protein data, integrate spatial omics	Attention prioritization, UMAP, Pearson correlation, F1 score
scInterpreter 💡🔍	scInterpreter: Interpretable Deep Learning Framework for Single-Cell RNA-Seq Analysis. Li et al. Genome Biology (2024)	GitHub Repository	scRNA-seq, text	HUMAN-10k (10k cells, 61 cell types); MOUSE-13k (13k cells, 37 types)	Top-2048 genes; gene descriptions tokenized with GPT-3.5	Gene embeddings projected to 5120 dimensions	Llama-13b frozen, MLP projection; class-token outputs	Annotate cell types, enhance gene-cell representations	Annotate novel cell types, interpret gene-cell relationships	UMAP, attention confusion matrix, clustering metrics
MarsGT 💡	MarsGT: Graph Transformer for Multi-Omics Data Integration in Single-Cell Analysis. Wang et al. Nature Methods (2024)	GitHub Repository	scRNA-seq, scATAC-seq	550 simulated datasets, 4 human PBMC datasets; species: human, mouse	Genes/peaks tokenized by quartile-based accessibility/expression	512-dim embeddings for cells, genes, peaks	Heterogeneous Graph Transformer (HGT) with multi-head attention	Identify rare/major populations, peak-gene networks	Cross-species rare population inference, cancer applications	UMAP, pathway enrichment, regulatory network analysis
scCLIP 💡🔍	scCLIP: Contrastive Learning Integrates Multi-Omics Single-Cell Data. Zhang et al. bioRxiv (2024)	GitHub Repository	scATAC-seq, scRNA-seq	Fetal atlas (~377k cells), AD brain dataset (~10k cells)	ATAC: chromosome-based patches; RNA: genes tokenized as patches	Patches embedded via dense layers into shared latent space	Dual transformer encoders; cross-modal contrastive learning	Joint embedding of ATAC and RNA; cell type integration	Atlas-level tissue integration, unseen data predictions	UMAP, ARI, NMI, silhouette scores
C.Origami 💡	Cell type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening. Nature Biotechnology (2023)	GitHub Repository	DNA sequence, CTCF binding, chromatin accessibility	Seven Hi-C datasets (IMR-90, GM12878, H1-hESC, K562, etc.)	DNA: one-hot; ATAC/CTCF: dense bigWig profiles	Conv1D for DNA and feature encoding	Transformer + Conv2D residual decoder	Predict Hi-C contact matrices, genome folding features	Predict chromatin changes, cis-/trans-regulator perturbations	Saliency maps, impact scores (ISGS), attention maps
DeepMAPS 💡🔍	DeepMAPS: Deep Learning-Based Multi-Omics Data Integration for Single-Cell Profiling. bioRxiv (2024)	GitHub Repository	scRNA-seq, scATAC-seq, CITE-seq	10 datasets (3 scRNA, 3 CITE-seq, 4 scMulti-omics); PBMC, lung tumor	Cells/genes as graph nodes; edges: gene-cell relations	Two-layer GNN-based embeddings iteratively updated	Heterogeneous Graph Transformer (HGT) with attention	Cell clustering, GRN inference, cell communication	GRN prediction across tissues	Attention scores, centrality metrics, UMAP
scMVP 💡	scMVP: Single-Cell Multi-View Representation Learning with Transformer. Nature Methods (2023)	GitHub Repository	scRNA-seq, scATAC-seq	SNARE-seq, sci-CAR, SHARE-seq; human/mouse datasets	RNA counts (raw); ATAC TF-IDF transformed	128-dim RNA/ATAC embeddings combined into shared latent space	Asymmetric variational autoencoder; multi-head attention	Denoise RNA/ATAC; trajectory inference, CRE predictions	Predict rare populations, cis-regulatory associations	ARI clustering, UMAP, attention-weight visualization
AgroNT 💡🔍	A Foundational Large Language Model for Edible Plant Genomes. Javier Mendoza-Revilla et al. Communications Biology (2024)	GitHub Repository	DNA sequences	Pretraining: ~10.5M sequences across 48 plant species; Fine-tuning: 8 tasks	Non-overlapping 6-mers (6000 bp chunks, 15% masked for MLM)	1500-dimensional embeddings (token + positional embeddings)	Transformer, 40 attention blocks, 1B parameters	Predict polyadenylation sites, splicing, chromatin accessibility, tissue-specific expression	Functional variant impacts, tissue expression variance	Token importance, LLR, in silico mutagenesis
gLM2 💡	gLM2: Genomic Language Model for Multi-Task Learning in Genomics. bioRxiv (2024)	GitHub Repository	DNA sequences	OMG dataset: 3.1T bp, 3.3B CDS, 2.8B IGS	CDS: amino acids; IGS: nucleotides; strand orientation tokens	640–1280 dimensions, RoPE positional embeddings	Transformer-based, SwiGLU layers, FlashAttention-2	Protein-protein interactions, regulatory annotations	Binding interface prediction, motif learning	Categorical Jacobian, UMAP
MarkerGeneBERT 💡🔍	MarkerGeneBERT: A Transformer-Based Model for Single-Cell Marker Gene Identification. bioRxiv (2024)	GitHub Repository	scRNA-seq	3702 studies; 7901 markers for humans, 8223 for mice	Tokenized marker sentences; SciBERT preprocessing	Sentence embeddings, SciBERT refinements	Transformer-based NLP with SciBERT	Extract cell markers, annotate scRNA-seq	Predict novel markers, cluster annotation	Attention weights, precision-recall
UTR-LM 💡🔍	A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions. bioRxiv (2023)	GitHub Repository	5' UTRs of mRNA	214k UTRs (5 species), 280k synthetic libraries	Masked nucleotide prediction	128-dimensional nucleotide embedding	Six-layer transformer, 16 attention heads	MRL, TE, EL, IRES prediction	Luciferase fitness, unseen UTR prediction	Motif analysis, UMAP
scGPT 💡🔍	scGPT: A Generative Pre-trained Transformer for Single-Cell Omics Data. bioRxiv (2023)	GitHub Repository	scRNA-seq	33M human cells, 441 studies, 51 tissues/organs	Gene expression ranked encoding, metadata tokens	512-dimensional gene-cell embeddings	12-layer transformer, masked multi-head attention	Cell type annotation, batch correction	Perturbation prediction, multi-omics integration	Attention weights, UMAP visualization
THItoGene	THItoGene: Integrating Histological Images and Spatial Transcriptomics for Gene Expression Prediction. bioRxiv (2023)	GitHub Repository	Histological images	HER2+ breast cancer (32 sections, 9,612 spots, 785 genes)	Spots tokenized via positional encoding; 112×112 patches for histology	Dynamic convolution with ViT and GAT integration	Hybrid: dynamic convolution, Efficient-CapsNet, ViT, GAT	Spatial gene expression patterns, tumor-related gene identification	Reconstruct spatial domains, predict enrichment in unseen tissues	Attention weights, ARI clustering, Pearson correlation
scTranslator 💡	scTranslator: A Transformer-Based Model for Single-Cell RNA-Seq Data Integration. bioRxiv (2023)	GitHub Repository	scRNA-seq	Bulk datasets (31 cancer types, 18,227 samples), Single-cell datasets (161,764 PBMCs, 65,698 pan-cancer myeloid cells)	Gene IDs via re-indexed GPE; RNA expression values as tokens	128-dim GPE embeddings + RNA embeddings	Transformer encoder-decoder, 2 layers, FAVOR+ attention	Protein abundance prediction, batch correction, pseudo-knockout analysis	Predict missing proteomics, tumor/normal cell origins	Attention matrices, pseudo-knockout analysis, ARI clustering
GPN-MSA 💡🔍	GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction. bioRxiv (2023)	GitHub Repository	DNA sequences	Whole-genome MSA of 100 vertebrates (~9B variants)	One-hot encoding across MSA columns; weighted token sampling	Contextual embeddings from MSA	12-layer Transformer with RoFormer; weighted cross-entropy loss	Variant deleteriousness scores, novel region annotation	Predict deleterious variants, annotate non-coding regions	UMAP, phastCons/phyloP correlation, epigenetic enrichment
FloraBERT 💡🔍	FloraBERT: cross-species transfer learning with attention-based neural networks for gene expression prediction. Research Square (2022)	GitHub Repository	Plant DNA sequences	~7.9M plant promoters (93 species); maize fine-tuning (25 genomes, 9 tissues)	Byte Pair Encoding (5,000-token vocabulary)	768-dim token + positional embeddings	RoBERTa-based Transformer, 6 encoder layers, 6 attention heads	Gene expression prediction across tissues	Regulatory potential in unseen species, cross-species similarity	Positional importance, UMAP embedding visualization, R² metrics
Enformer 💡🔍	Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods (2021)	GitHub Repository	DNA sequences	Human genome (34k training, 2k validation), mouse genome (29k training)	One-hot nucleotide encoding, spatial positional encodings	Convolutional embedding for initial sequence processing	7 convolutional layers + 11 transformer layers	Gene expression, enhancer-promoter interactions, variant effects	Variant prioritization, enhancer-gene annotation	Attention weights, SLDP, gradient × input for impact
CpGPT 💡🔍	CpGPT: A Transformer-Based Model for Predicting DNA Methylation States. bioRxiv (2023)	GitHub Repository	DNA methylation	1,500+ datasets, 100,000+ samples, various tissues and species	DNA sequence embeddings, methylation beta values, dual positional encodings	Pretrained DNA language model embeddings; epigenetic state embeddings	Transformer++ with dual positional encoding	Imputation, array conversion, age prediction, mortality prediction, tissue classification	Missing data imputation, array conversion, zero-shot reference mapping	Attention weights for CpG site importance, UMAP for sample embeddings
Hist2ST 🔍	Hist2ST: Integrating Histology and Spatial Transcriptomics for Spatial Gene Expression Prediction. bioRxiv (2023)	GitHub Repository	Histology, spatial transcriptomics	8 datasets (HER2+, cSCC, Alzheimer’s, mouse olfactory bulb, etc.)	Image patches (Convmixer), positional encodings, graph nodes	1024-dimensional embeddings (Convmixer, Transformer, GNN)	Convmixer + Transformer + Graph Neural Network (GNN)	Spatial gene expression prediction, clustering, spatial region identification	Cross-dataset prediction, annotation transfer	Attention maps, ARI, UMAP, Pearson correlation
Precious3GPT 💡🔍	Precious3GPT: Multimodal Multi-Species Multi-Omics Multi-Tissue Transformer for Aging Research and Drug Discovery. bioRxiv (2024)	Hugging Face Repository	Multi-omics (gene expression, DNA methylation, proteomics)	1,500+ datasets, 100,000+ samples, various tissues and species	Structured cell sentences (c-sentences) combining gene expression, metadata, and task prompts	360-dimensional embeddings capturing multi-omics context	Transformer-based architecture with 89 million parameters	Age prediction, target discovery, tissue classification, drug sensitivity prediction	Predict biological and phenotypic responses to compound treatments	Attention weights, SHAP value feature importance analysis
BioFormers 💡	BioFormers: A scalable framework for exploring biostates using transformers. Siham Amara-Belgadi et al. bioRxiv (2023)	GitHub Repository	scRNA-seq, multi-omics	PBMC 8k, Perturb-seq datasets (~12k cells, 5k genes); multi-omics data including genomic, proteomic, transcriptomic	Biomolecular tokens, value binning for expression levels	Transformer-based embeddings; biomolecular and sample embeddings	Encoder-only and decoder-only transformer models; self-attention mechanism	Cell clustering, masked gene modeling, GRN inference, genetic perturbation prediction	Zero-shot cell type discovery, cross-species transfer learning	Attention maps, gene embeddings, cosine similarity, CHIP-Atlas validation
Transformer DeepLncLoc 💡🔍	DeepLncLoc: A deep learning framework for long non-coding RNA subcellular localization prediction based on subsequence embedding. Min Zeng et al. Briefings in Bioinformatics (2022)	GitHub Repository	lncRNA sequences	RNALocate database; 857 samples, 5 subcellular localizations (cytoplasm, nucleus, ribosome, cytosol, exosome)	Subsequence embedding using k-mer splitting; Word2Vec	TextCNN for high-level feature extraction	TextCNN with subsequence embedding and pooling layers	Subcellular localization prediction for lncRNAs	Standalone generalization to new species	Attention visualization, feature comparisons
EPBDxDNABERT-2 💡🔍	DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors. Anowarul Kabir et al. Nucleic Acids Research (2024)	Not available	Genomic DNA sequences	690 ChIP-seq experiments (161 transcription factors, 91 human cell types); HT-SELEX data (215 TFs, 27 families)	Byte Pair Encoding (BPE) for genomic sequences; flanking region integration	Transformer embeddings; EPBD features for DNA breathing	Transformer architecture with cross-attention integration of DNABERT-2 and EPBD dynamics	Predict TF-DNA binding affinity, motif discovery, and binding response to mutations	Cross-species binding prediction, interpretability via cross-attention weights	Cross-attention heatmaps, motif validation via JASPAR database
Evo 💡🔍	Sequence modeling and design from molecular to genome scale with Evo. Eric Nguyen et al. Science (2024)	Not available	Genomic DNA, RNA, and protein sequences	2.7 million prokaryotic and phage genomes (~300 billion nucleotides)	Single-nucleotide byte-level tokenization	StripedHyena hybrid embeddings; 7 billion parameters; 131k token context	StripedHyena architecture with convolutional and attention layers	Predict fitness effects of mutations, functional CRISPR-Cas systems, transposon generation	Cross-species functional prediction, genome-scale design	Positional entropy, structure prediction, TUD clustering
GeneBERT 💡🔍	Multi-modal self-supervised pre-training for regulatory genome across cell types. Shentong Mo et al. arXiv (2021)	Not available	Genomic DNA sequences, transcription factor binding matrices	ATAC-seq data, 17 million sequences, 17 cell types	k-mer tokenization (3-6mers); transcription factor binding matrices	BERT-based embeddings for sequences, Swin transformer for regions	Transformer-based model combining sequence and region representations	Promoter classification, TFBS prediction, disease risk estimation, RNA splicing site prediction	Cross-cell type prediction of regulatory elements	Attention maps, t-SNE visualizations, ablation studies
GeneCompass 💡🔍	GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Xiaodong Yang et al. Cell Research (2024)	Not available	scRNA-seq, multi-omics	scCompass-126M corpus with 120M+ single-cell transcriptomes from human and mouse; 101.76M cells post-filtering	Ranked 2048-gene tokens; prior knowledge integration with GRN, promoter, gene families, and co-expression	12-layer transformer, 768-dimensional embeddings; species token prepending	Transformer architecture with self-attention and masked language modeling	Cell type annotation, GRN inference, drug response prediction, perturbation effects, cell fate predictions	Cross-species cell annotation, regulatory network predictions	Attention maps, cosine similarity, embedding space analysis
LangCell 💡🔍	LangCell: Language-Cell Pre-training for Cell Identity Understanding. Suyuan Zhao et al. Proceedings of the 41st International Conference on Machine Learning (2024)	GitHub Repository	scRNA-seq, multi-modal data	27.5M scRNA-seq samples, human cells with metadata from CELLxGENE	Rank value encoding; textual descriptions generated from OBO Foundry	Geneformer-based embeddings; BERT-based text encoder	Multi-task transformer model with contrastive learning and cross-attention	Cell type annotation, pathway identification, batch effect correction, novel disease-related tasks	Zero-shot cell type annotation, cross-type cell-text retrieval	UMAP visualizations, cross-attention scores, ablation studies
MOT 💡🔍	MOT: A Multi-Omics Transformer for Multiclass Classification Tumour Types Predictions. Mazid Abiodoun Osseni et al. BIOSTEC Proceedings (2023)	GitHub Repository	Multi-omics (mRNA, miRNA, DNA methylation, CNVs, proteomics)	TCGA Pan-Cancer dataset (33 cancer types, 5 omics, imbalanced samples)	Per-omic tokenization with MAD and mutual info for feature selection	Embeddings with multi-head attention for omics integration	Transformer encoder-decoder without positional encoding	Tumor type classification, robustness to missing omics views	Cross-omics classification, interpretability of omic contributions	Attention heatmaps, omics impact analysis via ablation
MuSe-GNN 💡🔍	MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data. Tianyu Liu et al. NeurIPS (2023)	GitHub Repository	scRNA-seq, spatial data, scATAC-seq	82 datasets across 10 tissues, 3 sequencing techniques, 3 species	HVG filtering, scTransform, SPARK-X; multimodal graph co-expression	Graph embeddings with TransformerConv layers; weight-sharing GNNs	Cross-graph Transformer integrating contrastive and similarity learning	Gene embeddings for functional similarity, pathway enrichment, GRN inference, disease analysis	Cross-species functional predictions, COVID and cancer gene analyses	UMAPs, causal network analysis, GOEA, IPA
Pathformer 💡🔍	Pathformer: A biological pathway-informed transformer for disease diagnosis and prognosis using multi-omics data. Xiaofan Liu et al. Bioinformatics (2024)	GitHub Repository	Multi-omics (RNA expression, DNA methylation, CNVs, splicing, editing)	TCGA (33 cancer types), plasma cfRNA, platelet RNA datasets; 10 tissue and liquid biopsy datasets	Multi-modal vector embedding at gene level, pathway sparse neural network	Pathway embeddings updated via criss-cross attention	Transformer with crosstalk-aware attention, sparse NN for pathway integration	Cancer diagnosis, stage prediction, drug response, survival prognosis	Cross-modal cancer screening, pathway-level interpretability	SHAP values, attention maps, crosstalk network visualization
RhoFold+ 💡🔍	Accurate RNA 3D structure prediction using a language model-based deep learning approach. Tao Shen et al. Nature Methods (2024)	Not available	RNA sequences	23.7M RNA sequences, 800k species, 5,583 chains; RNA-Puzzles, CASP15 datasets	RNA-specific tokenization with MSA embeddings	Rhoformer transformer with IPA for geometry-aware embeddings	Transformer-based architecture with secondary and tertiary structural constraints	RNA 3D structure prediction, secondary structure inference, interhelical angle calculation	Cross-type RNA predictions, artifact corrections, construct engineering	Attention maps, IHAD (interhelical angle difference), RMSD analysis
SATURN 💡🔍	Toward universal cell embeddings: integrating single-cell RNA-seq datasets across species with SATURN. Yanay Rosen et al. Nature Methods (2024)	Not available	scRNA-seq, protein sequences	335,000 cells from 3 species (Tabula Sapiens, Tabula Microcebus, Tabula Muris), 97,000 frog cells, 63,000 zebrafish cells	k-means clustering of protein embeddings into macrogenes	Macrogene-based embeddings derived from protein language models	Pretrained autoencoder with ZINB loss, fine-tuned using triplet margin loss	Cross-species dataset integration, differential macrogene expression, species-specific cell type discovery	Zero-shot cross-species annotation, integration of remote evolutionary datasets	UMAP visualization, GO term enrichment, protein embedding analysis
scELMo 💡🔍	scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis. Tianyu Liu et al. bioRxiv (2024)	Not available	scRNA-seq, multi-omics	20 datasets across scRNA-seq, proteomics, and multi-omics data; diverse species	Text embeddings from GPT-3.5 metadata summaries; weighted average and arithmetic mean cell embeddings	Lightweight neural networks; contrastive learning for task-specific fine-tuning	Zero-shot framework with embeddings and fine-tuning for diverse tasks	Cell clustering, batch effect correction, cell-type annotation, in-silico treatment analysis	Cross-dataset integration, perturbation prediction	UMAP visualizations, cosine similarity, pathway enrichment (GOEA, IPA)
scLong 💡	scLong: A Billion-Parameter Foundation Model for Capturing Long-Range Gene Context in Single-Cell Transcriptomics. Ding Bai et al. bioRxiv (2024)	Not available	scRNA-seq, multi-omics	48M cells, 27,874 genes from 1,618 datasets, covering diverse tissues and cell types	Full transcriptome self-attention; Gene Ontology integration with GCNs	Dual encoder for high- and low-expression genes; contextual representations via Performer	Transformer with self-attention, graph convolution for gene knowledge integration	Gene regulatory network inference, transcriptional response prediction, drug synergy analysis	Cross-species gene annotations, transcriptional shifts prediction	Attention maps, hierarchical clustering, GO-based feature analysis
scMoFormer 💡🔍	Single-Cell Multimodal Prediction via Transformers. Wenzhuo Tang et al. CIKM (2023)	GitHub Repository	scRNA-seq, surface protein data	NeurIPS 2021 and 2022 competition datasets (GEX2ADT, CITE-seq); CBMC dataset	Graph construction with STRING database; SVD for RNA denoising	Multimodal transformers and graph-based embeddings	Cell, gene, and protein transformers with graph-based cross-modality aggregation	Surface protein abundance prediction, multimodal integration	Generalization to unseen modalities and datasets	Attention maps, RMSE, MAE, Pearson correlation coefficient
SpatialDiffusion 💡	SpatialDiffusion: Predicting Spatial Transcriptomics with Denoising Diffusion Probabilistic Models. Sumeer Ahmad Khan et al. bioRxiv (2024)	Not available	Spatial transcriptomics	MERFISH (12 slices, mouse hypothalamic preoptic region, ~73,655 spots, 161 genes); Starmap (mouse visual cortex, 984 spots, 1,020 genes); DLPFC (human dorsolateral prefrontal cortex, ~3,431 spots, 3,000 genes)	Embedding and linear transformations of spatial and cell-type features	Diffusion embeddings for spatial relationships; contextualized latent representations	Denoising Diffusion Probabilistic Model (DDPM) with enhanced embeddings	In silico slice interpolation, transcriptomic profile reconstruction	Cross-slice interpolation, structure preservation across regions	Spearman correlation, neighborhood enrichment, normalized MSE
TransformerST 💡🔍	Innovative super-resolution in spatial transcriptomics: a transformer model exploiting histology images and spatial gene expression. Chongyue Zhao et al. Briefings in Bioinformatics (2024)	GitHub Repository	Spatial transcriptomics, histology images	Human dorsolateral prefrontal cortex (LIBD), melanoma, IDC (HER+ breast cancer), mouse lung tissues	Spot-centric and sliding-window patch extraction; positional encodings	Vision Transformer for image patches; Graph Transformer for spatial embeddings	Cross-scale graph network for super-resolution; adaptive graph transformer for clustering	Tissue identification, gene expression reconstruction at single-cell resolution	Super-resolution without scRNA-seq references; cross-platform adaptability	Adjusted Rand Index (ARI), clustering accuracy, UMAP visualizations
UCE 💡🔍	Universal Cell Embedding: A Foundation Model for Cell Biology. Yanay Rosen et al. bioRxiv (2024)	Not available	scRNA-seq, protein sequences	36 million cells, 1,000+ cell types, 300 datasets, 50 tissues, 8 species (e.g., human, mouse, zebrafish)	Protein embeddings with ESM2, expression-weighted sampling	Transformer-based embeddings with 33 layers and 650M parameters	Transformer architecture integrating protein and expression data	Zero-shot cell type prediction, dataset integration, species-level gene alignment	Cross-species embedding, atlas-scale cell annotation, disease cell mapping	UMAP visualizations, silhouette width, adjusted Rand Index
scMulan 💡🔍	scMulan: A Multitask Generative Pre-Trained Language Model for Single-Cell Analysis. Haiyang Bian et al. Research in Computational Molecular Biology (RECOMB) (2024)	GitHub Repository	scRNA-seq, multi-omics	hECA-10M (~10 million human single cells); 42,117 genes with meta-attributes	Unified c-sentences encoding meta-attributes and expression levels	Transformer decoder with shuffled token embeddings	Generative pretraining using masked c-sentences; 368M parameters	Cell type annotation, batch integration, conditional cell generation	Zero-shot cell type annotation, batch integration, conditional cell generation	UMAP visualizations, pseudo-time embeddings, cosine similarity
Geneformer 💡🔍	Transfer learning enables predictions in network biology. Christina V. Theodoris et al. Nature (2023)	Hugging Face Repository; GitHub Repository	scRNA-seq	Genecorpus-30M (29.9M human single-cell transcriptomes); 561 datasets, diverse tissues	Rank value encoding of transcriptomes; context-aware self-attention	Transformer encoder (6 layers, 4 attention heads, 256 dimensions)	Pretrained transformer for contextual embeddings, fine-tuned for network biology tasks	Gene dosage prediction, chromatin dynamics, cell type annotations, disease modeling	Context-aware predictions for rare diseases, cross-tissue integration	Attention maps, in silico perturbation, embedding space clustering

Benchmarking Papers

📄 Paper	💻 Code	🧠 Benchmarking Models	🌟 Main Focus	📝 Results & Insights

Review/Perspective Papers

📄 Paper	🌟 Highlights/Main Focus	📝 Remarks & Conclusion
