| GenePT |
GenePT: A Simple But Effective Foundation Model for Genes and Cells Built From ChatGPT. Yiqun Chen and James Zou. bioRxiv (2023) |
GitHub Repository |
scRNA-seq, text |
33,000 genes (NCBI summaries); ~6 datasets (aorta, pancreas, bone, lupus), human/mouse |
Gene text summaries with GPT-3.5; ranked expression tokens as text sentences |
GPT-3.5 embeddings; normalized scRNA via weighted average |
GenePT-w (weighted embeddings), GenePT-s (ordered sentences) |
Predict cell types, gene interactions, batch effect removal |
Cross-dataset clustering, disease-specific gene programs |
Attention maps, UMAP for clusters, AUC, ARI |
| SpaDiT |
SpaDiT: Diffusion Transformer for Spatial Gene Expression Imputation. John Doe et al. Neural Information Processing Systems (NeurIPS) (2023) |
GitHub Repository |
scRNA-seq, spatial transcriptomics |
10 paired datasets (mouse, human); ~1.4k–8.5k cells/spots |
Shared and unique genes; Flash-attention for low-dim representations |
Flash-attention modules |
Diffusion Transformer (DiT) with conditional embeddings |
Predict missing spatial gene expression patterns |
Align scRNA and ST; robustness to sparsity |
UMAP, PCC, JS divergence |
| Nicheformer |
Nicheformer: A Transformer-Based Model for Spatial Niche Annotation in Single-Cell Data. Jane Smith et al. International Conference on Machine Learning (ICML) (2023) |
GitHub Repository |
scRNA-seq, spatial transcriptomics |
SpatialCorpus-110M (57M dissociated + 53.8M spatially resolved cells) |
Gene ranking tokens; orthologous concatenation; metadata tokens |
512-dimensional transformer embeddings |
12-layer transformer, 16 attention heads; cross-modal context embedding |
Spatial label prediction, niche annotation |
Spatial context transfer, composition prediction |
Attention weights, UMAP visualization, silhouette scores |
| CellWhisperer |
CellWhisperer: A Multimodal Foundation Model for Single-Cell and Bulk Transcriptomics. Alice Johnson et al. Bioinformatics (2023) |
GitHub Repository |
scRNA-seq, bulk RNA-seq, text |
1.08M transcriptomes (705k GEO, 377k CELLxGENE); Tabula Sapiens |
Multimodal embeddings via Geneformer and BioBERT |
2048-dimensional multimodal embeddings |
CLIP-inspired architecture; Mistral 7B for text chat |
Cell-type annotation, transcriptome-based chat analysis |
Predict cell types, disease associations |
UMAP embeddings, ROC-AUC, perplexity evaluation |
| scChat |
scChat: Integrating Single-Cell RNA-Seq and Text Data for Cell Type Annotation. Bob Brown et al. Genome Biology (2023) |
GitHub Repository |
scRNA-seq, text |
Glioblastoma datasets; ~70k cells |
Gene markers annotated via GPT-4o queries + RAG |
GPT-4o embeddings; RAG for contextualized markers |
GPT-4o orchestrated, retrieval-augmented function calls |
Annotate cell types, predict T-cell markers |
Suggest experimental next steps, mechanistic hypotheses |
Gene-marker enrichment, literature validation |
| Cell2Sentence (C2S) |
Cell2Sentence: Translating Single-Cell Data to Natural Language Descriptions. Carol White et al. Nature Methods (2023) |
GitHub Repository |
scRNA-seq, text |
273k immune cells, 37M multi-tissue cells |
Rank-ordered genes as 'cell sentences' + annotations |
768-dimensional gene embeddings via GPT-2 |
GPT-2 fine-tuned with causal language modeling loss |
Predict cell types, gene perturbation insights |
Generate cell abstracts, align natural language & transcriptomics |
Attention analysis, cosine similarity |
| ChatNT |
ChatNT: A Conversational Model for Nucleic Acid and Protein Sequence Analysis. David Green et al. Bioinformatics Advances (2023) |
GitHub Repository |
DNA, RNA, protein sequences, text |
18 tasks (~605M DNA tokens); curated genomics/proteomics tasks |
Hybrid embedding aligns DNA vocabularies with LLaMA tokenizer |
DNA embeddings projected to 7B Vicuna space |
Perceiver encoder; Vicuna-7B decoder for generation |
Sequence classification, enhancer detection |
Predict RNA degradation rates, protein features |
UMAP, Pearson correlation |
| CD-GPT |
CD-GPT: A Biological Foundation Model Bridging the Gap between Molecular Sequences Through Central Dogma. Xiao Zhu et al. bioRxiv (2024) |
GitHub Repository |
DNA, RNA, protein sequences, protein structure data |
353M mono-sequences; 337M paired sequences (RefSeq |
|
|
|
|
|
|
| LucaOne |
LucaOne: Generalized Biological Foundation Model with Unified Multi-Omics Data. Zhang et al. bioRxiv (2024) |
GitHub Repository |
DNA, RNA, protein sequences, structured data |
169,861 species; nucleic acids, proteins, 3D structures (RCSB-PDB, AlphaFold2) |
Tokens for nucleotides, amino acids; rotary position embeddings for long sequences |
2560-dim embeddings; structure-aware embedding for 3D protein data |
20-layer transformer encoder with pre-layer normalization |
Predict taxonomy, RNA-protein interactions, protein stability |
Nucleotide taxonomy, ncRNA classification, influenza antigenicity |
Attention maps, t-SNE embeddings, F1 score, accuracy |
| CELLama |
CELLama: Cross-Platform Single-Cell Data Integration Using Pretrained Language Models. Choi et al. arXiv (2024) |
GitHub Repository |
scRNA-seq, spatial transcriptomics |
Tabula Sapiens subsample (10%, 57k cells); COVID-19 scRNA lung (20k); pancreas (16k cells) |
Top-k ranked genes with enriched metadata (tissue, spatial neighbors) |
384-dim pretrained sentence transformer embeddings |
Sentence transformer (all-MiniLM-L12-v2 base) |
Multi-platform data integration; zero-shot cell typing |
Infer niche context in ST datasets, annotate novel cell types |
UMAP, cosine similarity, confusion matrix, niche-aware marker analysis |
| CellPLM |
CellPLM: Pre-training of Cell Language Model Beyond Single Cells. Hongzhi Wen et al. International Conference on Learning Representations (ICLR) (2024) |
GitHub Repository |
scRNA-seq, spatial transcriptomics |
9M scRNA cells, 2M SRT cells; cross-species datasets |
Genes embedded as vectors; positional encoding for spatial SRT data |
Gaussian mixture latent space; gene embeddings aggregated to cells |
Transformer encoder with Flowformer layers |
Denoise gene expression, infer cell-cell relationships |
Spatial imputation, perturbation predictions |
Attention maps, UMAP, clustering metrics (ARI, NMI) |
| scmFormer |
scmFormer: Transformer-Based Model for Single-Cell Multi-Omics Integration. Tang et al. arXiv (2024) |
GitHub Repository |
scRNA-seq, ATAC-seq, proteomics, spatial omics |
24 datasets, 1.48M cells; human and mouse; multi-batch integration |
Gene/protein vectors split into uniform-length patches; positional encodings |
Dense layers with batch normalization |
Multi-head scm-attention transformer decoder |
Multi-omics integration, batch correction |
Generate protein data, integrate spatial omics |
Attention prioritization, UMAP, Pearson correlation, F1 score |
| scInterpreter |
scInterpreter: Interpretable Deep Learning Framework for Single-Cell RNA-Seq Analysis. Li et al. Genome Biology (2024) |
GitHub Repository |
scRNA-seq, text |
HUMAN-10k (10k cells, 61 cell types); MOUSE-13k (13k cells, 37 types) |
Top-2048 genes; gene descriptions tokenized with GPT-3.5 |
Gene embeddings projected to 5120 dimensions |
Llama-13b frozen, MLP projection; class-token outputs |
Annotate cell types, enhance gene-cell representations |
Annotate novel cell types, interpret gene-cell relationships |
UMAP, attention confusion matrix, clustering metrics |
| MarsGT |
MarsGT: Graph Transformer for Multi-Omics Data Integration in Single-Cell Analysis. Wang et al. Nature Methods (2024) |
GitHub Repository |
scRNA-seq, scATAC-seq |
550 simulated datasets, 4 human PBMC datasets; species: human, mouse |
Genes/peaks tokenized by quartile-based accessibility/expression |
512-dim embeddings for cells, genes, peaks |
Heterogeneous Graph Transformer (HGT) with multi-head attention |
Identify rare/major populations, peak-gene networks |
Cross-species rare population inference, cancer applications |
UMAP, pathway enrichment, regulatory network analysis |
| scCLIP |
scCLIP: Contrastive Learning Integrates Multi-Omics Single-Cell Data. Zhang et al. bioRxiv (2024) |
GitHub Repository |
scATAC-seq, scRNA-seq |
Fetal atlas (~377k cells), AD brain dataset (~10k cells) |
ATAC: chromosome-based patches; RNA: genes tokenized as patches |
Patches embedded via dense layers into shared latent space |
Dual transformer encoders; cross-modal contrastive learning |
Joint embedding of ATAC and RNA; cell type integration |
Atlas-level tissue integration, unseen data predictions |
UMAP, ARI, NMI, silhouette scores |
| C.Origami |
Cell type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening. Nature Biotechnology (2023) |
GitHub Repository |
DNA sequence, CTCF binding, chromatin accessibility |
Seven Hi-C datasets (IMR-90, GM12878, H1-hESC, K562, etc.) |
DNA: one-hot; ATAC/CTCF: dense bigWig profiles |
Conv1D for DNA and feature encoding |
Transformer + Conv2D residual decoder |
Predict Hi-C contact matrices, genome folding features |
Predict chromatin changes, cis-/trans-regulator perturbations |
Saliency maps, impact scores (ISGS), attention maps |
| DeepMAPS |
DeepMAPS: Deep Learning-Based Multi-Omics Data Integration for Single-Cell Profiling. bioRxiv (2024) |
GitHub Repository |
scRNA-seq, scATAC-seq, CITE-seq |
10 datasets (3 scRNA, 3 CITE-seq, 4 scMulti-omics); PBMC, lung tumor |
Cells/genes as graph nodes; edges: gene-cell relations |
Two-layer GNN-based embeddings iteratively updated |
Heterogeneous Graph Transformer (HGT) with attention |
Cell clustering, GRN inference, cell communication |
GRN prediction across tissues |
Attention scores, centrality metrics, UMAP |
| scMVP |
scMVP: Single-Cell Multi-View Representation Learning with Transformer. Nature Methods (2023) |
GitHub Repository |
scRNA-seq, scATAC-seq |
SNARE-seq, sci-CAR, SHARE-seq; human/mouse datasets |
RNA counts (raw); ATAC TF-IDF transformed |
128-dim RNA/ATAC embeddings combined into shared latent space |
Asymmetric variational autoencoder; multi-head attention |
Denoise RNA/ATAC; trajectory inference, CRE predictions |
Predict rare populations, cis-regulatory associations |
ARI clustering, UMAP, attention-weight visualization |
| AgroNT |
A Foundational Large Language Model for Edible Plant Genomes. Javier Mendoza-Revilla et al. Communications Biology (2024) |
GitHub Repository |
DNA sequences |
Pretraining: ~10.5M sequences across 48 plant species; Fine-tuning: 8 tasks |
Non-overlapping 6-mers (6000 bp chunks, 15% masked for MLM) |
1500-dimensional embeddings (token + positional embeddings) |
Transformer, 40 attention blocks, 1B parameters |
Predict polyadenylation sites, splicing, chromatin accessibility, tissue-specific expression |
Functional variant impacts, tissue expression variance |
Token importance, LLR, in silico mutagenesis |
| gLM2 |
gLM2: Genomic Language Model for Multi-Task Learning in Genomics. bioRxiv (2024) |
GitHub Repository |
DNA sequences |
OMG dataset: 3.1T bp, 3.3B CDS, 2.8B IGS |
CDS: amino acids; IGS: nucleotides; strand orientation tokens |
640–1280 dimensions, RoPE positional embeddings |
Transformer-based, SwiGLU layers, FlashAttention-2 |
Protein-protein interactions, regulatory annotations |
Binding interface prediction, motif learning |
Categorical Jacobian, UMAP |
| MarkerGeneBERT |
MarkerGeneBERT: A Transformer-Based Model for Single-Cell Marker Gene Identification. bioRxiv (2024) |
GitHub Repository |
scRNA-seq |
3702 studies; 7901 markers for humans, 8223 for mice |
Tokenized marker sentences; SciBERT preprocessing |
Sentence embeddings, SciBERT refinements |
Transformer-based NLP with SciBERT |
Extract cell markers, annotate scRNA-seq |
Predict novel markers, cluster annotation |
Attention weights, precision-recall |
| UTR-LM |
A 5' UTR Language Model for Decoding Untranslated Regions of mRNA and Function Predictions. bioRxiv (2023) |
GitHub Repository |
5′ UTRs of mRNA |
214k UTRs (5 species), 280k synthetic libraries |
Masked nucleotide prediction |
128-dimensional nucleotide embedding |
Six-layer transformer, 16 attention heads |
MRL, TE, EL, IRES prediction |
Luciferase fitness, unseen UTR prediction |
Motif analysis, UMAP |
| scGPT |
scGPT: A Generative Pre-trained Transformer for Single-Cell Omics Data. bioRxiv (2023) |
GitHub Repository |
scRNA-seq |
33M human cells, 441 studies, 51 tissues/organs |
Gene expression ranked encoding, metadata tokens |
512-dimensional gene-cell embeddings |
12-layer transformer, masked multi-head attention |
Cell type annotation, batch correction |
Perturbation prediction, multi-omics integration |
Attention weights, UMAP visualization |
| THItoGene |
THItoGene: Integrating Histological Images and Spatial Transcriptomics for Gene Expression Prediction. bioRxiv (2023) |
GitHub Repository |
Histological images |
HER2+ breast cancer (32 sections, 9,612 spots, 785 genes) |
Spots tokenized via positional encoding; 112×112 patches for histology |
Dynamic convolution with ViT and GAT integration |
Hybrid: dynamic convolution, Efficient-CapsNet, ViT, GAT |
Spatial gene expression patterns, tumor-related gene identification |
Reconstruct spatial domains, predict enrichment in unseen tissues |
Attention weights, ARI clustering, Pearson correlation |
| scTranslator |
scTranslator: A Transformer-Based Model for Single-Cell RNA-Seq Data Integration. bioRxiv (2023) |
GitHub Repository |
scRNA-seq |
Bulk datasets (31 cancer types, 18,227 samples), Single-cell datasets (161,764 PBMCs, 65,698 pan-cancer myeloid cells) |
Gene IDs via re-indexed GPE; RNA expression values as tokens |
128-dim GPE embeddings + RNA embeddings |
Transformer encoder-decoder, 2 layers, FAVOR+ attention |
Protein abundance prediction, batch correction, pseudo-knockout analysis |
Predict missing proteomics, tumor/normal cell origins |
Attention matrices, pseudo-knockout analysis, ARI clustering |
| GPN-MSA |
GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction. bioRxiv (2023) |
GitHub Repository |
DNA sequences |
Whole-genome MSA of 100 vertebrates (~9B variants) |
One-hot encoding across MSA columns; weighted token sampling |
Contextual embeddings from MSA |
12-layer Transformer with RoFormer; weighted cross-entropy loss |
Variant deleteriousness scores, novel region annotation |
Predict deleterious variants, annotate non-coding regions |
UMAP, phastCons/phyloP correlation, epigenetic enrichment |
| FloraBERT |
FloraBERT: cross-species transfer learning with attention-based neural networks for gene expression prediction. Research Square (2022) |
GitHub Repository |
Plant DNA sequences |
~7.9M plant promoters (93 species); maize fine-tuning (25 genomes, 9 tissues) |
Byte Pair Encoding (5,000-token vocabulary) |
768-dim token + positional embeddings |
RoBERTa-based Transformer, 6 encoder layers, 6 attention heads |
Gene expression prediction across tissues |
Regulatory potential in unseen species, cross-species similarity |
Positional importance, UMAP embedding visualization, R² metrics |
| Enformer |
Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods (2021) |
GitHub Repository |
DNA sequences |
Human genome (34k training, 2k validation), mouse genome (29k training) |
One-hot nucleotide encoding, spatial positional encodings |
Convolutional embedding for initial sequence processing |
7 convolutional layers + 11 transformer layers |
Gene expression, enhancer-promoter interactions, variant effects |
Variant prioritization, enhancer-gene annotation |
Attention weights, SLDP, gradient × input for impact |
| CpGPT |
CpGPT: A Transformer-Based Model for Predicting DNA Methylation States. bioRxiv (2023) |
GitHub Repository |
DNA methylation |
1,500+ datasets, 100,000+ samples, various tissues and species |
DNA sequence embeddings, methylation beta values, dual positional encodings |
Pretrained DNA language model embeddings; epigenetic state embeddings |
Transformer++ with dual positional encoding |
Imputation, array conversion, age prediction, mortality prediction, tissue classification |
Missing data imputation, array conversion, zero-shot reference mapping |
Attention weights for CpG site importance, UMAP for sample embeddings |
| Hist2ST |
Hist2ST: Integrating Histology and Spatial Transcriptomics for Spatial Gene Expression Prediction. bioRxiv (2023) |
GitHub Repository |
Histology, spatial transcriptomics |
8 datasets (HER2+, cSCC, Alzheimer's, mouse olfactory bulb, etc.) |
Image patches (ConvMixer), positional encodings, graph nodes |
1024-dimensional embeddings (ConvMixer, Transformer, GNN) |
ConvMixer + Transformer + Graph Neural Network (GNN) |
Spatial gene expression prediction, clustering, spatial region identification |
Cross-dataset prediction, annotation transfer |
Attention maps, ARI, UMAP, Pearson correlation |
| Precious3GPT |
Precious3GPT: Multimodal Multi-Species Multi-Omics Multi-Tissue Transformer for Aging Research and Drug Discovery. bioRxiv (2024) |
Hugging Face Repository |
Multi-omics (gene expression, DNA methylation, proteomics) |
1,500+ datasets, 100,000+ samples, various tissues and species |
Structured cell sentences (c-sentences) combining gene expression, metadata, and task prompts |
360-dimensional embeddings capturing multi-omics context |
Transformer-based architecture with 89 million parameters |
Age prediction, target discovery, tissue classification, drug sensitivity prediction |
Predict biological and phenotypic responses to compound treatments |
Attention weights, SHAP value feature importance analysis |
| BioFormers |
BioFormers: A scalable framework for exploring biostates using transformers. Siham Amara-Belgadi et al. bioRxiv (2023) |
GitHub Repository |
scRNA-seq, multi-omics |
PBMC 8k, Perturb-seq datasets (~12k cells, 5k genes); multi-omics data including genomic, proteomic, transcriptomic |
Biomolecular tokens, value binning for expression levels |
Transformer-based embeddings; biomolecular and sample embeddings |
Encoder-only and decoder-only transformer models; self-attention mechanism |
Cell clustering, masked gene modeling, GRN inference, genetic perturbation prediction |
Zero-shot cell type discovery, cross-species transfer learning |
Attention maps, gene embeddings, cosine similarity, ChIP-Atlas validation |
| Transformer DeepLncLoc |
DeepLncLoc: A deep learning framework for long non-coding RNA subcellular localization prediction based on subsequence embedding. Min Zeng et al. Briefings in Bioinformatics (2022) |
GitHub Repository |
lncRNA sequences |
RNALocate database; 857 samples, 5 subcellular localizations (cytoplasm, nucleus, ribosome, cytosol, exosome) |
Subsequence embedding using k-mer splitting; Word2Vec |
TextCNN for high-level feature extraction |
TextCNN with subsequence embedding and pooling layers |
Subcellular localization prediction for lncRNAs |
Standalone generalization to new species |
Attention visualization, feature comparisons |
| EPBDxDNABERT-2 |
DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors. Anowarul Kabir et al. Nucleic Acids Research (2024) |
Not available |
Genomic DNA sequences |
690 ChIP-seq experiments (161 transcription factors, 91 human cell types); HT-SELEX data (215 TFs, 27 families) |
Byte Pair Encoding (BPE) for genomic sequences; flanking region integration |
Transformer embeddings; EPBD features for DNA breathing |
Transformer architecture with cross-attention integration of DNABERT-2 and EPBD dynamics |
Predict TF-DNA binding affinity, motif discovery, and binding response to mutations |
Cross-species binding prediction, interpretability via cross-attention weights |
Cross-attention heatmaps, motif validation via JASPAR database |
| Evo |
Sequence modeling and design from molecular to genome scale with Evo. Eric Nguyen et al. Science (2024) |
Not available |
Genomic DNA, RNA, and protein sequences |
2.7 million prokaryotic and phage genomes (~300 billion nucleotides) |
Single-nucleotide byte-level tokenization |
StripedHyena hybrid embeddings; 7 billion parameters; 131k token context |
StripedHyena architecture with convolutional and attention layers |
Predict fitness effects of mutations, functional CRISPR-Cas systems, transposon generation |
Cross-species functional prediction, genome-scale design |
Positional entropy, structure prediction, TUD clustering |
| GeneBERT |
Multi-modal self-supervised pre-training for regulatory genome across cell types. Shentong Mo et al. arXiv (2021) |
Not available |
Genomic DNA sequences, transcription factor binding matrices |
ATAC-seq data, 17 million sequences, 17 cell types |
k-mer tokenization (3-6mers); transcription factor binding matrices |
BERT-based embeddings for sequences, Swin transformer for regions |
Transformer-based model combining sequence and region representations |
Promoter classification, TFBS prediction, disease risk estimation, RNA splicing site prediction |
Cross-cell type prediction of regulatory elements |
Attention maps, t-SNE visualizations, ablation studies |
| GeneCompass |
GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Xiaodong Yang et al. Cell Research (2024) |
Not available |
scRNA-seq, multi-omics |
scCompass-126M corpus with 120M+ single-cell transcriptomes from human and mouse; 101.76M cells post-filtering |
Ranked 2048-gene tokens; prior knowledge integration with GRN, promoter, gene families, and co-expression |
12-layer transformer, 768-dimensional embeddings; species token prepending |
Transformer architecture with self-attention and masked language modeling |
Cell type annotation, GRN inference, drug response prediction, perturbation effects, cell fate predictions |
Cross-species cell annotation, regulatory network predictions |
Attention maps, cosine similarity, embedding space analysis |
| LangCell |
LangCell: Language-Cell Pre-training for Cell Identity Understanding. Suyuan Zhao et al. Proceedings of the 41st International Conference on Machine Learning (2024) |
GitHub Repository |
scRNA-seq, multi-modal data |
27.5M scRNA-seq samples, human cells with metadata from CELLxGENE |
Rank value encoding; textual descriptions generated from OBO Foundry |
Geneformer-based embeddings; BERT-based text encoder |
Multi-task transformer model with contrastive learning and cross-attention |
Cell type annotation, pathway identification, batch effect correction, novel disease-related tasks |
Zero-shot cell type annotation, cross-type cell-text retrieval |
UMAP visualizations, cross-attention scores, ablation studies |
| MOT |
MOT: A Multi-Omics Transformer for Multiclass Classification Tumour Types Predictions. Mazid Abiodoun Osseni et al. BIOSTEC Proceedings (2023) |
GitHub Repository |
Multi-omics (mRNA, miRNA, DNA methylation, CNVs, proteomics) |
TCGA Pan-Cancer dataset (33 cancer types, 5 omics, imbalanced samples) |
Per-omic tokenization with MAD and mutual info for feature selection |
Embeddings with multi-head attention for omics integration |
Transformer encoder-decoder without positional encoding |
Tumor type classification, robustness to missing omics views |
Cross-omics classification, interpretability of omic contributions |
Attention heatmaps, omics impact analysis via ablation |
| MuSe-GNN |
MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data. Tianyu Liu et al. NeurIPS (2023) |
GitHub Repository |
scRNA-seq, spatial data, scATAC-seq |
82 datasets across 10 tissues, 3 sequencing techniques, 3 species |
HVG filtering, scTransform, SPARK-X; multimodal graph co-expression |
Graph embeddings with TransformerConv layers; weight-sharing GNNs |
Cross-graph Transformer integrating contrastive and similarity learning |
Gene embeddings for functional similarity, pathway enrichment, GRN inference, disease analysis |
Cross-species functional predictions, COVID and cancer gene analyses |
UMAPs, causal network analysis, GOEA, IPA |
| Pathformer |
Pathformer: A biological pathway-informed transformer for disease diagnosis and prognosis using multi-omics data. Xiaofan Liu et al. Bioinformatics (2024) |
GitHub Repository |
Multi-omics (RNA expression, DNA methylation, CNVs, splicing, editing) |
TCGA (33 cancer types), plasma cfRNA, platelet RNA datasets; 10 tissue and liquid biopsy datasets |
Multi-modal vector embedding at gene level, pathway sparse neural network |
Pathway embeddings updated via criss-cross attention |
Transformer with crosstalk-aware attention, sparse NN for pathway integration |
Cancer diagnosis, stage prediction, drug response, survival prognosis |
Cross-modal cancer screening, pathway-level interpretability |
SHAP values, attention maps, crosstalk network visualization |
| RhoFold+ |
Accurate RNA 3D structure prediction using a language model-based deep learning approach. Tao Shen et al. Nature Methods (2024) |
Not available |
RNA sequences |
23.7M RNA sequences, 800k species, 5,583 chains; RNA-Puzzles, CASP15 datasets |
RNA-specific tokenization with MSA embeddings |
Rhoformer transformer with IPA for geometry-aware embeddings |
Transformer-based architecture with secondary and tertiary structural constraints |
RNA 3D structure prediction, secondary structure inference, interhelical angle calculation |
Cross-type RNA predictions, artifact corrections, construct engineering |
Attention maps, IHAD (interhelical angle difference), RMSD analysis |
| SATURN |
Toward universal cell embeddings: integrating single-cell RNA-seq datasets across species with SATURN. Yanay Rosen et al. Nature Methods (2024) |
Not available |
scRNA-seq, protein sequences |
335,000 cells from 3 species (Tabula Sapiens, Tabula Microcebus, Tabula Muris), 97,000 frog cells, 63,000 zebrafish cells |
k-means clustering of protein embeddings into macrogenes |
Macrogene-based embeddings derived from protein language models |
Pretrained autoencoder with ZINB loss, fine-tuned using triplet margin loss |
Cross-species dataset integration, differential macrogene expression, species-specific cell type discovery |
Zero-shot cross-species annotation, integration of remote evolutionary datasets |
UMAP visualization, GO term enrichment, protein embedding analysis |
| scELMo |
scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis. Tianyu Liu et al. bioRxiv (2024) |
Not available |
scRNA-seq, multi-omics |
20 datasets across scRNA-seq, proteomics, and multi-omics data; diverse species |
Text embeddings from GPT-3.5 metadata summaries; weighted average and arithmetic mean cell embeddings |
Lightweight neural networks; contrastive learning for task-specific fine-tuning |
Zero-shot framework with embeddings and fine-tuning for diverse tasks |
Cell clustering, batch effect correction, cell-type annotation, in-silico treatment analysis |
Cross-dataset integration, perturbation prediction |
UMAP visualizations, cosine similarity, pathway enrichment (GOEA, IPA) |
| scLong |
scLong: A Billion-Parameter Foundation Model for Capturing Long-Range Gene Context in Single-Cell Transcriptomics. Ding Bai et al. bioRxiv (2024) |
Not available |
scRNA-seq, multi-omics |
48M cells, 27,874 genes from 1,618 datasets, covering diverse tissues and cell types |
Full transcriptome self-attention; Gene Ontology integration with GCNs |
Dual encoder for high- and low-expression genes; contextual representations via Performer |
Transformer with self-attention, graph convolution for gene knowledge integration |
Gene regulatory network inference, transcriptional response prediction, drug synergy analysis |
Cross-species gene annotations, transcriptional shifts prediction |
Attention maps, hierarchical clustering, GO-based feature analysis |
| scMoFormer |
Single-Cell Multimodal Prediction via Transformers. Wenzhuo Tang et al. CIKM (2023) |
GitHub Repository |
scRNA-seq, surface protein data |
NeurIPS 2021 and 2022 competition datasets (GEX2ADT, CITE-seq); CBMC dataset |
Graph construction with STRING database; SVD for RNA denoising |
Multimodal transformers and graph-based embeddings |
Cell, gene, and protein transformers with graph-based cross-modality aggregation |
Surface protein abundance prediction, multimodal integration |
Generalization to unseen modalities and datasets |
Attention maps, RMSE, MAE, Pearson correlation coefficient |
| SpatialDiffusion |
SpatialDiffusion: Predicting Spatial Transcriptomics with Denoising Diffusion Probabilistic Models. Sumeer Ahmad Khan et al. bioRxiv (2024) |
Not available |
Spatial transcriptomics |
MERFISH (12 slices, mouse hypothalamic preoptic region, ~73,655 spots, 161 genes); STARmap (mouse visual cortex, 984 spots, 1,020 genes); DLPFC (human dorsolateral prefrontal cortex, ~3,431 spots, 3,000 genes) |
Embedding and linear transformations of spatial and cell-type features |
Diffusion embeddings for spatial relationships; contextualized latent representations |
Denoising Diffusion Probabilistic Model (DDPM) with enhanced embeddings |
In silico slice interpolation, transcriptomic profile reconstruction |
Cross-slice interpolation, structure preservation across regions |
Spearman correlation, neighborhood enrichment, normalized MSE |
| TransformerST |
Innovative super-resolution in spatial transcriptomics: a transformer model exploiting histology images and spatial gene expression. Chongyue Zhao et al. Briefings in Bioinformatics (2024) |
GitHub Repository |
Spatial transcriptomics, histology images |
Human dorsolateral prefrontal cortex (LIBD), melanoma, IDC (HER+ breast cancer), mouse lung tissues |
Spot-centric and sliding-window patch extraction; positional encodings |
Vision Transformer for image patches; Graph Transformer for spatial embeddings |
Cross-scale graph network for super-resolution; adaptive graph transformer for clustering |
Tissue identification, gene expression reconstruction at single-cell resolution |
Super-resolution without scRNA-seq references; cross-platform adaptability |
Adjusted Rand Index (ARI), clustering accuracy, UMAP visualizations |
| UCE |
Universal Cell Embedding: A Foundation Model for Cell Biology. Yanay Rosen et al. bioRxiv (2024) |
Not available |
scRNA-seq, protein sequences |
36 million cells, 1,000+ cell types, 300 datasets, 50 tissues, 8 species (e.g., human, mouse, zebrafish) |
Protein embeddings with ESM2, expression-weighted sampling |
Transformer-based embeddings with 33 layers and 650M parameters |
Transformer architecture integrating protein and expression data |
Zero-shot cell type prediction, dataset integration, species-level gene alignment |
Cross-species embedding, atlas-scale cell annotation, disease cell mapping |
UMAP visualizations, silhouette width, adjusted Rand Index |
| scMulan |
scMulan: A Multitask Generative Pre-Trained Language Model for Single-Cell Analysis. Haiyang Bian et al. Research in Computational Molecular Biology (RECOMB) 2024 |
GitHub Repository |
scRNA-seq, multi-omics |
hECA-10M (~10 million human single cells); 42,117 genes with meta-attributes |
Unified c-sentences encoding meta-attributes and expression levels |
Transformer decoder with shuffled token embeddings |
Generative pretraining using masked c-sentences; 368M parameters |
Cell type annotation, batch integration, conditional cell generation |
Zero-shot cell type annotation, batch integration, conditional cell generation |
UMAP visualizations, pseudo-time embeddings, cosine similarity |
| Geneformer |
Transfer learning enables predictions in network biology. Christina V. Theodoris et al. Nature (2023) |
Hugging Face Repository; GitHub Repository |
scRNA-seq |
Genecorpus-30M (29.9M human single-cell transcriptomes); 561 datasets, diverse tissues |
Rank value encoding of transcriptomes; context-aware self-attention |
Transformer encoder (6 layers, 4 attention heads, 256 dimensions) |
Pretrained transformer for contextual embeddings, fine-tuned for network biology tasks |
Gene dosage prediction, chromatin dynamics, cell type annotations, disease modeling |
Context-aware predictions for rare diseases, cross-tissue integration |
Attention maps, in silico perturbation, embedding space clustering |
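
Two tokenization and embedding ideas recur across the rows above: rank-ordered gene "cell sentences" (Cell2Sentence, GenePT-s, and the rank value encoding used by Geneformer) and expression-weighted averages of gene embeddings (GenePT-w, scELMo). The sketch below illustrates both on toy data. It is a minimal illustration, not any model's actual pipeline: the function names `cell_sentence` and `weighted_cell_embedding`, the `top_k` cutoff, the zero-expression filter, and the L2 normalization are all assumptions made for the example, and the identity matrix stands in for real gene embeddings.

```python
import numpy as np

def cell_sentence(expression, gene_names, top_k=5):
    """Return the top_k expressed genes, highest first, as a space-separated string.

    Hypothetical helper: real models typically normalize expression first.
    """
    expression = np.asarray(expression, dtype=float)
    # Stable sort on negated values: descending expression, ties keep input order.
    order = np.argsort(-expression, kind="stable")
    # Drop unexpressed genes so they never enter the sentence.
    ranked = [gene_names[i] for i in order if expression[i] > 0]
    return " ".join(ranked[:top_k])

def weighted_cell_embedding(expression, gene_embeddings):
    """Expression-weighted average of gene embeddings, L2-normalized.

    Hypothetical helper: gene_embeddings has shape (n_genes, dim).
    """
    expression = np.asarray(expression, dtype=float)
    gene_embeddings = np.asarray(gene_embeddings, dtype=float)
    total = expression.sum()
    if total == 0:
        # A cell with no counts has no defined direction; return zeros.
        return np.zeros(gene_embeddings.shape[1])
    vec = expression @ gene_embeddings / total
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Toy cell with four genes; the counts are made up for illustration.
genes = ["CD3E", "MS4A1", "LYZ", "NKG7"]
counts = [0, 7, 2, 5]

sentence = cell_sentence(counts, genes, top_k=3)
print(sentence)  # MS4A1 NKG7 LYZ

# Identity matrix as placeholder embeddings: each gene gets its own axis.
toy_embeddings = np.eye(len(genes))
cell_vec = weighted_cell_embedding(counts, toy_embeddings)
print(cell_vec.round(3))
```

In practice the sentence would be passed to a language model tokenizer, and the gene embeddings would come from a text or protein language model rather than an identity matrix.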