DEcodeTree is a machine learning model designed to predict differential expression statistics from gene regulatory features.
Read our full research paper for more details:
The YTHDF Proteins Shape the Brain Gene Signatures of Alzheimer's Disease
This example demonstrates how to download experimental data from GEO and analyze it with DEcodeTree. We'll use GSE122069, which contains RNA-seq data from TDP-43 knockdown experiments in neuronal cell lines. TDP-43 (TAR DNA-binding protein 43) is a critical RNA-binding protein involved in RNA processing and transport, and its dysfunction is implicated in neurodegenerative diseases like ALS and frontotemporal dementia.
⚠️ Computational Requirements: This analysis uses 16 CPU cores and can take anywhere from 1 hour to several hours to complete, depending on your CPU performance and availability. Plan accordingly and ensure your system has sufficient computational resources.
# Install required packages
install.packages(c("devtools", "tidyverse", "data.table", "limma"))
devtools::install_github("stasaki/DEcodeTree")library(tidyverse)
library(data.table)
library(limma)
library(DEcodeTree)
# Download RNA-seq data from GEO
ftp_url <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE122nnn/GSE122069/suppl/GSE122069%5FRSEM.gene.TPM.txt.gz"
destfile <- "GSE122069_RSEM.gene.TPM.txt.gz"
download.file(ftp_url, destfile, mode = "wb")# Load and process the expression matrix
expr <- fread("GSE122069_RSEM.gene.TPM.txt.gz") %>%
mutate(V1 = gsub("\\|.+", "", V1)) %>% # Remove gene symbols, keep only ENSEMBL IDs
as.data.frame() %>%
column_to_rownames("V1") %>%
as.matrix()
# Filter for ShySy cell line samples
indx <- grepl("ShySy", colnames(expr))
expr_sub <- expr[, indx]
# Filter lowly expressed genes (keep genes with mean TPM > 1)
cat("Number of genes with mean TPM > 1:", sum(rowMeans(expr_sub) > 1), "\n")
indx <- rowMeans(expr_sub) > 1
expr_sub <- expr_sub[indx, ]
# Log2 transform
expr_sub <- log2(expr_sub + 0.5)# Set up experimental design (3 WT vs 3 KO samples)
design <- model.matrix(~ group,
data = data.frame(group = factor(c(rep("WT", 3), rep("KO", 3)),
levels = c("WT", "KO"))))
# Fit linear model using limma
fit <- lmFit(expr_sub, design)
fit <- eBayes(fit)
# Extract differential expression statistics
deg_stat_data <- topTable(fit, coef = "groupKO", number = Inf, adjust.method = "BH") %>%
as.data.frame() %>%
rownames_to_column("gene_id") %>%
dplyr::select(gene_id, logFC, tstat = t)
cat("Generated DE statistics for", nrow(deg_stat_data), "genes\n")
head(deg_stat_data)⏱️ Runtime Warning: The following analysis uses 16 CPU cores and may take 1-4 hours to complete depending on your system's performance. The model performs extensive hyperparameter tuning (64 iterations) and cross-validation across multiple feature types.
# Run DEcodeTree with comprehensive gene regulatory features
runDEcodeTreeModel(
outcome = "tstat", # Predict t-statistics
deg_stat = deg_stat_data, # Our DE results
feature_type = c("base", "RBP", "miRNA", "TF"), # All regulatory features
out_dir = "./results/", # Output directory
n_splits = 5, # Cross-validation splits
metric = c("COR", "RMSE"), # Performance metrics
num_samples = 64, # Hyperparameter tuning iterations
num_cpus = 16 # Parallel processing cores
)| Parameter | Description | Default |
|---|---|---|
outcome |
A character string specifying the outcome variable (e.g., "logFC", "tstat"). | "tstat" |
deg_stat |
A data frame containing differential expression statistics. | Required |
feature_type |
Features to include ("base", "RBP", "miRNA", "TF", "cell_type"). |
c("base") |
out_dir |
Output directory for results. | "./results/" |
n_splits |
Number of data splits for cross-validation. | 5 |
metric |
Performance metrics ("COR", "RMSE"). |
c("COR", "RMSE") |
measures |
List of measures for performance evaluation compatible with mlr. |
Default measures |
cv_folds |
Number of folds for cross-validation. | 5 |
num_samples |
Number of hyperparameter tuning iterations. | 64 |
num_cpus |
Number of CPUs for parallel processing. | 4 |
param_values |
Default hyperparameter values for XGBoost models. | Predefined list |
param_space |
Parameter set defining the hyperparameter optimization space. | Predefined set |
save_model |
Whether to save trained models. | TRUE |
- Evaluation Results: Summary of model performance metrics (e.g., RMSE, correlation).
- SHAP Values: Feature importance values for interpretability.
- Plots: Visualizations of model performance and SHAP scores.
- Trained Models: Serialized XGBoost models for future use.
The code for this project is licensed under the BSD 3-Clause License. Gene regulatory features were prepared using the following databases. Please ensure compliance with the respective licenses for each data source:
- Transcription factor binding targets: GTRD
- RNA-binding protein binding targets: POSTAR2
- miRNA binding targets: TargetScan