SNITCH: Semi-supervised Non-linear Identification and Trajectory Clustering for High-dimensional Data
If you’re using SNITCH in your work, please cite:
Robin Grolaux, Macsue Jacques, Bernadette Jones-Freeman, Steve Horvath, Andrew Teschendorff, Nir Eynon. Sex-specific non-linear DNA methylation trajectories across aging predict cancer risk and systemic inflammation. bioRxiv 2025.08.19.671184; doi: https://doi.org/10.1101/2025.08.19.671184
SNITCH (Semi-supervised Non-linear Identification and Trajectory Clustering for High-dimensional data) is an R package to analyze ageing-related DNA methylation trajectories. It provides a robust, end-to-end workflow to:
- Classify CpG sites into linear, non-linear (NL), or non-correlated trajectories.
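- Identify differentially methylated positions (DMPs), variably methylated positions (VMPs), and non-linear DMPs driven by age.
- Perform functional principal component analysis (FPCA) on smoothed non-linear trajectories to capture complex ageing patterns.
- Cluster non-linear CpGs with k-means, Mfuzz (fuzzy c-means), or HDBSCAN, and compare results via the Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI).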
SNITCH aims to be efficient, scalable, and flexible: Try it on your dataset!
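As rough intuition for the FPCA step in the list above, the sketch below smooths a handful of simulated CpG trajectories onto a common age grid and runs an ordinary PCA on the smoothed curves. It uses base R only; the trajectory shapes, the `loess` smoother, and all variable names are assumptions made for illustration and are not SNITCH's implementation.

```r
# Illustrative only: approximate FPCA by running PCA on smoothed curves
# evaluated on a common age grid. This is not SNITCH's implementation.
set.seed(1)
age  <- sort(runif(200, 20, 90))                    # simulated ages
grid <- seq(min(age), max(age), length.out = 50)    # grid inside the observed range

# Two hypothetical non-linear shapes, 10 noisy CpGs each
u_shape   <- function(a) ((a - 55) / 35)^2
late_rise <- function(a) pmax(0, (a - 60) / 30)
make_cpg  <- function(f) f(age) + rnorm(length(age), sd = 0.3)
cpgs <- c(replicate(10, make_cpg(u_shape),   simplify = FALSE),
          replicate(10, make_cpg(late_rise), simplify = FALSE))

# Smooth each CpG against age and evaluate on the grid (rows = CpGs)
smooth_one <- function(y) {
  predict(loess(y ~ age, span = 0.75), data.frame(age = grid))
}
smoothed <- t(vapply(cpgs, smooth_one, numeric(length(grid))))

# PCA across CpGs: each CpG gets a few scores summarising its curve shape
pca <- prcomp(smoothed, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Each row of `scores` is a low-dimensional summary of one CpG's trajectory, which is the kind of input a clustering step can then work on.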
You can install the development version of SNITCH from GitHub with:
# install.packages("pak")
pak::pak("fishrscale/SNITCH")
Or using devtools:
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("fishrscale/SNITCH")
The clustering demo below also uses dbscan, factoextra, ggrepel, aricode, dplyr, tibble, ggplot2, and the Bioconductor packages Biobase and Mfuzz. Install the Bioconductor dependencies with:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("Biobase", "Mfuzz"))
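To see which of these optional packages are already available before running the clustering demo, a small check along the following lines may help. It only reports what is missing and installs nothing; the variable names are illustrative.

```r
# Packages named in the installation notes above
cran_pkgs <- c("dbscan", "factoextra", "ggrepel", "aricode",
               "dplyr", "tibble", "ggplot2")
bioc_pkgs <- c("Biobase", "Mfuzz")

# requireNamespace() returns TRUE/FALSE without attaching the package
is_installed <- vapply(c(cran_pkgs, bioc_pkgs), requireNamespace,
                       logical(1), quietly = TRUE)
missing_pkgs <- names(is_installed)[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All optional clustering dependencies are installed.")
}
```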
Generate a dataset with simulated methylation values for multiple CpG sites following different ageing-related trajectories.
library(SNITCH)
# Simulate data with 300 individuals
simulated_data <- simulate_methylation_data(n_people = 300, plot = TRUE)
# Extract data frames
ages_df <- simulated_data$ages
groups_df <- simulated_data$groups
methylation_df <- simulated_data$meth
This will generate and save the following files in a new “Results” folder within the current working directory:
- functions_sim_data.pdf: Plots of the predefined methylation patterns.
- sample_sim_data.pdf: Visualization of the simulated CpG sites.
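For readers who want a feel for this kind of data without running the package, here is a minimal base-R sketch that builds a small methylation matrix from a few predefined age trajectories. The shapes, noise level, object names, and column names are assumptions for illustration; they are not what `simulate_methylation_data()` does internally.

```r
# Hypothetical stand-in for simulated methylation data (not the SNITCH simulator)
set.seed(42)
n_people <- 300
age <- round(runif(n_people, 18, 90))

# Predefined trajectory shapes on a 0-1 methylation (beta value) scale
shapes <- list(
  linear_up = function(a) 0.2 + 0.005 * (a - 18),
  u_shape   = function(a) 0.3 + 0.4 * ((a - 54) / 36)^2,
  flat      = function(a) rep(0.5, length(a))
)

# One noisy CpG per call, clipped to the valid [0, 1] range
simulate_cpg <- function(f, sd = 0.05) {
  pmin(pmax(f(age) + rnorm(length(age), sd = sd), 0), 1)
}

# Individuals in rows, CpGs in columns
n_per_shape <- 5
meth <- do.call(cbind, lapply(names(shapes), function(nm) {
  m <- replicate(n_per_shape, simulate_cpg(shapes[[nm]]))
  colnames(m) <- paste0(nm, "_", seq_len(n_per_shape))
  m
}))

# Ground-truth shape label for every simulated CpG
groups <- data.frame(CpG = colnames(meth),
                     Group = rep(names(shapes), each = n_per_shape))
```

Keeping the true shape label for each CpG, as in `groups`, is what makes it possible to score clustering results against a known answer later on.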
Determine which CpG sites follow linear, non-linear, or non-correlated trajectories.
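Conceptually, one simple way to make this three-way call for a single CpG is to compare nested regression models: an intercept-only model, a linear model in age, and a natural-spline model in age. The sketch below does that with base R and the `splines` package. It is an illustration under stated assumptions (the function name `classify_cpg`, the spline degrees of freedom, and the 0.001 threshold are all hypothetical) and should not be read as the algorithm SNITCH uses.

```r
library(splines)  # ships with base R

# Illustrative classifier: flat vs. linear vs. spline fit, compared by F-tests.
# The threshold and labels are assumptions, not SNITCH defaults.
classify_cpg <- function(y, age, alpha = 0.001) {
  fit_null   <- lm(y ~ 1)
  fit_linear <- lm(y ~ age)
  fit_spline <- lm(y ~ ns(age, df = 4))

  # Is there any association with age at all?
  p_any <- anova(fit_null, fit_spline)[["Pr(>F)"]][2]
  # Does the spline explain more than a straight line does?
  p_nl  <- anova(fit_linear, fit_spline)[["Pr(>F)"]][2]

  if (p_any >= alpha) "NC" else if (p_nl < alpha) "NL" else "Linear"
}

# Toy usage on three simulated CpGs
set.seed(7)
age   <- runif(300, 20, 90)
y_lin <- 0.3 + 0.005 * age + rnorm(300, sd = 0.05)
y_nl  <- 0.3 + 0.4 * ((age - 55) / 35)^2 + rnorm(300, sd = 0.05)
y_nc  <- 0.5 + rnorm(300, sd = 0.05)

labels <- c(linear    = classify_cpg(y_lin, age),
            nonlinear = classify_cpg(y_nl, age),
            flat      = classify_cpg(y_nc, age))
```

In a genome-wide setting the p-values would also need multiple-testing correction, which this sketch leaves out. The package's own classification call follows.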
scaled_data <- prepare_data(data = t(methylation_df), age = ages_df$Age)
classified_cpgs <- run_parallel_classification(dat_scaled = scaled_data$dat_scaled,
                                               age = scaled_data$Age,
                                               ages_grid = scaled_data$ages_grid)
head(classified_cpgs)
# Select only non-linear CpGs for FPCA
non_linear_cpgs <- classified_cpgs[grep("NL", classified_cpgs$classification),]$CpG
nl_smooth_rows <- classified_cpgs[grep("NL", classified_cpgs$classification),]
nl_smooth <- do.call(rbind, nl_smooth_rows$Predictions)
rownames(nl_smooth) <- non_linear_cpgs
# Perform FPCA
fpca_results <- perform_fpca(nl_var_smooth = nl_smooth, ages_grid = scaled_data$ages_grid)
# Plot FPCA results (files are saved by the function)
plot_fpca_results(fpca_results, ages_grid = scaled_data$ages_grid)
Use your favorite unsupervised clustering strategy to group non-linear CpGs using their FPCA scores. Below we illustrate three common options: k-means, fuzzy clustering (Mfuzz), and HDBSCAN, and compare them against the ground-truth labels using the Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI). Use the diagnostics to pick sensible parameters for your data.
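Before the full demo, it can help to see what the ARI actually measures. The sketch below computes it from the contingency table of two labelings in base R and applies it to a toy k-means result on synthetic 2-D scores. The helper name `adjusted_rand_index` and the toy data are illustrative assumptions; the demo itself relies on `aricode::ARI()`.

```r
# Adjusted Rand Index from the contingency table of two labelings.
# Undefined (division by zero) if both labelings put everything in one cluster.
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)
  sum_choose2 <- function(x) sum(choose(x, 2))
  total_pairs <- choose(sum(tab), 2)
  index     <- sum_choose2(tab)
  expected  <- sum_choose2(rowSums(tab)) * sum_choose2(colSums(tab)) / total_pairs
  max_index <- (sum_choose2(rowSums(tab)) + sum_choose2(colSums(tab))) / 2
  (index - expected) / (max_index - expected)
}

set.seed(123)
# Toy 2-D "scores": three well-separated groups of 30 points each
truth_toy   <- rep(c("A", "B", "C"), each = 30)
group_means <- rbind(c(0, 0), c(5, 5), c(-5, 5))
scores_toy  <- group_means[rep(1:3, each = 30), ] +
  matrix(rnorm(180, sd = 0.5), ncol = 2)

km_toy <- kmeans(scores_toy, centers = 3, nstart = 25)

ari_kmeans   <- adjusted_rand_index(km_toy$cluster, truth_toy)
ari_shuffled <- adjusted_rand_index(sample(truth_toy), truth_toy)
```

The index is invariant to how clusters are numbered, equals 1 for identical partitions, and sits near 0 for random labelings, which is why it suits comparing methods whose cluster IDs are arbitrary.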
library(dbscan) # HDBSCAN
library(Mfuzz) # Fuzzy c-means
library(Biobase) # ExpressionSet
library(factoextra) # k-means diagnostics
library(ggplot2)
library(ggrepel)
library(aricode) # ARI/AMI
library(dplyr)
library(tibble)
# Ground truth from simulation
truth <- groups_df$Group
# We'll update copies of 'classified_cpgs' so we can compare methods cleanly
class_work <- classified_cpgs
# Use the same "NL" pattern match as the FPCA row selection so lengths agree
nl_mask <- grepl("NL", class_work$classification)
# Container for method comparison
df_comp <- tibble(Method = character(), ARI = double(), AMI = double())
add_metrics <- function(pred_labels, method_label) {
  tibble(
    Method = method_label,
    ARI = aricode::ARI(pred_labels, truth),
    AMI = aricode::AMI(pred_labels, truth)
  )
}
# A) MFUZZ
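eset <- new("ExpressionSet", exprs = as.matrix(fpca_results$scores))
m_opt <- mestimate(eset)
# Dmin() draws with base graphics, so save the plot with pdf()/dev.off() rather than ggsave()
pdf("SNITCH_fuzzy_distance.pdf", width = 6, height = 4)
tmp <- Dmin(eset, m_opt, crange = seq(2, 20, 1), repeats = 3, visu = TRUE)
dev.off()
c_opt <- 11 # number of clusters; adjust for your data using the Dmin plot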
mfuzz_fit <- mfuzz(eset, c = c_opt, m = m_opt)
fuzzy_assign <- apply(mfuzz_fit$membership, 1, which.max)
pred_fuzzy <- paste0("NL_", fuzzy_assign)
cw_fuzzy <- class_work; cw_fuzzy$classification[nl_mask] <- pred_fuzzy
df_comp <- dplyr::bind_rows(df_comp, add_metrics(cw_fuzzy$classification, "SNITCH + Fuzzy"))
# B) k-means
p_elbow <- fviz_nbclust(fpca_results$scores, kmeans, k.max = 20, method = "wss") +
  labs(title = "Elbow Method for K-Means",
       x = "Number of Clusters (K)",
       y = "Total Within-Cluster Sum of Squares (WCSS)") +
  theme_minimal()
ggplot2::ggsave("SNITCH_kmeans_elbow.pdf", p_elbow, width = 6, height = 4)
k_opt <- 8 # number of clusters; adjust for your data using the elbow plot
km_fit <- kmeans(as.matrix(fpca_results$scores), centers = k_opt, nstart = 25)
pred_km <- paste0("NL_", km_fit$cluster)
cw_km <- class_work; cw_km$classification[nl_mask] <- pred_km
df_comp <- dplyr::bind_rows(df_comp, add_metrics(cw_km$classification, "SNITCH + K-Means"))
# C) HDBSCAN
hdb_fit <- hdbscan(as.matrix(fpca_results$scores), minPts = 5)
# HDBSCAN assigns noise points to cluster 0, so "NL_0" collects unclustered CpGs
pred_hdb <- paste0("NL_", hdb_fit$cluster)
cw_hdb <- class_work; cw_hdb$classification[nl_mask] <- pred_hdb
df_comp <- dplyr::bind_rows(df_comp, add_metrics(cw_hdb$classification, "SNITCH + HDBSCAN"))
# D) Compare methods
p_comp <- ggplot(df_comp, aes(x = ARI, y = AMI, label = Method)) +
  geom_point(size = 3) +
  geom_text_repel(size = 3) +
  coord_equal() +
  labs(title = "Clustering Agreement on Simulated Data",
       x = "Adjusted Rand Index (ARI)",
       y = "Adjusted Mutual Information (AMI)") +
  theme_minimal()
print(df_comp)
print(p_comp)
ggplot2::ggsave("SNITCH_clustering_ari_ami.pdf", p_comp, width = 6, height = 4)
Why eval = FALSE above?
The Quick Start sections mirror the full workflow but are turned off during README knit to keep it fast and avoid heavy optional dependencies. See the Demo figures below for runnable, lightweight chunks that generate images you can commit.
Issues and PRs are welcome! Please see the issue tracker: https://github.com/fishrscale/SNITCH/issues.
Apache License 2.0. See LICENSE for details.
🚀 SNITCH — Bringing Non-Linear insights to your analysis.
