
fishrscale/SNITCH


SNITCH: Semi-supervised Non-linear Identification and Trajectory Clustering for High-dimensional Data

Cite

If you’re using SNITCH in your work, please cite:

Sex-specific non-linear DNA methylation trajectories across aging predict cancer risk and systemic inflammation. Robin Grolaux, Macsue Jacques, Bernadette Jones-Freeman, Steve Horvath, Andrew Teschendorff, Nir Eynon. bioRxiv 2025.08.19.671184; doi: https://doi.org/10.1101/2025.08.19.671184


Overview

SNITCH (Semi-supervised Non-linear Identification and Trajectory Clustering for High-dimensional data) is an R package to analyze ageing-related DNA methylation trajectories. It provides a robust, end-to-end workflow to:

  • Classify CpG sites into linear, non-linear (NL), or non-correlated trajectories.
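  • Identify differentially methylated positions (DMPs), variably methylated positions (VMPs), and non-linear DMPs driven by age.
  • Perform functional principal component analysis (FPCA) on smoothed non-linear trajectories to capture complex ageing patterns.
  • Cluster non-linear CpGs with k-means, MFUZZ (fuzzy), or HDBSCAN, and compare results via the Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI).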

SNITCH aims to be efficient, scalable, and flexible. Try it on your dataset!


Installation

You can install the development version of SNITCH from GitHub with:

# install.packages("pak")
pak::pak("fishrscale/SNITCH")

Or using devtools:

if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("fishrscale/SNITCH")

The clustering demo below also uses the CRAN packages dbscan, factoextra, ggrepel, aricode, dplyr, tibble, and ggplot2, plus the Bioconductor packages Biobase and Mfuzz. Install the Bioconductor dependencies with:

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("Biobase", "Mfuzz"))
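The CRAN packages used by the clustering demo can be installed in a similar way. The snippet below is a minimal sketch that installs only the packages not already present; the variable names and the choice of mirror are illustrative assumptions, not part of SNITCH.

```r
# CRAN packages used by the clustering demo
cran_pkgs <- c("dbscan", "factoextra", "ggrepel", "aricode",
               "dplyr", "tibble", "ggplot2")

# Keep only the packages that are not installed yet
is_installed <- vapply(cran_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- cran_pkgs[!is_installed]

# Install whatever is missing (the mirror URL is an assumption; use your preferred one)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```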

Quick Start

1️⃣ Simulate DNA Methylation Data

Generate a dataset with simulated methylation values for multiple CpG sites following different ageing-related trajectories.

library(SNITCH)

# Simulate data with 300 individuals
simulated_data <- simulate_methylation_data(n_people = 300, plot = TRUE)

# Extract data frames
ages_df        <- simulated_data$ages
groups_df      <- simulated_data$groups
methylation_df <- simulated_data$meth

This will generate and save the following files in a new “Results” folder within the current working directory:

  • functions_sim_data.pdf: Plots of the predefined methylation patterns.
  • sample_sim_data.pdf: Visualization of the simulated CpG sites.
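For intuition about what "different ageing-related trajectories" can look like, here is a toy sketch in base R. It is not the code behind simulate_methylation_data(): the three trajectory shapes, the noise level, and all object names are assumptions chosen for illustration.

```r
set.seed(123)

n_people <- 300
age <- runif(n_people, min = 20, max = 80)

# Hypothetical trajectory shapes, each mapping age to a mean methylation level
trajectories <- list(
  linear    = function(a) 0.30 + 0.004 * a,
  quadratic = function(a) 0.40 + 0.0002 * (a - 50)^2,
  flat      = function(a) rep(0.50, length(a))
)

# One noisy CpG per trajectory, clipped to the valid beta-value range [0, 1]
toy_meth <- sapply(trajectories, function(f) {
  pmin(pmax(f(age) + rnorm(n_people, sd = 0.03), 0), 1)
})

dim(toy_meth)       # 300 individuals x 3 CpGs
colnames(toy_meth)  # "linear" "quadratic" "flat"
```

Plotting any column of toy_meth against age shows the shape that a classifier would later have to recognize.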

2️⃣ Classify CpG Sites

Determine which CpG sites follow linear, non-linear, or non-correlated trajectories.

scaled_data <- prepare_data(data = t(methylation_df), age = ages_df$Age)

classified_cpgs <- run_parallel_classification(dat_scaled = scaled_data$dat_scaled, 
                                               age = scaled_data$Age, 
                                               ages_grid = scaled_data$ages_grid)

head(classified_cpgs)
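To make the three categories concrete, the sketch below labels a single CpG with nested F-tests on ordinary least-squares fits. This is only an illustration of the idea and should not be read as the procedure used by run_parallel_classification(), which this README does not describe. The function classify_toy(), the threshold alpha, and the labels "Linear" and "NC" are hypothetical.

```r
library(splines)  # ships with the standard R distribution

# Toy rule (NOT SNITCH's algorithm): compare nested models with F-tests
classify_toy <- function(y, age, alpha = 1e-3) {
  fit_null   <- lm(y ~ 1)
  fit_linear <- lm(y ~ age)
  fit_spline <- lm(y ~ ns(age, df = 4))  # natural cubic spline; nests the linear fit

  p_nonlinear <- anova(fit_linear, fit_spline)[2, "Pr(>F)"]
  p_linear    <- anova(fit_null, fit_linear)[2, "Pr(>F)"]

  if (p_nonlinear < alpha) {
    "NL"      # the spline explains more than a straight line
  } else if (p_linear < alpha) {
    "Linear"  # a straight line explains more than a constant
  } else {
    "NC"      # no detectable association with age
  }
}

# Three toy CpGs with a curved, a linear, and a flat age relationship
set.seed(1)
age <- seq(20, 80, length.out = 200)
noise <- function() rnorm(length(age), sd = 0.02)
toy <- list(
  cpg_curved = 0.4 + 0.0004 * (age - 50)^2 + noise(),
  cpg_trend  = 0.3 + 0.005 * age + noise(),
  cpg_flat   = 0.5 + noise()
)
labels <- vapply(toy, classify_toy, character(1), age = age)
print(labels)
```

A real analysis over many CpGs would also need multiple-testing correction, which this sketch leaves out.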

3️⃣ Perform FPCA on Smoothed Non-Linear CpGs

# Select only non-linear CpGs for FPCA
is_nl           <- grepl("NL", classified_cpgs$classification)
nl_smooth_rows  <- classified_cpgs[is_nl, ]
non_linear_cpgs <- nl_smooth_rows$CpG
# Stack the per-CpG smoothed predictions into one matrix (rows = CpGs)
nl_smooth       <- do.call(rbind, nl_smooth_rows$Predictions)
rownames(nl_smooth) <- non_linear_cpgs

# Perform FPCA
fpca_results <- perform_fpca(nl_var_smooth = nl_smooth, ages_grid = scaled_data$ages_grid)

# Plot FPCA results (files are saved by the function)
plot_fpca_results(fpca_results, ages_grid = scaled_data$ages_grid)
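When all curves are evaluated on one shared, dense age grid, FPCA is closely related to ordinary PCA of the centered curve matrix. The base-R sketch below uses that shortcut on toy curves to show how each trajectory becomes a short vector of scores. It is an approximation for intuition only, not the implementation of perform_fpca(); the toy curves, the 99% variance cut-off, and all object names are assumptions.

```r
set.seed(42)
ages_grid <- seq(20, 80, length.out = 50)

# Toy smoothed curves: rows = CpGs, columns = points on the age grid
u_shaped <- t(replicate(10, runif(1, 0.5, 1.5) * ((ages_grid - 50) / 30)^2))
wavy     <- t(replicate(10, runif(1, 0.5, 1.5) * sin(ages_grid / 10)))
curves   <- rbind(u_shaped, wavy)

# PCA of the centered curves stands in for FPCA on a common grid
pca <- prcomp(curves, center = TRUE, scale. = FALSE)

# Keep the smallest number of components explaining at least 99% of the variance
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
n_comp <- which(cumsum(var_explained) >= 0.99)[1]

# One row of scores per CpG; these are the kind of features a clustering step consumes
scores <- pca$x[, seq_len(n_comp), drop = FALSE]
dim(scores)
```

Because the toy curves are built from only two shapes, at most two components are needed here; real data will usually need more.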

4️⃣ Cluster NL CpGs (k-means, MFUZZ, HDBSCAN) + Compare (ARI/AMI)

Use your favorite unsupervised clustering strategy to group non-linear CpGs using their FPCA scores. Below we illustrate three common options: k-means, fuzzy clustering (MFUZZ), and HDBSCAN, and compare them with Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI) against the ground truth labels. Use the diagnostics to pick sensible parameters for your data.

library(dbscan)     # HDBSCAN
library(Mfuzz)      # Fuzzy c-means
library(Biobase)    # ExpressionSet
library(factoextra) # k-means diagnostics
library(ggplot2)
library(ggrepel)
library(aricode)    # ARI/AMI
library(dplyr)
library(tibble)

# Ground truth from simulation
truth <- groups_df$Group

# We'll update copies of 'classified_cpgs' so we can compare methods cleanly
class_work <- classified_cpgs
nl_mask    <- grepl("NL", class_work$classification)  # same selection as the FPCA input

# Container for method comparison
df_comp <- tibble(Method = character(), ARI = double(), AMI = double())

add_metrics <- function(pred_labels, method_label) {
  tibble(
    Method = method_label,
    ARI    = aricode::ARI(pred_labels, truth),
    AMI    = aricode::AMI(pred_labels, truth)
  )
}

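# A) MFUZZ
set.seed(1)  # fuzzy c-means starts from random centers; fix the seed for reproducibility
eset  <- new("ExpressionSet", exprs = as.matrix(fpca_results$scores))
m_opt <- mestimate(eset)  # estimate the fuzzifier m from the data

# Dmin() draws with base graphics, so capture the plot with pdf()/dev.off() rather than ggsave()
pdf("SNITCH_fuzzy_distance.pdf", width = 6, height = 4)
tmp <- Dmin(eset, m_opt, crange = seq(2, 20, 1), repeats = 3, visu = TRUE)
dev.off()

c_opt <- 11  # value used for this demo; pick yours from the Dmin plot
mfuzz_fit <- mfuzz(eset, c = c_opt, m = m_opt)
# Hard-assign each CpG to the cluster with its highest membership
fuzzy_assign <- apply(mfuzz_fit$membership, 1, which.max)
pred_fuzzy   <- paste0("NL_", fuzzy_assign)
cw_fuzzy <- class_work; cw_fuzzy$classification[nl_mask] <- pred_fuzzy
df_comp  <- dplyr::bind_rows(df_comp, add_metrics(cw_fuzzy$classification, "SNITCH + Fuzzy"))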

# B) k-means
p_elbow <- fviz_nbclust(fpca_results$scores, kmeans, k.max = 20, method = "wss") +
  labs(title = "Elbow Method for K-Means",
       x = "Number of Clusters (K)",
       y = "Total Within-Cluster Sum of Squares (WCSS)") +
  theme_minimal()
ggplot2::ggsave("SNITCH_kmeans_elbow.pdf", p_elbow, width = 6, height = 4)

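k_opt  <- 8  # value used for this demo; pick yours from the elbow plot
set.seed(1)  # k-means uses random starts; fix the seed for reproducibility
km_fit <- kmeans(as.matrix(fpca_results$scores), centers = k_opt, nstart = 25)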
pred_km <- paste0("NL_", km_fit$cluster)
cw_km <- class_work; cw_km$classification[nl_mask] <- pred_km
df_comp <- dplyr::bind_rows(df_comp, add_metrics(cw_km$classification, "SNITCH + K-Means"))

# C) HDBSCAN
hdb_fit <- hdbscan(as.matrix(fpca_results$scores), minPts = 5)
# HDBSCAN labels noise points as cluster 0, so "NL_0" collects CpGs left unclustered
pred_hdb <- paste0("NL_", hdb_fit$cluster)
cw_hdb <- class_work; cw_hdb$classification[nl_mask] <- pred_hdb
df_comp <- dplyr::bind_rows(df_comp, add_metrics(cw_hdb$classification, "SNITCH + HDBSCAN"))

# D) Compare methods
p_comp <- ggplot(df_comp, aes(x = ARI, y = AMI, label = Method)) +
  geom_point(size = 3) +
  geom_text_repel(size = 3) +
  coord_equal() +
  labs(title = "Clustering Agreement on Simulated Data",
       x = "Adjusted Rand Index (ARI)",
       y = "Adjusted Mutual Information (AMI)") +
  theme_minimal()

print(df_comp)
print(p_comp)
ggplot2::ggsave("SNITCH_clustering_ari_ami.pdf", p_comp, width = 6, height = 4)
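To see what the ARI column measures, the sketch below computes the Adjusted Rand Index directly from the contingency table of two labelings, following the standard pair-counting definition. The workflow above relies on aricode::ARI(); the helper ari_by_hand() and the toy labels are hypothetical and exist only to illustrate the calculation.

```r
# Adjusted Rand Index from the contingency table of two labelings
ari_by_hand <- function(pred, truth) {
  tab <- table(pred, truth)
  n_pairs <- function(x) sum(choose(x, 2))  # number of item pairs within each count

  sum_cells <- n_pairs(tab)           # pairs grouped together in both labelings
  sum_rows  <- n_pairs(rowSums(tab))  # pairs grouped together in 'pred'
  sum_cols  <- n_pairs(colSums(tab))  # pairs grouped together in 'truth'

  expected  <- sum_rows * sum_cols / choose(sum(tab), 2)  # agreement expected by chance
  max_index <- (sum_rows + sum_cols) / 2

  (sum_cells - expected) / (max_index - expected)
}

toy_truth <- c("A", "A", "B", "B", "C", "C")

# Same partition under different cluster names: ARI is 1
ari_by_hand(c("x", "x", "y", "y", "z", "z"), toy_truth)

# Two clusters that cut across the three true groups: ARI is about 0.24
ari_by_hand(c("x", "x", "x", "y", "y", "y"), toy_truth)
```

The first call shows why the "NL_1", "NL_2", ... names assigned above do not need to match the ground-truth group names: only the grouping matters, not the labels.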

Why eval = FALSE above?
The Quick Start chunks above are set to eval = FALSE in the README source. They mirror the full workflow but are skipped when the README is knit, which keeps the build fast and avoids heavy optional dependencies. See the Demo figures below for runnable, lightweight chunks that generate images you can commit.


Demo figures

A) Example simulated trajectories


Contributing

Issues and PRs are welcome! Please see the issue tracker: https://github.com/fishrscale/SNITCH/issues.

License

Apache License 2.0. See LICENSE for details.


🚀 SNITCH — Bringing Non-Linear insights to your analysis.
