This repository contains R scripts for analyzing cell-free DNA (cfDNA) samples, specifically focusing on identifying features that can distinguish between ALS (Amyotrophic Lateral Sclerosis) and Control samples. The analysis involves insert size distribution, fragment end motif analysis, fragment coverage over genomic features, and a classification model to predict disease status.
The provided data is a downsampled subset of chr21 from a larger dataset, processed for reduced file size to facilitate quick analysis.
.
├── Analysis/
│ ├── functions.R
│ ├── InitialAnalysis.R (Provided - not directly run, its logic is integrated into run_analysis.R)
│ ├── run_analysis.R
│ └── setup.R
├── celfie/ (Data directory - to be placed at the same level as Analysis)
│ ├── bam/
│ │ ├──
The celfie data folder can be downloaded from the data provided here. The data is a subset of data taken from this paper and contains 12 ALS samples and 10 Control samples. Data is processed and downsampled, and only selected for chr21 to reduce file size.
To run this analysis, you need R installed.
The necessary R packages can be installed using the setup.R script.
-
Clone the Repository (or download files): Ensure you have the
Analysis/folder containingfunctions.R,run.R, andsetup.R. -
Place Data: Create a
celfie/directory at the same level as yourAnalysis/directory. Insidecelfie/, place:bam/directory containing all.bamfiles.celfie_cfDNA_ss.csv(metadata file).cpgIslandExt.bed(CpG island annotations).
Your directory structure should look like this:
. ├── Analysis/ │ ├── functions.R │ ├── run_analysis.R │ └── setup.R └── celfie/ ├── bam/ ├── celfie_cfDNA_ss.csv └── cpgIslandExt.bedcpgIslandExt.bed can be downloaded from here
-
Install R Packages: Open an R console or RStudio, navigate to the
Analysis/directory (or set your working directory toAnalysis/), and run the setup script:setwd("/path/to/your/Analysis/directory") source("setup.R")
This script will install all required Bioconductor and CRAN packages. This step may take some time depending on your internet connection and if packages are already installed.
After setting up the environment and data, you can run the main analysis script:
-
Open R/RStudio: Ensure your working directory is set to the
Analysis/folder. -
Execute the Analysis Script:
source("run_analysis.R")
The run_analysis.R script will: * Read metadata and BAM file paths.
* Perform insert size analysis, generating distribution plots and statistics.
* Analyze mono-nucleosomal and di-nucleosomal fragment proportions.
* Calculate statistical moments (skewness, kurtosis) of insert sizes.
* Analyze short fragment enrichment.
* Extract fragments overlapping genomic features (CpG islands, promoters) and analyze their insert sizes.
* Calculate short/long fragment ratios across genomic bins on chr21.
* Extract and compare fragment end motifs (5' and 3' ends) between groups, generating sequence logos.
* Analyze fragment end distributions around genomic features.
* Perform binary classification (ALS vs. Control) using Logistic Regression and Random Forest models based on derived features, and report classification metrics (precision, sensitivity, F1-score, AUC-ROC).
All generated plots (PDF and PNG) and summary CSV files will be saved in the Figures/ directory.
Key output files include:
InsertSizeDensity_bySample.pdf: Histogram of insert sizes per sample.
* InsertSizeDensity_byGroup.pdf: Density plot of insert sizes by group.
* MononucleosomalProportionCLEAN_boxplot.pdf: Boxplot of mono-nucleosomal fragment proportions.
* Mono_Di_Ratio_boxplot.pdf: Boxplot of mono/di-nucleosomal fragment ratios.
* fragment_length_summary_statistics_chr21.csv: CSV with mean, median, SD, skewness, and kurtosis FragmentLengthMoments_boxplots.pdf: Boxplots of statistical moments.
* ShortInsertSizeDensity_byGroup.pdf: Density plot of short insert sizes by group.
* ShortFragmentEnrichment_boxplot.pdf: Boxplot of short fragment proportions.
* InsertSizeDistribution_Features_chr21.pdf: Insert size distributions within CpG islands and promoters.
ShortLongRatio_Chr21_Bins.pdf: Short/long fragment ratio across chr21 in 1Mb bins.
* Motif_5primeDE.csv & Motif_3primeDE.csv: Differentially enriched fragment end motifs.
* top_*_motifs_6bp.png: Sequence logo plots for top enriched motifs.
* fragment_end_distributions_chr21.png: Fragment end position distributions around features.
* fragment_end_distributions_significant_bins_chr21.png: Same, with statistically significant bins highlighted.
roc_curve_glm.pdf, roc_curve_rf.pdf, roc_curves_combined.pdf: ROC curves for classification models.
rf_feature_importance.png: Feature importance plot from Random Forest.
* classification_performance_metrics.csv: CSV summarizing classification performance metrics.
The provided data is downsampled to chr21 for faster execution. For full genomic datasets (e.g., 50GB BAM files), There will be small changes needed to run the code. Particularly removing the hard coded chr21 portions. Currently this can scale with many additional BAM files as long as the csv metadata is updated Additionally i would alter functions in the code in the following ways:
- Parallel Processing: Utilizing
BiocParallelforscanBamorreadGAlignmentPairsoperations across multiple BAM files - Streaming/Chunking: Processing BAM files in chunks rather than loading the entire file into memory
- High-Performance Computing (HPC): Running on clusters with sufficient RAM and CPU cores.
- Database Integration: For extremely large datasets, loading processed data into a genomic database for querying.
For any questions or issues, please refer to the R script comments or email tawaunl@gmail.com
This document provides a high-level overview of the analytical steps performed to identify potential markers differentiating ALS from Control samples. The analysis leverages various aspects of fragmentomics to uncover distinguishing patterns.
We start by organizing the input data, identifying the paths to the BAM files and loading a metadata file that categorizes each sample as either ALS or Control. This ensures that samples are correctly grouped for comparative analysis.
Insert size is a crucial characteristic of cfDNA. This section explores how the lengths of DNA fragments differ between the ALS and Control groups.
- Distribution Analysis: The distribution of fragment lengths is examined for all samples, both individually and aggregated by disease group. This helps visualize overall patterns.
- Nucleosomal Fragment Proportions: Specific attention is paid to fragments corresponding to mono-nucleosomal (around 150-180 base pairs) and di-nucleosomal (around 300-380 base pairs) lengths. The proportions of these fragment types are calculated and statistically compared between groups to identify potential differences related to nucleosome packaging.
- Statistical Moments: Statistical measures like skewness and kurtosis are calculated for fragment distributions. These metrics describe the shape of the distribution, which can reveal subtle but significant differences between the groups.
- Short Fragment Enrichment: The proportion of very short fragments (e.g., 85-105 base pairs) is specifically investigated, as these can be indicative of particular biological processes or disease states.
This part of the analysis investigates how cfDNA fragments are distributed across specific functional regions of the genome, such as CpG islands and gene promoters on chromosome 21. This can provide insights into chromatin accessibility and gene regulation.
- Fragment Overlap with Features: Fragments that overlap with known genomic features are identified, and insert size distributions within these regions are analyzed and compared between ALS and Control samples.
- Regional Fragment Ratios: The ratio of short fragments to long fragments is calculated across large genomic bins along chromosome 21. This helps to detect larger-scale regional variations in fragmentation patterns that might be associated with disease.
The specific DNA sequences at the very ends of cfDNA fragments can offer clues about the enzymes that cut the DNA in the body. This section aims to identify and compare these "fragment end motifs."
- Motif Extraction: Short DNA sequences (e.g., 6 base pairs) at both the 5' and 3' ends of the fragments are extracted for all samples.
- Differential Abundance: The frequencies of these motifs are compared between the ALS and Control groups. Statistical tests are performed to identify motifs that are significantly more or less abundant in one group compared to the other.
- Sequence Logos: Visual representations called "sequence logos" are generated for the most differentially abundant motifs, highlighting conserved nucleotides and providing insights into potential cleavage preferences.
This analysis delves into the precise positioning of fragment ends relative to key genomic landmarks, such as transcription start sites (TSS) of genes and the centers of CpG islands. This can reveal patterns related to nucleosome positioning or transcription factor binding.
- Relative Positioning: The distacnes of fragment ends (both 5' and 3') from the centers of specified genomic features are calculated.
- Distribution Comparison: The distributions of these relative positions are plotted and compared between the ALS and Control groups. Regions where fragment ends are significantly enriched or depleted are identified.
- Statistical Significance: Statistical tests are applied to determine if differences in fragment end distributions within specific genomic "bins" around features are statistically significant between the disease and control groups. Significant bins are highlighted to draw attention to regions of interest.
The ultimate goal is to determine if the various cfDNA features identified in the previous steps can be used to accurately distinguish between ALS and Control samples. This section builds and evaluates predictive models.
- Feature Compilation: Key statistical summaries derived from the insert size analysis (e.g., mean, median, skewness, kurtosis, nucleosomal proportions, short fragment enrichment) are combined into a single dataset.
- Model Training: Two common machine learning models, Logistic Regression and Random Forest, are trained using this compiled feature set. Repeated cross-validation is used to ensure robust model evaluation.
- Performance Evaluation: The models' ability to classify samples is assessed using standard metrics such as:
- ROC AUC (Area Under the Receiver Operating Characteristic Currve): A measure of overall discriminatory power.
- Sensitivity: The model's ability to correctly identify ALS samples.
- Specificity: The model's ability to correctly identify Control samples.
- Precision: The accuracy of positive (ALS) predictions.
- F1-Score: A balance between precision and sensitivity.
- ROC Curves and Feature Importance: ROC curves are plotted to visualize the trade-off between sensitivity and specificity, and feature importance is extracted from the Random Forest model to highlight which cfDNA characteristics are most influential in predicting disease status.
The provided data is bisulfite-treated, which allows for the detection of DNA methylation.
While not explicitly implemented in the current scripts, extracting and analyzing methylation patterns from the XM tag in the BAM files is identified as a valuable area for future research.
This would involve identifying methylated cytosines and comparing their distribution and frequency between ALS and Control samples.
Additionally, DMA analysis and methylation patterens at fragment ends.
The current analysis looks at promoters and CpG islands. You could broaden this to other functional genomic elements because they were the easiest to extract. Future work can look at Enhancers an other regulatory elements, repeat elemetns and maybe disease specific regions.
We can expand on the profiling done to look at phasing as well as nuclesome free regions.
instead of using basic features we can create more complex features that may better than the current ones used. This will increase our F1- scores we can also explore more Advanced models like SVMs and Neural networks.
Having other modalities can improve predictive models of ALS
This analysis pipeline provides a framework for exploring cfDNA fragmentation patterns and their potential as biomarkers for distinguishing disease states.
The current implementation focuses on various aspects of fragment length, genomic distribution, and end motifs, culminating in a classification module.
The results, including statistical summaries, plots, and classification metrics, are output to the Figures directory, offering valuable insights into the differences between ALS and Control cfDNA profiles.