This repository provides a complete and modular pipeline to simulate, analyze, validate, and visualize bisulfite sequencing data for accurate estimation of bisulfite conversion efficiency. It supports multiple measurement strategies, validation metrics, and context-specific analyses.

- `bisulfite_simulation.py` – Data simulation engine
- `bisulfite_metrics.py` – Efficiency analysis and metric calculations
- `bisulfite_validation.py` – Validation and visualization tools
- `bisulfite_complete_pipeline.py` – Full pipeline integration
- `bisulfite_demo.py` – Demonstration script

`BisulfiteSimulator` class with biological realism:
- Simulates genomes with custom GC content
- Context-specific methylation (CpG, CHG, CHH)
- Bisulfite conversion with variable efficiency
- Realistic sequencing reads with quality scores and error models
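
The conversion step listed above can be illustrated with a minimal sketch. This is not the repository's `BisulfiteSimulator` code; the function name `bisulfite_convert`, its parameters, and the example fragment are assumptions made for illustration. It models conversion as an independent per-cytosine event with probability equal to the efficiency, and treats methylated cytosines as fully protected.

```python
import random


def bisulfite_convert(sequence, methylated_positions, efficiency, rng):
    """Return the sequence after simulated bisulfite treatment.

    Each unmethylated C becomes T with probability `efficiency`;
    methylated cytosines are protected and stay C.
    """
    converted = []
    for i, base in enumerate(sequence):
        is_unmethylated_c = base == "C" and i not in methylated_positions
        if is_unmethylated_c and rng.random() < efficiency:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


rng = random.Random(42)  # fixed seed so runs are reproducible
fragment = "ACGTCCGATC"  # hypothetical example fragment
methylated = {1}         # hypothetical: the C at index 1 is methylated

fully_converted = bisulfite_convert(fragment, methylated, 1.0, rng)
unconverted = bisulfite_convert(fragment, methylated, 0.0, rng)
print(fully_converted)
print(unconverted)
```

With an efficiency of 1.0 every unmethylated cytosine converts, and with 0.0 the fragment is returned unchanged; intermediate values give a random mix.
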

`ConversionEfficiencyAnalyzer` supports:
- Non-CpG method – CHG/CHH cytosines as proxies
- Lambda DNA method – Simulates unmethylated spike-ins
- CHH-specific method – Focus on rarely methylated sites
- Position-specific rates – Genome-wide patterns
- Context-dependent analysis – Local sequence effects
- Bootstrap CI – Confidence interval estimation
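
As a rough illustration of the non-CpG idea, the sketch below classifies each reference cytosine by context and counts how many CHG/CHH cytosines were read as T. The helper names `cytosine_context` and `non_cpg_efficiency` are hypothetical and are not taken from `bisulfite_metrics.py`. It assumes a single forward-strand read aligned base-for-base to the reference, an A/C/G/T alphabet, and that non-CpG cytosines are unmethylated, so any remaining C counts as a conversion failure.

```python
def cytosine_context(seq, i):
    """Classify the cytosine at index i as 'CpG', 'CHG', or 'CHH'.

    Returns None when the cytosine is too close to the end to classify.
    """
    nxt = seq[i + 1] if i + 1 < len(seq) else None
    nxt2 = seq[i + 2] if i + 2 < len(seq) else None
    if nxt == "G":
        return "CpG"
    if nxt is None or nxt2 is None:
        return None
    return "CHG" if nxt2 == "G" else "CHH"


def non_cpg_efficiency(reference, read):
    """Fraction of CHG/CHH reference cytosines that appear as T in the read."""
    converted = total = 0
    for i, ref_base in enumerate(reference):
        if ref_base != "C":
            continue
        if cytosine_context(reference, i) not in ("CHG", "CHH"):
            continue
        if read[i] == "T":
            converted += 1
            total += 1
        elif read[i] == "C":
            total += 1
        # any other base is treated as a sequencing error and skipped
    return converted / total if total else float("nan")


reference = "CATCGCTGCAA"  # hypothetical reference fragment
read = "TATCGTTGCAA"       # hypothetical aligned read after conversion
estimate = non_cpg_efficiency(reference, read)
print(round(estimate, 3))
```

In this toy example two of the three non-CpG cytosines are converted, and the CpG cytosine is ignored because its methylation state is unknown.
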

`ValidationFramework` for performance evaluation and `VisualizationSuite` for high-quality plots:
- Metrics include R², MAE, RMSE, correlation, and bias
- Method agreement, cross-validation, and reporting
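
The validation metrics named above can be computed from paired true and estimated efficiencies. The sketch below is one plausible way to do it, not the `ValidationFramework` implementation; the function name `validation_metrics` is hypothetical, and R² is assumed here to be the coefficient of determination against the true values (1 minus residual over total sum of squares), which may differ from the definition the repository uses.

```python
import math


def validation_metrics(true_vals, est_vals):
    """Return bias, MAE, RMSE, R², and Pearson correlation for paired values."""
    n = len(true_vals)
    errors = [e - t for t, e in zip(true_vals, est_vals)]
    bias = sum(errors) / n                       # mean signed error
    mae = sum(abs(x) for x in errors) / n        # mean absolute error
    ss_res = sum(x * x for x in errors)
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(true_vals) / n
    mean_est = sum(est_vals) / n
    ss_tot = sum((t - mean_true) ** 2 for t in true_vals)
    r2 = 1 - ss_res / ss_tot

    cov = sum((t - mean_true) * (e - mean_est)
              for t, e in zip(true_vals, est_vals))
    ss_est = sum((e - mean_est) ** 2 for e in est_vals)
    correlation = cov / math.sqrt(ss_tot * ss_est)

    return {"bias": bias, "mae": mae, "rmse": rmse,
            "r2": r2, "correlation": correlation}


# Hypothetical example: every estimate is 0.01 too high.
metrics = validation_metrics([0.90, 0.95, 0.99], [0.91, 0.96, 1.00])
print(metrics)
```

A constant offset like the one in this example leaves the correlation at 1 while still lowering R² and producing a nonzero bias, which is why the metrics are reported together.
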

`CompleteBisulfitePipeline` class for end-to-end execution:
- Automates simulation → analysis → validation → output
- Includes benchmarking and method comparison
- Saves results, plots, and summary reports
- Full working example of pipeline usage
- Step-by-step walkthrough of each module
- Visualizations and interpretation of results
- Best practices and tips
| Method | Description | Notes |
|---|---|---|
| Non-CpG | Uses CHG/CHH cytosines (rarely methylated) | General-purpose, robust |
| Lambda DNA | Simulates unmethylated control (spike-in) DNA | Good for bias correction |
| CHH Context | Focuses on CHH sites (<5% methylation in mammals) | Higher variance, good for validation |
| Confidence Intervals | Bootstrap-based uncertainty estimation | Adds statistical rigor |
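
To make the Lambda DNA and bootstrap rows more concrete, here is a small sketch under stated assumptions. Because the spike-in is unmethylated, every cytosine that remains C is counted as a conversion failure. The names `lambda_efficiency` and `bootstrap_ci` and the example counts are hypothetical, and the interval is a simple percentile bootstrap, which may not match the repository's exact procedure.

```python
import random


def lambda_efficiency(converted_count, unconverted_count):
    """Conversion efficiency from cytosine calls on an unmethylated spike-in."""
    total = converted_count + unconverted_count
    if total == 0:
        raise ValueError("no cytosine observations")
    return converted_count / total


def bootstrap_ci(outcomes, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes.

    `outcomes` holds 1 for a converted cytosine and 0 for an unconverted one.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = min(n_boot - 1, int(n_boot * (1 - alpha / 2)))
    return means[lo_idx], means[hi_idx]


# Hypothetical spike-in data: 198 converted and 2 unconverted cytosines.
point = lambda_efficiency(198, 2)
lower, upper = bootstrap_ci([1] * 198 + [0] * 2)
print(point, lower, upper)
```

The width of the interval shrinks as more cytosine observations are available, which is one reason sequencing depth matters for the estimate.
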
- Accuracy: R², MAE, RMSE, correlation
- Bias: Systematic over/under-estimation
- Precision: Within ±1%, ±2%, ±5% thresholds
- Consistency: Method agreement (coefficient of variation)
- Coverage Effect: Accuracy vs. sequencing depth
- Range Sensitivity: Efficiency-range-specific accuracy
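
Two of the metrics above, precision within a threshold and method agreement, are simple to sketch. The function names below are hypothetical, and the code assumes that "±1%" means an absolute error of 0.01 on the 0 to 1 efficiency scale and that agreement is the sample coefficient of variation across the per-method estimates for one sample.

```python
import statistics


def fraction_within(true_vals, est_vals, tolerance):
    """Share of estimates whose absolute error is at most `tolerance`."""
    hits = sum(abs(e - t) <= tolerance for t, e in zip(true_vals, est_vals))
    return hits / len(true_vals)


def coefficient_of_variation(estimates):
    """Sample standard deviation divided by the mean of the estimates."""
    return statistics.stdev(estimates) / statistics.mean(estimates)


# Hypothetical data: one true efficiency, four estimates with growing error.
true_vals = [0.95, 0.95, 0.95, 0.95]
est_vals = [0.955, 0.935, 0.98, 0.89]
precision = {tol: fraction_within(true_vals, est_vals, tol)
             for tol in (0.01, 0.02, 0.05)}

# Hypothetical estimates from three methods for one sample.
agreement_cv = coefficient_of_variation([0.97, 0.98, 0.99])
print(precision, agreement_cv)
```

A lower coefficient of variation indicates that the methods agree more closely on the same sample.
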
| Category | R² | MAE |
|---|---|---|
| Acceptable | > 0.90 | < 0.02 |
| Good | > 0.95 | < 0.01 |
| Excellent | > 0.98 | < 0.005 |
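
The thresholds in the table can be applied programmatically. The sketch below assumes that both the R² and the MAE criterion must hold for a category to apply, which the table does not state explicitly; the function name `performance_category` and the "Below acceptable" label are hypothetical.

```python
def performance_category(r2, mae):
    """Map R² and MAE onto the best category whose two thresholds both hold."""
    tiers = [
        ("Excellent", 0.98, 0.005),
        ("Good", 0.95, 0.01),
        ("Acceptable", 0.90, 0.02),
    ]
    for name, min_r2, max_mae in tiers:
        if r2 > min_r2 and mae < max_mae:
            return name
    return "Below acceptable"


print(performance_category(0.99, 0.004))
print(performance_category(0.99, 0.015))
```

Under this reading, a high R² paired with a larger MAE falls back to the category that the weaker metric supports.
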
- Primary Method: Use Non-CpG analysis for routine estimation
- Validation: Combine Lambda DNA spike-ins with other methods
- Coverage: Aim for ≥ 20x sequencing depth
- Controls: Include both positive (unmethylated) and negative (methylated) controls
- Replication: Use technical replicates for sensitive samples

```python
from bisulfite_complete_pipeline import CompleteBisulfitePipeline

# Initialize pipeline
pipeline = CompleteBisulfitePipeline(output_dir="Results_Demo")

# Run the complete analysis
results = pipeline.run_complete_analysis(
    genome_length=10000,
    efficiency_range=(0.90, 0.999),
    n_efficiency_points=8,
    coverage=30
)
```

Important

For any questions, please contact: 👉 Ashok K. Sharma; ashoks773@gmail.com or compbiosharma@gmail.com