Scripts for downloading, processing, and visualizing CLIP (Cross-Linking and Immunoprecipitation) data from Flow.bio.
Requirements:

- Conda environment with required packages:

  ```bash
  conda activate bioinf
  ```

- API credentials: Create `credentials.json` in this directory:

  ```json
  { "username": "your_email@example.com", "password": "your_password" }
  ```
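The scripts presumably read this file at startup; below is a minimal sketch of loading it yourself, e.g. for ad-hoc API calls. The `load_credentials` helper is illustrative, not the repository's actual code:

```python
import json

def load_credentials(path="credentials.json"):
    """Load the Flow.bio username/password described above.

    Illustrative helper; the download scripts may structure this
    differently.
    """
    with open(path) as fh:
        creds = json.load(fh)
    # Fail early with a clear message if a key is missing
    missing = {"username", "password"} - creds.keys()
    if missing:
        raise KeyError(f"credentials.json is missing: {sorted(missing)}")
    return creds["username"], creds["password"]

username, password = load_credentials()
```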
The analysis requires three types of data files:
| File Type | Directory | Description |
|---|---|---|
| `*_UMICollapse.log` | `data/` | PCR duplication stats and crosslink counts |
| `*.summary_subtype_premapadjusted.tsv` | `data_subtype/` | Regional distribution (CDS, UTR, intron, etc.) |
| `*.genome_5mer_distribution_genome.tsv` | `data_peka/` | PEKA motif enrichment scores |
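Once downloads finish, a quick way to check that all three input sets are in place (a convenience sketch, not part of the repository's scripts):

```python
from pathlib import Path

# Patterns and directories from the table above
inputs = {
    "data": "*_UMICollapse.log",
    "data_subtype": "*.summary_subtype_premapadjusted.tsv",
    "data_peka": "*.genome_5mer_distribution_genome.tsv",
}

for directory, pattern in inputs.items():
    n = len(list(Path(directory).glob(pattern)))
    print(f"{directory}: {n} files matching {pattern}")
```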
All downloaded files are prefixed with their Flow.bio sample_id for traceability:

```
{sample_id}_{original_filename}
```

Example: `12345_HNRNPC.unique_genome.dedup_UMICollapse.log`
This ensures every file can be traced back to its source sample in the Flow.bio database.
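Since the prefix is joined to the original name with an underscore, the sample ID can be recovered with a single split; a small illustrative helper (assumes the sample_id itself contains no underscore):

```python
def split_sample_id(filename):
    """Split '{sample_id}_{original_filename}' into its two parts.

    Illustrative helper; assumes the sample_id contains no underscore.
    """
    sample_id, original = filename.split("_", 1)
    return sample_id, original

# Example from above
sample_id, original = split_sample_id("12345_HNRNPC.unique_genome.dedup_UMICollapse.log")
assert sample_id == "12345"
assert original == "HNRNPC.unique_genome.dedup_UMICollapse.log"
```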
This downloads UMICollapse.log files and creates filtered_data.json with basic sample info:

```bash
# Fresh download (re-fetches all public samples from API)
python download_files.py --fresh

# Or use cached sample list (faster, uses existing filtered_data.json)
python download_files.py
```

Note: the `--fresh` flag fetches all public samples from the API, which can take 10-30 minutes depending on the number of samples.
This fetches detailed metadata (purification target, experimental method, cell type, etc.) for each sample:

```bash
python enrich_sample_metadata.py
```

Next, download the regional distribution files:

```bash
python download_additional_files.py \
    --regex ".*summary_subtype_premapadjusted\.tsv" \
    --dir data_subtype
```

Then download the PEKA motif files:

```bash
python download_additional_files.py \
    --regex ".*genome_5mer_distribution_genome\.tsv" \
    --dir data_peka
```

Finally, generate the plot:

```bash
python plot_circos.py
```

This creates regional_distribution_circos_motif.png with:
- Hierarchical clustering based on PEKA motif enrichment (see the sketch after this list)
- Regional distribution (stacked bars)
- PCR duplication ratio
- Crosslink counts
- Dinucleotide sequence specificity
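The clustering ring can be reproduced outside the plotting script; here is a minimal sketch of hierarchical clustering on a PEKA score matrix. A randomly generated stand-in matrix is used because the exact way plot_circos.py assembles the samples × 5-mers table from data_peka/ is not shown here:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list

# Stand-in matrix: one row per sample, one column per 5-mer (4^5 = 1024),
# values are PEKA enrichment scores. How plot_circos.py actually builds
# this from the data_peka/ TSVs may differ.
rng = np.random.default_rng(0)
peka_scores = pd.DataFrame(
    rng.normal(size=(20, 1024)),
    index=[f"sample_{i}" for i in range(20)],
)

# Ward linkage on the motif-enrichment profiles
Z = linkage(peka_scores.values, method="ward")

# Leaf order from the dendrogram, e.g. for arranging circos sectors
ordered_samples = peka_scores.index[leaves_list(Z)]
print(list(ordered_samples[:5]))
```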
Run all download and plot generation steps:

```bash
# Activate environment
conda activate bioinf

# 1. Download main files (UMICollapse.log)
python download_files.py --fresh

# 2. Enrich metadata with full sample details
python enrich_sample_metadata.py

# 3. Download regional distribution files
python download_additional_files.py -r ".*summary_subtype_premapadjusted\.tsv" -d data_subtype

# 4. Download PEKA files
python download_additional_files.py -r ".*genome_5mer_distribution_genome\.tsv" -d data_peka

# 5. Generate plot
python plot_circos.py
```

`download_files.py` downloads the main data files (UMICollapse.log) and updates metadata.
Options:

- `--fresh` - Re-fetch all public samples from API (slower but complete)
- `--regex` - Filename pattern to match (default: UMICollapse.log)
- `--dir` - Output directory (default: data)
- `--json` - Output JSON file (default: filtered_data.json)
- `--slurm` - Generate SLURM job scripts instead of downloading
- `--slurm-dir` - Directory for SLURM job scripts (default: slurm_jobs)
- `--no-sample-id-prefix` - Don't prefix filenames with sample_id (not recommended)
`download_additional_files.py` provides fast downloads of additional file types using existing sample IDs.
Options:

- `--regex`, `-r` - Filename regex pattern (required)
- `--dir`, `-d` - Output directory (required)
- `--source`, `-s` - Source JSON with sample IDs (default: filtered_data.json)
- `--workers`, `-w` - Parallel workers (default: 8)
- `--slurm` - Generate SLURM job scripts instead of downloading
- `--slurm-dir` - Directory for SLURM job scripts (default: slurm_jobs)
- `--no-sample-id-prefix` - Don't prefix filenames with sample_id
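The `--workers` option implies that files are fetched concurrently; a sketch of the likely thread-pool pattern (the `download_one` function and the file list are placeholders, not the script's actual internals):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_one(sample_id, url):
    """Placeholder for the per-file download logic (e.g. an
    authenticated GET written into the output directory)."""
    return sample_id

# Illustrative (sample_id, url) pairs; the real script reads these
# from filtered_data.json.
files = [("12345", "https://example.org/a.tsv"), ("12346", "https://example.org/b.tsv")]

with ThreadPoolExecutor(max_workers=8) as pool:  # mirrors --workers 8
    futures = [pool.submit(download_one, sid, url) for sid, url in files]
    for fut in as_completed(futures):
        print("finished sample", fut.result())
```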
`generate_sample_metrics.py` generates a CSV with sample metrics from the downloaded UMICollapse log files.

```bash
python generate_sample_metrics.py --output sample_metrics.csv
```

Output columns:

- `sample_id` - Flow.bio sample identifier
- `pcr_duplication_rate` - Ratio of input reads to deduplicated reads
- `millions_of_crosslinks` - Number of deduplicated reads / 1,000,000
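Both columns follow directly from two read counts in each UMICollapse log; a sketch of the extraction, assuming the log reports input and post-deduplication read counts in lines resembling the regex patterns below (check a downloaded log and adjust the wording if it differs):

```python
import re

def metrics_from_log(path):
    """Derive the two CSV columns from one UMICollapse log.

    The regexes are assumptions about the log wording; inspect one of
    your downloaded logs and adjust them if the phrasing differs.
    """
    text = open(path).read()
    input_reads = int(
        re.search(r"input reads:?\s+([\d,]+)", text, re.I).group(1).replace(",", "")
    )
    dedup_reads = int(
        re.search(r"reads after deduplicating:?\s+([\d,]+)", text, re.I).group(1).replace(",", "")
    )
    return {
        "pcr_duplication_rate": input_reads / dedup_reads,
        "millions_of_crosslinks": dedup_reads / 1_000_000,
    }
```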
`plot_circos.py` generates the circos visualization.

```python
from plot_circos import create_circos_plot

# Default settings
create_circos_plot()

# Custom filtering
create_circos_plot(
    output_file='my_plot.png',
    min_crosslinks=150000,  # Minimum deduplicated reads
    max_pcr_ratio=5         # Maximum PCR duplication ratio
)
```

`enrich_sample_metadata.py` enriches filtered_data.json with detailed sample metadata from the API.

```bash
python enrich_sample_metadata.py
```

The circos plot applies these quality filters:
- Minimum crosslinks: 150,000 deduplicated reads
- Maximum PCR ratio: 5 (input reads / deduplicated reads)
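Applied to sample_metrics.csv, the same filters can be expressed as a pandas query; a short sketch using the column names documented above:

```python
import pandas as pd

metrics = pd.read_csv("sample_metrics.csv")

# Keep samples with >= 150,000 deduplicated reads and a PCR duplication
# ratio <= 5, matching the create_circos_plot() defaults shown above.
passed = metrics[
    (metrics["millions_of_crosslinks"] * 1_000_000 >= 150_000)
    & (metrics["pcr_duplication_rate"] <= 5)
]
print(f"{len(passed)}/{len(metrics)} samples pass the quality filters")
```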
| File | Description |
|---|---|
| `filtered_data.json` | Sample metadata and file information |
| `sample_metrics.csv` | Sample ID, PCR duplication rate, crosslink counts |
| `regional_distribution_circos_motif.png` | Main circos visualization |
| `pcr_duplication_histogram.png` | PCR duplication ratio distribution |
| `pcr_duplication_by_target.png` | Box plot by purification target |
For large downloads on HPC clusters, use the `--slurm` flag to generate job scripts instead of downloading directly:

```bash
# 1. Generate SLURM job scripts (no downloads yet)
python download_files.py --slurm --slurm-dir slurm_logs

# 2. Submit as array job (efficient)
sbatch slurm_logs/submit_array.sh

# 3. Monitor progress
squeue -u $USER

# 4. Once complete, generate metrics
python generate_sample_metrics.py --output sample_metrics.csv
```

For additional file types:
```bash
# Generate SLURM jobs for PEKA files
python download_additional_files.py \
    --regex ".*genome_5mer_distribution_genome\.tsv" \
    --dir data_peka \
    --slurm --slurm-dir slurm_peka

# Submit
sbatch slurm_peka/submit_array.sh
```

The SLURM scripts include:

- Individual job scripts (`dl_00000.sh`, etc.)
- Array job script (`submit_array.sh`) - submits up to 50 jobs at a time
- Log directory for stdout/stderr
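Conceptually, the generator writes one small shell script per download plus an array wrapper; a simplified Python sketch of what `--slurm` might produce (the script contents, the `%50` concurrency throttle, and log paths are assumptions based on the description above):

```python
from pathlib import Path

def write_slurm_jobs(commands, slurm_dir="slurm_jobs", max_concurrent=50):
    """Write dl_NNNNN.sh scripts plus a submit_array.sh wrapper.

    Simplified sketch of what --slurm might produce; the real scripts
    likely also set partitions, memory limits, and time limits.
    """
    out = Path(slurm_dir)
    (out / "logs").mkdir(parents=True, exist_ok=True)
    # One tiny shell script per download command
    for i, cmd in enumerate(commands):
        (out / f"dl_{i:05d}.sh").write_text(f"#!/bin/bash\n{cmd}\n")
    # Array wrapper: runs every dl_NNNNN.sh, at most max_concurrent at once
    array_script = (
        "#!/bin/bash\n"
        f"#SBATCH --array=0-{len(commands) - 1}%{max_concurrent}\n"
        f"#SBATCH --output={slurm_dir}/logs/%A_%a.out\n"
        f"bash {slurm_dir}/dl_$(printf '%05d' $SLURM_ARRAY_TASK_ID).sh\n"
    )
    (out / "submit_array.sh").write_text(array_script)

# Hypothetical per-file commands
write_slurm_jobs([f"echo downloading file {i}" for i in range(3)])
```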
If downloads are slow or failing, the API may be rate-limiting requests. Wait a few minutes and try again, or reduce the number of workers (a retry sketch follows below):

```bash
python download_additional_files.py -r "..." -d ... --workers 2
```

If some samples are missing data files, they may not have been processed yet on Flow.bio. Re-run the download scripts to fetch newly available files.
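For scripted retries against a rate-limited API, exponential backoff is the usual pattern; a generic sketch (the `fetch` callable is a placeholder for one download attempt, not a function from this repository):

```python
import time

def with_backoff(fetch, attempts=5, base_delay=2.0):
    """Retry a flaky download with exponential backoff.

    Generic pattern, not part of this repository's scripts; `fetch`
    stands in for whatever performs one download attempt.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:
            if attempt == attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
```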
If `conda activate` fails in scripts, try:

```bash
source ~/miniforge3/etc/profile.d/conda.sh
conda activate bioinf
```