This directory documents the downstream analysis workflow used for the in vivo MPRA IGVF submission.
It should be read primarily as a project record of how the analysis was performed, not as a fully portable or fully reproducible software package. Several scripts retain project-specific assumptions about sample naming, directory structure, metadata columns, and available reference files.
This directory is intended to:
- record the sequence of downstream analysis steps used in this project
- preserve cleaned copies of the main scripts used for those steps
- show how the final IGVF submission tables were derived
This directory is not:
- a general-purpose MPRA framework
- a guaranteed end-to-end reproduction package for another environment
- a complete archive of every upstream input needed to rerun the project from scratch
The documented workflow is organized into six steps.
- scripts/step01_count_prefix.sh
  - counts 5' prefix or barcode sequences from FASTQ files
  - adapted from Count_Matrix_Kat_26_02_03/Module/A1.word_count.sh
- scripts/step02_merge_counts.py
  - merges lane-level count tables into a sample-level barcode count matrix
  - adapted from Count_Matrix_Kat_26_02_03/Module/A2.build_count_matrix.py
- scripts/step03_variant_mapping.py
  - maps barcode rows to variant IDs and builds variant-level count matrices
  - adapted from In_Vivo_MPRA_analysis/Module/A1_0.ct_mat.py
  - related sidecar utility: scripts/step03b_estimate_variant_nbc.py
- scripts/step04_pooling.py
  - pools replicate-level RNA and DNA counts according to the study pooling scheme
  - adapted from In_Vivo_MPRA_analysis/Module/C01.pooling_replicates.py
- scripts/step05_mpra_analysis.R
  - runs activity and allelic MPRA analyses on pooled counts
  - adapted from In_Vivo_MPRA_analysis/Module/C02.mpra_analysis.R
- scripts/step06_format_igvf_outputs.py
  - formats the project results into IGVF submission tables
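The variant mapping and pooling steps (steps 3 and 4) can be pictured with the minimal sketch below. It is not the project code: the function names, the in-memory dictionary layout, the choice to sum counts, and the handling of unmapped barcodes or unlisted samples are all assumptions made for illustration. The sample and pool labels are hypothetical.

```python
from collections import defaultdict


def aggregate_by_variant(barcode_counts, barcode_to_variant):
    """Sum per-sample barcode counts into per-variant counts.

    barcode_counts: {barcode: {sample: count}}
    barcode_to_variant: {barcode: variant_id}
    Assumption: barcodes without a mapping are skipped.
    """
    variant_counts = defaultdict(lambda: defaultdict(int))
    for barcode, per_sample in barcode_counts.items():
        variant = barcode_to_variant.get(barcode)
        if variant is None:
            continue
        for sample, count in per_sample.items():
            variant_counts[variant][sample] += count
    return {v: dict(s) for v, s in variant_counts.items()}


def pool_replicates(variant_counts, pooling_scheme):
    """Sum replicate columns that share a pool label.

    pooling_scheme: {sample: pool_name}
    Assumption: samples absent from the scheme are dropped.
    """
    pooled = defaultdict(lambda: defaultdict(int))
    for variant, per_sample in variant_counts.items():
        for sample, count in per_sample.items():
            pool = pooling_scheme.get(sample)
            if pool is None:
                continue
            pooled[variant][pool] += count
    return {v: dict(p) for v, p in pooled.items()}


# Hypothetical toy data: three barcodes, one of them unmapped.
barcode_counts = {
    "AAAA": {"RNA_1": 5, "RNA_2": 1, "DNA_1": 2, "DNA_2": 2},
    "CCCC": {"RNA_1": 3, "RNA_2": 0, "DNA_1": 4, "DNA_2": 1},
    "GGGG": {"RNA_1": 7, "RNA_2": 7, "DNA_1": 1, "DNA_2": 1},
}
barcode_to_variant = {"AAAA": "var1", "CCCC": "var1"}
pooling_scheme = {
    "RNA_1": "RNA_pool1",
    "RNA_2": "RNA_pool1",
    "DNA_1": "DNA_pool1",
    "DNA_2": "DNA_pool1",
}

variant_counts = aggregate_by_variant(barcode_counts, barcode_to_variant)
pooled_counts = pool_replicates(variant_counts, pooling_scheme)
print(pooled_counts)
```

Keeping the two operations as separate functions mirrors the split between step 3 and step 4, so the variant-level matrix can be inspected before pooling.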
- scripts/: cleaned step-level scripts reflecting the project workflow
- config/: config templates used by the formatter or example wrapper
- pooling_scheme.csv: local copy of the pooling assignment table
- output/: example final IGVF output files
- run_all_steps.sh: project-specific example wrapper showing step order
- input.md: notes on upstream files used by the analysis
- scripts/run_step03b_estimate_variant_nbc.sbatch: separate Slurm entrypoint for optional n_bc estimation
This workflow ultimately feeds five submission files:
- Reporter_Experiment.tsv.gz
- Reporter_Element.tsv.gz
- Reporter_Variant.tsv.gz
- Reporter_Genomic_Element_Effect.bed.gz
- Reporter_Genomic_Variant_Effect.bed.gz
Example output files are included under output/ as references.
- The scripts in this directory were cleaned to make the flow easier to follow, but they still mirror project-specific conventions from the original analysis.
- Sample naming assumptions such as _CORT_R and _CORT_D are preserved where they were part of the original workflow.
- Some runtime inputs referenced by the scripts are not included in this directory.
- run_all_steps.sh is best understood as an execution example for this project, not as a claim that the directory is self-contained.
- n_bc estimation is treated as a separate sidecar calculation rather than a required part of the display pipeline.
- The IGVF formatter depends on finalized input paths and field mappings that come from this specific submission context.
The commands below illustrate how each step was invoked in this project. They are examples, not a portability guarantee.
bash scripts/step01_count_prefix.sh \
  --json-file config/fastq_dict.json \
  --output-dir output/counts \
  --prefix-length 20 \
  --jobs 8

Requirements:
- jq
- parallel
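For readers who want to see what prefix counting amounts to, here is a minimal Python sketch. It is not a port of step01_count_prefix.sh, which relies on jq and parallel. The function name, the assumption of uncompressed four-line FASTQ records, and the choice to skip reads shorter than the prefix length are all illustrative.

```python
import io
from collections import Counter


def count_prefixes(fastq_handle, prefix_length=20):
    """Count the first `prefix_length` bases of every read.

    Assumes standard four-line FASTQ records, so the sequence is the
    second line of each record. Reads shorter than the prefix length
    are skipped (an assumption, not documented project behavior).
    """
    counts = Counter()
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 != 1:
            continue
        sequence = line.strip()
        if len(sequence) >= prefix_length:
            counts[sequence[:prefix_length]] += 1
    return counts


# Hypothetical three-read FASTQ; the last read is too short to count.
example_fastq = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTTTTT\n+\nIIIIIIII\n"
    "@r3\nAC\n+\nII\n"
)
prefix_counts = count_prefixes(example_fastq, prefix_length=4)
print(prefix_counts)
```

For gzip-compressed input, the same function can be given a handle from gzip.open(path, "rt").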
python scripts/step02_merge_counts.py \
  --count-dir output/counts \
  --fastq-dict config/fastq_dict.json \
  --output output/count_matrix_merged.csv

Optional flags:
- --non-strict-sample-match
- --log-level DEBUG|INFO|WARNING|ERROR
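The idea of merging lane-level count tables into a sample-level matrix can be sketched as follows. This is an illustration under stated assumptions, not the logic of step02_merge_counts.py: the function name, the lane identifiers, the in-memory layout, and the decision to raise on an unmapped lane are hypothetical. It makes no claim about what --non-strict-sample-match does.

```python
from collections import defaultdict


def merge_lanes(lane_counts, lane_to_sample):
    """Sum lane-level barcode counts into one column per sample.

    lane_counts: {lane_id: {barcode: count}}
    lane_to_sample: {lane_id: sample_name}
    Returns {barcode: {sample_name: count}}, with zeros filled in for
    samples in which a barcode was not observed.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for lane_id, counts in lane_counts.items():
        if lane_id not in lane_to_sample:
            # Assumption: an unmapped lane is treated as an error.
            raise ValueError(f"no sample assigned to lane {lane_id!r}")
        sample = lane_to_sample[lane_id]
        for barcode, count in counts.items():
            matrix[barcode][sample] += count
    samples = sorted(set(lane_to_sample.values()))
    return {
        barcode: {s: per_sample.get(s, 0) for s in samples}
        for barcode, per_sample in matrix.items()
    }


# Hypothetical lane tables: sample S1 was sequenced on two lanes.
lane_counts = {
    "S1_L001": {"AAAA": 2, "CCCC": 1},
    "S1_L002": {"AAAA": 3},
    "S2_L001": {"CCCC": 4},
}
lane_to_sample = {"S1_L001": "S1", "S1_L002": "S1", "S2_L001": "S2"}
count_matrix = merge_lanes(lane_counts, lane_to_sample)
print(count_matrix)
```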
python scripts/step03_variant_mapping.py \
  --count-matrix-merged output/count_matrix_merged.csv \
  --barcode-mapping-vb config/barcode_mapping_vb.csv \
  --barcode-mapping-gvvc1 config/barcode_mapping_gvvc1.csv \
  --barcode-mapping-gvvc2 config/barcode_mapping_gvvc2.csv \
  --output-with-variant output/count_matrix_merged_with_variant.csv \
  --output-variant-full output/count_matrix_variant_full.csv \
  --output-variant-filtered output/count_matrix_variant_filtered.csv

python scripts/step04_pooling.py \
  --input output/count_matrix_merged_with_variant.csv \
  --output output/pooling \
  --prefix pooling_scheme_3 \
  --pooling_scheme pooling_scheme.csv

Rscript scripts/step05_mpra_analysis.R \
  --input_file output/pooling \
  --prefix pooling_scheme_3 \
  --config_json config/analysis_config.json \
  --output_dir output/pooling_scheme_3 \
  --num_cores 8

python scripts/step06_format_igvf_outputs.py \
  --config-json config/step6_paths.template.json

bash run_all_steps.sh

This wrapper shows the intended order of operations for this project. It assumes that the required project-specific config files and upstream resources already exist.
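Because the wrapper assumes its inputs already exist, a small pre-flight check can be useful before running it. The sketch below is hypothetical and not part of the project. The path list is copied from the example commands in this document, and whether run_all_steps.sh uses exactly these paths is an assumption.

```python
from pathlib import Path

# Paths taken from the example commands in this document; whether the
# wrapper reads exactly these files is an assumption.
REQUIRED_INPUTS = [
    "config/fastq_dict.json",
    "config/barcode_mapping_vb.csv",
    "config/barcode_mapping_gvvc1.csv",
    "config/barcode_mapping_gvvc2.csv",
    "config/analysis_config.json",
    "config/step6_paths.template.json",
    "pooling_scheme.csv",
]


def missing_inputs(base_dir, relative_paths):
    """Return the relative paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [p for p in relative_paths if not (base / p).exists()]


# Check relative to the current working directory.
missing = missing_inputs(".", REQUIRED_INPUTS)
if missing:
    print("Missing inputs:")
    for path in missing:
        print(f"  {path}")
else:
    print("All listed inputs are present.")
```

Run from the directory that contains run_all_steps.sh, this only reports which of the listed files are absent. It does not validate their contents.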
Python dependencies are listed in requirements.txt and can be installed with:

python -m pip install -r requirements.txt