New datasets #2
This PR is against the
Pull request overview
This PR expands the Domainsplit pipeline’s dataset assembly and splitting logic by introducing additional DDI/PPI sources (HIPPIE, PPIDM), a revised negative-DDI generation flow (candidate-pool build + DANS selection + method-labeled insertion), and updated nf-core utility/module plumbing to support the new parameters and outputs.
Changes:
- Refactors DDI collection into per-source insertion steps (3did, HIPPIE single-domain PPIs, PPIDM, Negatome) and adds a two-method DANS-based negative-DDI workflow.
- Updates database/schema semantics to allow the same domain pair to coexist under multiple `source` labels and adjusts split strategies to run per negative-DDI method.
- Updates nf-core utility subworkflows/modules (nf-schema plugin, params JSON dump, mmseqs module version reporting) and adds Python tests for the new negative selection/insertion logic.
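As a rough illustration of the second point, the sketch below shows how a UNIQUE constraint that includes `source` lets one domain pair exist once per source while still rejecting duplicates within a source. The table name `ddi`, its columns, the helper names, and the label strings `3did` and `ppidm` are assumptions made for this example; they are not taken from `bin/ddi_db_utils.py` or the pipeline's actual schema.

```python
import sqlite3


def canonical_pair(domain_a, domain_b):
    """Order a domain pair so that (A, B) and (B, A) map to the same row."""
    return (domain_a, domain_b) if domain_a <= domain_b else (domain_b, domain_a)


def insert_ddi(conn, domain_a, domain_b, label, source):
    """Insert one DDI and return True only if a new row was written."""
    first, second = canonical_pair(domain_a, domain_b)
    cursor = conn.execute(
        "INSERT OR IGNORE INTO ddi (domain_a, domain_b, label, source) "
        "VALUES (?, ?, ?, ?)",
        (first, second, label, source),
    )
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
# Hypothetical schema: uniqueness is per (pair, source), not per pair alone.
conn.execute(
    """
    CREATE TABLE ddi (
        domain_a TEXT NOT NULL,
        domain_b TEXT NOT NULL,
        label INTEGER NOT NULL,
        source TEXT NOT NULL,
        UNIQUE (domain_a, domain_b, source)
    )
    """
)

first_insert = insert_ddi(conn, "PF00001", "PF00002", 1, "3did")
# Same pair in reverse order, same source: ignored after canonicalization.
reversed_duplicate = insert_ddi(conn, "PF00002", "PF00001", 1, "3did")
# Same pair under another source: kept as a separate row.
other_source = insert_ddi(conn, "PF00001", "PF00002", 1, "ppidm")
row_count = conn.execute("SELECT COUNT(*) FROM ddi").fetchone()[0]
```

With the constraint on the pair alone, the third insert would be dropped; including `source` is what keeps per-source provenance for the same pair.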
Reviewed changes
Copilot reviewed 49 out of 59 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| workflows/domainsplit.nf | Wires new dataset parameters into the top-level workflow and updates ProtT5 param usage. |
| nextflow.config | Updates defaults and introduces required dataset inputs (HIPPIE/PPIDM/negative PPI parquet, seed, etc.). |
| nextflow_schema.json | Updates schema groups/required fields and adds new parameters. |
| conf/modules.config | Adjusts publishDir rules for new/renamed processes and publishes negative-method score TSV. |
| modules.json | Bumps tracked nf-core module/subworkflow SHAs. |
| subworkflows/local/collect_ddi_data/meta.yml | Documents inputs/outputs/components for the refactored DDI collection step. |
| subworkflows/local/collect_ddi_data/main.nf | Implements ordered per-source insertion + DANS negative construction (pool + select + insert). |
| subworkflows/local/curate_domains/meta.yml | Adds nf-core-style metadata for the curate-domains subworkflow. |
| subworkflows/local/enrich_ddi_database/meta.yml | Adds metadata describing enrichment inputs/outputs (incl. embeddings). |
| subworkflows/local/enrich_ddi_database/main.nf | Adjusts channel handling for the enriched DB output. |
| subworkflows/local/generate_embeddings/meta.yml | Adds metadata for the embedding-generation subworkflow. |
| subworkflows/local/generate_embeddings/main.nf | Updates documentation comment for ProtT5 parameter name. |
| subworkflows/local/split_domainsplit_database/meta.yml | Updates metadata to reflect new split strategy/component naming. |
| subworkflows/local/split_domainsplit_database/main.nf | Runs split strategies per negative-DDI method and adds external-validation test-set subsetting. |
| subworkflows/local/utils_nfcore_domainsplit_pipeline/meta.yml | Introduces metadata for pipeline init/completion wrapper subworkflow. |
| subworkflows/local/utils_nfcore_domainsplit_pipeline/main.nf | Updates nf-schema plugin invocation signature and cleans TODO comments. |
| subworkflows/nf-core/utils_nextflow_pipeline/main.nf | Updates parameter-dump JSON serialization and copy logic. |
| subworkflows/nf-core/utils_nfschema_plugin/meta.yml | Extends plugin subworkflow inputs with additional CLI/help knobs. |
| subworkflows/nf-core/utils_nfschema_plugin/main.nf | Adapts nf-schema help/summary/validation option handling and adds cli_typecast support. |
| subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow.config | Bumps nf-schema plugin version used by tests. |
| subworkflows/nf-core/utils_nfschema_plugin/tests/main.nf.test | Updates test input arity for new plugin-subworkflow parameter(s). |
| modules/nf-core/mmseqs/easycluster/main.nf | Updates container-engine detection and switches to topic-based versions emission. |
| modules/nf-core/mmseqs/easycluster/meta.yml | Updates module metadata for new versions output/topic structure and YAML formatting. |
| modules/nf-core/mmseqs/easycluster/tests/main.nf.test.snap | Updates snapshots for new versions output format and Nextflow version. |
| modules/local/init_domainsplit_db/main.nf | Updates UNIQUE constraint to include source. |
| modules/local/random_ddi_split/main.nf | Adds source filtering for method-specific split generation. |
| modules/local/minimal_leakage_split/main.nf | Adds source filtering to leakage-aware splitter. |
| modules/local/external_validation_split/main.nf | New module to subset DDIs by source and prune resulting DB. |
| modules/local/external_validation_split/environment.yml | Conda env for the new external-validation subset module. |
| modules/local/insert_3did/main.nf | New process wrapper for inserting 3did positives via a Python helper. |
| modules/local/insert_3did/environment.yml | Conda env for insert_3did. |
| modules/local/insert_single_domain_ppi/main.nf | New process wrapper for HIPPIE single-domain PPI → positive DDI insertion. |
| modules/local/insert_single_domain_ppi/environment.yml | Conda env for insert_single_domain_ppi. |
| modules/local/insert_ppidm/main.nf | New process wrapper for PPIDM TSV insertion. |
| modules/local/insert_ppidm/environment.yml | Conda env for insert_ppidm. |
| modules/local/insert_negatome/main.nf | New process wrapper for inserting Negatome negatives. |
| modules/local/insert_negatome/environment.yml | Conda env for insert_negatome. |
| modules/local/remove_self_interactions/main.nf | New process to remove self-interactions from the SQLite DB. |
| modules/local/remove_self_interactions/environment.yml | Conda env for remove_self_interactions. |
| modules/local/swissprot_map/main.nf | New process to build SwissProt→Pfam map used by HIPPIE single-domain inference. |
| modules/local/swissprot_map/environment.yml | Conda env for swissprot_map builder. |
| modules/local/build_ppi_negative_pool/main.nf | Refactors old negative-DDI builder into deterministic candidate-pool dump step. |
| modules/local/build_ppi_negative_pool/environment.yml | Adds numpy dependency for pool dump. |
| modules/local/select_ppi_negative_dans/main.nf | New process for per-method DANS selection from the pool dump. |
| modules/local/select_ppi_negative_dans/environment.yml | Conda env for DANS selector (python/numpy). |
| modules/local/insert_ppi_negative_selection/main.nf | New process to insert both negative-method outputs + score TSV into SQLite. |
| modules/local/insert_ppi_negative_selection/environment.yml | Conda env for negative selection inserter. |
| modules/local/insert_ddis/main.nf | Removes the old monolithic INSERT_DDIS module. |
| bin/ddi_db_utils.py | Adds shared canonicalization + insertion helpers with cross-source dedup support. |
| bin/insert_3did.py | Implements 3did insertion using shared DB helpers. |
| bin/insert_single_domain_ppi.py | Implements HIPPIE single-domain inference/insertion using SwissProt map. |
| bin/insert_ppidm.py | Implements PPIDM TSV parsing/insertion with class-priority semantics. |
| bin/insert_negatome.py | Implements Negatome parsing/insertion using shared DB helpers. |
| bin/build_swissprot_pfam_map.py | Downloads/parses UniProt stream to create accession/name→Pfam mapping JSON. |
| bin/build_ppi_negative_pool.py | Builds deterministic candidate pool + method-specific statistics dump for DANS. |
| bin/select_ppi_negative_dans.py | Performs per-method uncapped DANS selection and emits pairs + score JSON. |
| bin/insert_ppi_negative_selection.py | Inserts method-labeled positives/negatives and consolidates score reporting. |
| tests/python/test_select_ppi_negative_dans.py | Adds unit-style checks for both DANS selection methods. |
| tests/python/test_insert_ppi_negative_selection.py | Adds integration-style check for schema + insertion behavior across sources. |
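The source filtering listed for the split modules could look roughly like the following sketch: keep only rows whose `source` is in an allowed set, then split deterministically from a seed. The function name, the row layout `(domain_a, domain_b, source)`, and the labels `neg_method_a` and `neg_method_b` are placeholders invented for this example; the actual method labels and split logic in `random_ddi_split` are not shown in this PR overview.

```python
import random


def random_split_by_source(rows, allowed_sources, test_fraction, seed):
    """Filter rows by source, then split them reproducibly into train/test."""
    # Sort first so the result depends only on the seed, not on input order.
    selected = sorted(row for row in rows if row[2] in allowed_sources)
    rng = random.Random(seed)
    rng.shuffle(selected)
    n_test = int(len(selected) * test_fraction)
    return selected[n_test:], selected[:n_test]


# Hypothetical rows: (domain_a, domain_b, source).
rows = [
    ("PF00001", "PF00002", "3did"),
    ("PF00003", "PF00004", "3did"),
    ("PF00005", "PF00006", "neg_method_a"),
    ("PF00007", "PF00008", "neg_method_a"),
    ("PF00009", "PF00010", "neg_method_b"),
]

# One split per negative-DDI method: positives plus that method's negatives.
train_a, test_a = random_split_by_source(
    rows, {"3did", "neg_method_a"}, test_fraction=0.25, seed=42
)
train_a_again, test_a_again = random_split_by_source(
    list(reversed(rows)), {"3did", "neg_method_a"}, test_fraction=0.25, seed=42
)
```

Running the same call once per method label would give one train/test pair per negative-DDI method, each free of the other method's negatives.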
    main:
    file_3did = file(url_3did)
    sqlite_3did = DOWNLOAD_3DID_SQLITE(file_3did).sqlite
    negatome_file = DOWNLOAD_NEGATOME(url_negatome).negatome

    def jsonGenerator = new groovy.json.JsonGenerator.Options()
        .excludeNulls()
        .addConverter(Path) { Path path -> path.toUriString() }
        .addConverter(Duration) { Duration duration -> duration.toMillis() }
        .addConverter(MemoryUnit) { MemoryUnit memory -> memory.toBytes() }
        .addConverter(nextflow.script.types.VersionNumber) { nextflow.script.types.VersionNumber version -> version.toString() }
        .build()

    "source_database_options": {
        "title": "Source database URLs",
        "title": "Source databases",
        "type": "object",
        "fa_icon": "fas fa-database",
        "description": "URLs of the public source databases used to assemble the domainsplit database.",
        "description": "Public source databases (URLs and local file paths) used to assemble the domainsplit database.",
        "required": ["hippie_tsv", "ppidm_tsv", "negative_ppi_parquet"],
        "properties": {

@copilot Give me one example to show me how I would need to change this `hippie_tsv` block in the schema to make that change:

    "hippie_tsv": {
        "type": ["string", "null"],
        "format": "file-path",
        "description": "Required path to a HIPPIE PPI TSV. COLLECT_DDI_DATA adds positive DDIs inferred from PPIs between two single-domain proteins.",
        "default": null,
        "fa_icon": "fas fa-file-import"
    },
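The review comment this question replies to is not shown here, so what "that change" refers to is unknown. One point that may be relevant to the quoted block: in plain JSON Schema, `required` only checks that a key is present, so a parameter typed `["string", "null"]` still validates when its value is `null`. The sketch below is a hand-rolled checker covering only `required` and `type`; it is not the nf-schema plugin and makes no claim about how that plugin treats null defaults. The path `data/hippie.tsv` is a made-up value.

```python
def type_matches(value, type_name):
    """Check a value against the two JSON Schema types used in this sketch."""
    if type_name == "null":
        return value is None
    if type_name == "string":
        return isinstance(value, str)
    raise ValueError(f"type not covered by this sketch: {type_name}")


def check_params(params, schema):
    """Return a list of error strings for `required` and `type` violations."""
    errors = []
    for name in schema.get("required", []):
        if name not in params:
            errors.append(f"{name}: missing required parameter")
    for name, spec in schema.get("properties", {}).items():
        if name not in params:
            continue
        allowed = spec["type"]
        if isinstance(allowed, str):
            allowed = [allowed]
        if not any(type_matches(params[name], t) for t in allowed):
            errors.append(f"{name}: expected type {allowed}")
    return errors


# Same shape as the quoted block: required, but null is an accepted type.
nullable_schema = {
    "required": ["hippie_tsv"],
    "properties": {"hippie_tsv": {"type": ["string", "null"]}},
}
# Hypothetical stricter variant: only a string is accepted.
string_only_schema = {
    "required": ["hippie_tsv"],
    "properties": {"hippie_tsv": {"type": "string"}},
}

null_under_nullable = check_params({"hippie_tsv": None}, nullable_schema)
null_under_string = check_params({"hippie_tsv": None}, string_only_schema)
path_under_string = check_params({"hippie_tsv": "data/hippie.tsv"}, string_only_schema)
```

Under these simplified semantics, the nullable schema accepts a `null` value without complaint, while the string-only variant reports it as a type error.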
PR checklist
- Lint (`nf-core pipelines lint`).
- Test profile (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- Debug profile (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).