New datasets#2

Merged
PelzKo merged 16 commits into dev from new_datasets on Jun 12, 2026

Conversation

@PelzKo PelzKo commented Jun 11, 2026

Contributor

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@github-actions

This PR is against the main branch ❌

  • Do not close this PR
  • Click Edit and change the base to dev
  • This CI test will remain failed until you push a new commit

Hi @PelzKo,

It looks like this pull request has been made against the daisybio/domainsplit main branch.
The main branch on nf-core repositories should always contain code from the latest release.
Because of this, PRs to main are only allowed if they come from the daisybio/domainsplit dev branch.

You do not need to close this PR, you can change the target branch to dev by clicking the "Edit" button at the top of this page.
Note that even after this, the test will continue to show as failing until you push a new commit.

Thanks again for your contribution!

github-actions Bot commented Jun 11, 2026

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit a09986f

+| ✅ 183 tests passed       |+
#| ❔  24 tests were ignored |#
!| ❗   1 tests had warnings |!
Details

❗ Test warnings:

  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).

❔ Tests ignored:

  • files_exist - File is ignored: conf/igenomes.config
  • files_exist - File is ignored: conf/igenomes_ignored.config
  • files_exist - File is ignored: assets/multiqc_config.yml
  • files_exist - File is ignored: CODE_OF_CONDUCT.md
  • files_exist - File is ignored: assets/nf-core-domainsplit_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-domainsplit_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-domainsplit_logo_dark.png
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/config.yml
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • nextflow_config - Config variable ignored: manifest.name
  • nextflow_config - Config variable ignored: manifest.homePage
  • files_unchanged - File ignored due to lint config: CODE_OF_CONDUCT.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • files_unchanged - File ignored due to lint config: assets/nf-core-domainsplit_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-domainsplit_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-domainsplit_logo_dark.png
  • files_unchanged - File ignored due to lint config: docs/README.md
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/domainsplit/domainsplit/.github/workflows/awstest.yml
  • multiqc_config - multiqc_config

✅ Tests passed:

Run details

  • nf-core/tools version 4.0.2
  • Run at 2026-06-12 10:50:34

Copilot AI left a comment

Pull request overview

This PR expands the Domainsplit pipeline’s dataset assembly and splitting logic by introducing additional DDI/PPI sources (HIPPIE, PPIDM), a revised negative-DDI generation flow (candidate-pool build + DANS selection + method-labeled insertion), and updated nf-core utility/module plumbing to support the new parameters and outputs.

Changes:

  • Refactors DDI collection into per-source insertion steps (3did, HIPPIE single-domain PPIs, PPIDM, Negatome) and adds a two-method DANS-based negative-DDI workflow.
  • Updates database/schema semantics to allow the same domain pair to coexist under multiple source labels and adjusts split strategies to run per negative-DDI method.
  • Updates nf-core utility subworkflows/modules (nf-schema plugin, params JSON dump, mmseqs module version reporting) and adds Python tests for the new negative selection/insertion logic.
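The schema change described above (one domain pair allowed under several source labels, with per-method filtering at split time) can be illustrated with a minimal sketch. The table name `ddi`, its columns, and the helper names `canonical_pair` and `insert_ddi` are hypothetical and chosen for illustration only; they are not taken from `bin/ddi_db_utils.py` or from the pipeline's actual schema.

```python
import sqlite3


def canonical_pair(domain_a: str, domain_b: str) -> tuple[str, str]:
    """Order a pair so that (A, B) and (B, A) map to the same key."""
    return (domain_a, domain_b) if domain_a <= domain_b else (domain_b, domain_a)


def insert_ddi(conn: sqlite3.Connection, domain_a: str, domain_b: str,
               label: int, source: str) -> bool:
    """Insert one pair; return True only if a new row was written."""
    a, b = canonical_pair(domain_a, domain_b)
    cur = conn.execute(
        "INSERT OR IGNORE INTO ddi (domain_a, domain_b, label, source) "
        "VALUES (?, ?, ?, ?)",
        (a, b, label, source),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
# Hypothetical schema: uniqueness is per (pair, source), not per pair alone.
conn.execute(
    """
    CREATE TABLE ddi (
        domain_a TEXT NOT NULL,
        domain_b TEXT NOT NULL,
        label    INTEGER NOT NULL,
        source   TEXT NOT NULL,
        UNIQUE (domain_a, domain_b, source)
    )
    """
)

results = [
    insert_ddi(conn, "PF00069", "PF00017", 1, "3did"),   # new row
    insert_ddi(conn, "PF00017", "PF00069", 1, "3did"),   # reversed duplicate, ignored
    insert_ddi(conn, "PF00017", "PF00069", 1, "ppidm"),  # same pair, other source, kept
]

# A split step can then restrict itself to one source label.
ppidm_rows = conn.execute(
    "SELECT domain_a, domain_b FROM ddi WHERE source = ?", ("ppidm",)
).fetchall()
```

With a `UNIQUE` constraint on the pair alone, the third insert would have been dropped; including `source` in the constraint is what lets the same pair coexist under both labels.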

Reviewed changes

Copilot reviewed 49 out of 59 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| workflows/domainsplit.nf | Wires new dataset parameters into the top-level workflow and updates ProtT5 param usage. |
| nextflow.config | Updates defaults and introduces required dataset inputs (HIPPIE/PPIDM/negative PPI parquet, seed, etc.). |
| nextflow_schema.json | Updates schema groups/required fields and adds new parameters. |
| conf/modules.config | Adjusts publishDir rules for new/renamed processes and publishes negative-method score TSV. |
| modules.json | Bumps tracked nf-core module/subworkflow SHAs. |
| subworkflows/local/collect_ddi_data/meta.yml | Documents inputs/outputs/components for the refactored DDI collection step. |
| subworkflows/local/collect_ddi_data/main.nf | Implements ordered per-source insertion + DANS negative construction (pool + select + insert). |
| subworkflows/local/curate_domains/meta.yml | Adds nf-core-style metadata for the curate-domains subworkflow. |
| subworkflows/local/enrich_ddi_database/meta.yml | Adds metadata describing enrichment inputs/outputs (incl. embeddings). |
| subworkflows/local/enrich_ddi_database/main.nf | Adjusts channel handling for the enriched DB output. |
| subworkflows/local/generate_embeddings/meta.yml | Adds metadata for the embedding-generation subworkflow. |
| subworkflows/local/generate_embeddings/main.nf | Updates documentation comment for ProtT5 parameter name. |
| subworkflows/local/split_domainsplit_database/meta.yml | Updates metadata to reflect new split strategy/component naming. |
| subworkflows/local/split_domainsplit_database/main.nf | Runs split strategies per negative-DDI method and adds external-validation test-set subsetting. |
| subworkflows/local/utils_nfcore_domainsplit_pipeline/meta.yml | Introduces metadata for pipeline init/completion wrapper subworkflow. |
| subworkflows/local/utils_nfcore_domainsplit_pipeline/main.nf | Updates nf-schema plugin invocation signature and cleans TODO comments. |
| subworkflows/nf-core/utils_nextflow_pipeline/main.nf | Updates parameter-dump JSON serialization and copy logic. |
| subworkflows/nf-core/utils_nfschema_plugin/meta.yml | Extends plugin subworkflow inputs with additional CLI/help knobs. |
| subworkflows/nf-core/utils_nfschema_plugin/main.nf | Adapts nf-schema help/summary/validation option handling and adds cli_typecast support. |
| subworkflows/nf-core/utils_nfschema_plugin/tests/nextflow.config | Bumps nf-schema plugin version used by tests. |
| subworkflows/nf-core/utils_nfschema_plugin/tests/main.nf.test | Updates test input arity for new plugin-subworkflow parameter(s). |
| modules/nf-core/mmseqs/easycluster/main.nf | Updates container-engine detection and switches to topic-based versions emission. |
| modules/nf-core/mmseqs/easycluster/meta.yml | Updates module metadata for new versions output/topic structure and YAML formatting. |
| modules/nf-core/mmseqs/easycluster/tests/main.nf.test.snap | Updates snapshots for new versions output format and Nextflow version. |
| modules/local/init_domainsplit_db/main.nf | Updates UNIQUE constraint to include source. |
| modules/local/random_ddi_split/main.nf | Adds source filtering for method-specific split generation. |
| modules/local/minimal_leakage_split/main.nf | Adds source filtering to leakage-aware splitter. |
| modules/local/external_validation_split/main.nf | New module to subset DDIs by source and prune resulting DB. |
| modules/local/external_validation_split/environment.yml | Conda env for the new external-validation subset module. |
| modules/local/insert_3did/main.nf | New process wrapper for inserting 3did positives via a Python helper. |
| modules/local/insert_3did/environment.yml | Conda env for insert_3did. |
| modules/local/insert_single_domain_ppi/main.nf | New process wrapper for HIPPIE single-domain PPI → positive DDI insertion. |
| modules/local/insert_single_domain_ppi/environment.yml | Conda env for insert_single_domain_ppi. |
| modules/local/insert_ppidm/main.nf | New process wrapper for PPIDM TSV insertion. |
| modules/local/insert_ppidm/environment.yml | Conda env for insert_ppidm. |
| modules/local/insert_negatome/main.nf | New process wrapper for inserting Negatome negatives. |
| modules/local/insert_negatome/environment.yml | Conda env for insert_negatome. |
| modules/local/remove_self_interactions/main.nf | New process to remove self-interactions from the SQLite DB. |
| modules/local/remove_self_interactions/environment.yml | Conda env for remove_self_interactions. |
| modules/local/swissprot_map/main.nf | New process to build SwissProt→Pfam map used by HIPPIE single-domain inference. |
| modules/local/swissprot_map/environment.yml | Conda env for swissprot_map builder. |
| modules/local/build_ppi_negative_pool/main.nf | Refactors old negative-DDI builder into deterministic candidate-pool dump step. |
| modules/local/build_ppi_negative_pool/environment.yml | Adds numpy dependency for pool dump. |
| modules/local/select_ppi_negative_dans/main.nf | New process for per-method DANS selection from the pool dump. |
| modules/local/select_ppi_negative_dans/environment.yml | Conda env for DANS selector (python/numpy). |
| modules/local/insert_ppi_negative_selection/main.nf | New process to insert both negative-method outputs + score TSV into SQLite. |
| modules/local/insert_ppi_negative_selection/environment.yml | Conda env for negative selection inserter. |
| modules/local/insert_ddis/main.nf | Removes the old monolithic INSERT_DDIS module. |
| bin/ddi_db_utils.py | Adds shared canonicalization + insertion helpers with cross-source dedup support. |
| bin/insert_3did.py | Implements 3did insertion using shared DB helpers. |
| bin/insert_single_domain_ppi.py | Implements HIPPIE single-domain inference/insertion using SwissProt map. |
| bin/insert_ppidm.py | Implements PPIDM TSV parsing/insertion with class-priority semantics. |
| bin/insert_negatome.py | Implements Negatome parsing/insertion using shared DB helpers. |
| bin/build_swissprot_pfam_map.py | Downloads/parses UniProt stream to create accession/name→Pfam mapping JSON. |
| bin/build_ppi_negative_pool.py | Builds deterministic candidate pool + method-specific statistics dump for DANS. |
| bin/select_ppi_negative_dans.py | Performs per-method uncapped DANS selection and emits pairs + score JSON. |
| bin/insert_ppi_negative_selection.py | Inserts method-labeled positives/negatives and consolidates score reporting. |
| tests/python/test_select_ppi_negative_dans.py | Adds unit-style checks for both DANS selection methods. |
| tests/python/test_insert_ppi_negative_selection.py | Adds integration-style check for schema + insertion behavior across sources. |

Comment on lines 46 to 50
main:
    file_3did = file(url_3did)
    sqlite_3did = DOWNLOAD_3DID_SQLITE(file_3did).sqlite
    negatome_file = DOWNLOAD_NEGATOME(url_negatome).negatome

Comment on lines +77 to +83
def jsonGenerator = new groovy.json.JsonGenerator.Options()
    .excludeNulls()
    .addConverter(Path) { Path path -> path.toUriString() }
    .addConverter(Duration) { Duration duration -> duration.toMillis() }
    .addConverter(MemoryUnit) { MemoryUnit memory -> memory.toBytes() }
    .addConverter(nextflow.script.types.VersionNumber) { nextflow.script.types.VersionNumber version -> version.toString() }
    .build()
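For readers less familiar with Groovy, the same idea (drop null entries, then convert types the JSON encoder cannot handle natively) can be sketched in Python. This is only a rough analogue of the snippet above, not the pipeline's code: the function names `convert`, `drop_nulls` and `dump_params` and the example parameter values are assumptions made for illustration.

```python
import json
from datetime import timedelta
from pathlib import PurePosixPath


def convert(value):
    """Fallback converter for types the json module cannot serialize itself."""
    if isinstance(value, PurePosixPath):
        return value.as_uri()  # only works for absolute paths
    if isinstance(value, timedelta):
        return int(value.total_seconds() * 1000)  # duration in milliseconds
    raise TypeError(f"cannot serialize {type(value).__name__}")


def drop_nulls(obj):
    """Recursively remove dict entries whose value is None."""
    if isinstance(obj, dict):
        return {k: drop_nulls(v) for k, v in obj.items() if v is not None}
    if isinstance(obj, list):
        return [drop_nulls(v) for v in obj]
    return obj


def dump_params(params: dict) -> str:
    """Serialize a parameter map to JSON using the converters above."""
    return json.dumps(drop_nulls(params), default=convert, indent=2, sort_keys=True)


# Hypothetical parameter map, for illustration only.
example = {
    "hippie_tsv": PurePosixPath("/data/hippie.tsv"),
    "max_time": timedelta(hours=2),
    "seed": 42,
    "email": None,
}
dumped = dump_params(example)
```

Registering one converter per type keeps the dump logic independent of which parameter happens to hold a path or a duration.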
Comment thread nextflow_schema.json
Comment on lines 39 to 45
  "source_database_options": {
-   "title": "Source database URLs",
+   "title": "Source databases",
    "type": "object",
    "fa_icon": "fas fa-database",
-   "description": "URLs of the public source databases used to assemble the domainsplit database.",
+   "description": "Public source databases (URLs and local file paths) used to assemble the domainsplit database.",
    "required": ["hippie_tsv", "ppidm_tsv", "negative_ppi_parquet"],
    "properties": {

Contributor Author

@copilot Give me one example to show me how I would need to change this hippie_tsv block in the schema to make that change:
"hippie_tsv": {
    "type": ["string", "null"],
    "format": "file-path",
    "description": "Required path to a HIPPIE PPI TSV. COLLECT_DDI_DATA adds positive DDIs inferred from PPIs between two single-domain proteins.",
    "default": null,
    "fa_icon": "fas fa-file-import"
},
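The review suggestion this question refers to is not shown on this page, so the following is only background on how `required` interacts with a nullable `type` in plain JSON Schema: `required` checks that the key is present, while `type` decides whether `null` is an acceptable value. The `validate` function is a hand-written stand-in that covers just these two keywords, and the two schema variants are illustrative assumptions. It is not nf-schema, which may treat null parameters differently.

```python
def validate(params: dict, schema: dict) -> list[str]:
    """Minimal stand-in for a JSON Schema validator: `required` and `type` only."""
    type_checks = {
        "string": lambda v: isinstance(v, str),
        "null": lambda v: v is None,
    }
    errors = []
    for name in schema.get("required", []):
        if name not in params:
            errors.append(f"{name}: missing required property")
    for name, spec in schema.get("properties", {}).items():
        if name not in params:
            continue
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if not any(type_checks[t](params[name]) for t in allowed):
            errors.append(f"{name}: expected {' or '.join(allowed)}")
    return errors


# Variant as quoted in the comment above: null is an allowed value.
nullable = {
    "required": ["hippie_tsv"],
    "properties": {"hippie_tsv": {"type": ["string", "null"]}},
}
# Hypothetical stricter variant: only a string is accepted.
strict = {
    "required": ["hippie_tsv"],
    "properties": {"hippie_tsv": {"type": "string"}},
}

null_under_nullable = validate({"hippie_tsv": None}, nullable)
null_under_strict = validate({"hippie_tsv": None}, strict)
missing_under_strict = validate({}, strict)
path_under_strict = validate({"hippie_tsv": "data/hippie.tsv"}, strict)
```

Under these semantics a key that is present with a `null` value satisfies `required` in the nullable variant, and only the stricter variant rejects it.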

Comment thread tests/python/test_insert_ppi_negative_selection.py Outdated
@PelzKo PelzKo merged commit dfb8697 into dev Jun 12, 2026
7 checks passed
@PelzKo PelzKo deleted the new_datasets branch June 12, 2026 10:53
2 participants