
feat(drug): download aact_extraction_batch_results in the drug step#202

Merged
d0choa merged 2 commits into main from move-aact-batch-to-drug
Jun 11, 2026

Conversation


@d0choa d0choa commented Jun 11, 2026

Contributor

Summary

Download the AACT clinical-trial batch extraction (aact_extraction_batch_results) in the drug step.

This input feeds chembl_molecule (opentargets/pts#142, tracking issue opentargets/issues#4414). It lives in its own standalone source — gs://ot-team/irene/clinical_mining/aact_extraction_batch_result/output — so the drug step copies it to input/clinical_report/aact_extraction_batch_results/ (the path both chembl_molecule and clinical_report read).

Because chembl_molecule already depends on pis_drug, this needs no new pts_chembl_molecule → pis_clinical_report DAG edge.
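The dependency argument above can be checked with a small reachability sketch. This is only an illustration: the `DEPENDENCIES` mapping and both helper functions are hypothetical, and the only edge taken from this PR is `pts_chembl_molecule` depending on `pis_drug`.

```python
# Hypothetical, simplified view of the unified pipeline DAG: each task maps to
# the set of tasks it depends on. Only the pts_chembl_molecule -> pis_drug edge
# comes from the PR description; the rest is illustrative.
DEPENDENCIES = {
    "pis_drug": set(),
    "pis_clinical_report": set(),
    "pts_chembl_molecule": {"pis_drug"},
}


def upstream(task, deps):
    """Return every task that must finish before `task` can start."""
    seen = set()
    stack = list(deps.get(task, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps.get(current, ()))
    return seen


def input_is_staged(consumer, producer, deps):
    """True if `producer` is guaranteed to run before `consumer`."""
    return producer in upstream(consumer, deps)
```

Under these assumptions, `input_is_staged("pts_chembl_molecule", "pis_drug", DEPENDENCIES)` holds without an edge to `pis_clinical_report`, which is why moving the download into the drug step leaves the DAG unchanged.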

What changed

config.yaml, drug step only: adds a copy_many that copies …/aact_extraction_batch_result/output/* to input/clinical_report/aact_extraction_batch_results/. The clinical_report step is unchanged (the batch is no longer nested under its input tree).
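As a rough illustration of the copy described above, the sketch below maps each object directly under the source folder to the same file name under the destination. It is not the PIS `copy_many` implementation: `plan_copies` is a hypothetical helper, the object names in the usage note are made up, and matching only direct children of the source folder is an assumption about how the `*` glob behaves.

```python
# Source and destination taken from the PR description.
SOURCE_PREFIX = (
    "gs://ot-team/irene/clinical_mining/aact_extraction_batch_result/output/"
)
DESTINATION_PREFIX = "input/clinical_report/aact_extraction_batch_results/"


def plan_copies(object_names, source_prefix, destination_prefix):
    """Map each object directly under `source_prefix` to its destination path.

    Hypothetical helper: objects outside the source folder, or nested in
    subfolders of it, are skipped (assumed semantics of a trailing `*`).
    """
    plan = {}
    for name in object_names:
        if not name.startswith(source_prefix):
            continue
        relative = name[len(source_prefix):]
        # Keep only direct children: non-empty and without further slashes.
        if relative and "/" not in relative:
            plan[name] = destination_prefix + relative
    return plan
```

For example, a made-up object `…/output/part-0.parquet` would be planned as `input/clinical_report/aact_extraction_batch_results/part-0.parquet`, the path that both chembl_molecule and clinical_report read from.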

Companion changes

d0choa added a commit to opentargets/orchestration that referenced this pull request Jun 11, 2026
…_molecule dep

Mirror opentargets/pis#202: download aact_extraction_batch_results in the PIS
drug step (clinical_report glob split into top-level / aact / chembl subtrees to
exclude it). Since pts_chembl_molecule already depends on pis_drug, revert the
earlier pts_chembl_molecule -> pis_clinical_report edge — the DAG dependencies
stay as they were.
d0choa force-pushed the move-aact-batch-to-drug branch from 6fe831d to 6824ed6 on June 11, 2026 09:37
The AACT clinical-trial batch extraction feeds chembl_molecule (opentargets/pts#142,
tracking opentargets/issues#4414). It lives in its own standalone source
(gs://ot-team/irene/clinical_mining/aact_extraction_batch_result/output), so the
drug step copies it (via copy_many) to input/clinical_report/aact_extraction_batch_results.

chembl_molecule already depends on pis_drug, so this needs no
pts_chembl_molecule -> pis_clinical_report DAG edge. The clinical_report step is
unchanged (the batch is no longer nested under its input tree).
d0choa force-pushed the move-aact-batch-to-drug branch from 6824ed6 to 5ac4a3f on June 11, 2026 09:54
DSuveges pushed a commit to opentargets/orchestration that referenced this pull request Jun 11, 2026
…_molecule dep

Mirror opentargets/pis#202: download aact_extraction_batch_results in the PIS
drug step (clinical_report glob split into top-level / aact / chembl subtrees to
exclude it). Since pts_chembl_molecule already depends on pis_drug, revert the
earlier pts_chembl_molecule -> pis_clinical_report edge — the DAG dependencies
stay as they were.
@d0choa d0choa merged commit 2360341 into main Jun 11, 2026
3 checks passed
@d0choa d0choa deleted the move-aact-batch-to-drug branch June 11, 2026 16:42
DSuveges pushed a commit to opentargets/orchestration that referenced this pull request Jun 17, 2026
…_molecule dep

Mirror opentargets/pis#202: download aact_extraction_batch_results in the PIS
drug step (clinical_report glob split into top-level / aact / chembl subtrees to
exclude it). Since pts_chembl_molecule already depends on pis_drug, revert the
earlier pts_chembl_molecule -> pis_clinical_report edge — the DAG dependencies
stay as they were.
project-defiant pushed a commit to opentargets/orchestration that referenced this pull request Jun 17, 2026
* chore: update data and software versions

* chore: add clinical_report llm dep

* chore: update pis paths

* fix: split ontoma into two steps to avoid circular issue

* fix: gentropy version typo

* chore: download essentiality from depmap directly

* fix: point `pts_literature_publication_match` to `pts_ontoma_literature`

* revert: essentiality task cannot pull from depmap

* fix: update openfda config

* fix: add missing target dep

* fix: add missing dep for search_facet

* fix: add string_version as a pts env variable

* fix: remove qc flags from drug_molecule

* fix: typo

* fix: update essentiality filename

* fix: add pts_target to pts_evidence_postprocess_clinical_precedence deps

* chore: update pts

* fix: split openfda subtasks into independent tasks (spark job goes idle)

* fix: baseline_expression step only triggers a single spark job

* fix: add pis_heritability to unified dag

* fix: update score expression for some sources

* chore: rename pts_ontoma_literature to run on literature cluster

* perf: improve pts cluster settings to allow parallel jobs

* perf: improve pts cluster settings to allow parallel jobs

* fix: typo

* fix(gentropy): add interactions

* fix: baseline expression path typo

* chore: avoid preemptible secondary workers in literature cluster

* revert: baseline expression path typo

* chore: uncomment metrics

* fix(epmc): evidence format is parquet

* chore: rename baseline_expression_aggregated to baseline_expression

* chore(l2g): set `train_on_full_dataset` to false

* chore: update pts to check target fix

* chore: update pts to check target fix

* chore(pis): updating PanelApp data source for 2026.05.11 release

- The file has the same schema and identical format
- The new release has 33k fewer lines, which might indicate we are not getting all ratings. It might not impact the number of evidence and associations at the end.

* chore: bump chembl_version to 37

* chore: retire ETL stage from unified pipeline

The ETL stage in the unified pipeline DAG has had zero step consumers
since PR #195. Remove the now-orphan configuration, loaders, DAG stage
function, and supporting operator/enum entries:

- clusters.yaml: drop the `etl` and `etl_literature` clusters plus the
  `step_job_properties.etl` block.
- unified_pipeline.yaml: drop `etl_version` and the `etl_literature`
  step entry.
- etl.conf: deleted (no longer loaded).
- config/unified_pipeline.py: drop the `etl` AppConfig loader (and its
  PPP overlay), `etl_version`/`etl_jar_origin_uri`, the now-unused
  `jar_uri()` helper, and the `exts` map in `config_uri()`.
- dags/unified_pipeline.py: drop the `etl_stage()` function, its call,
  and the imports that only it used (`ETLJobBuilder`, `CopyBlobOperator`,
  `to_hocon`).
- operators/dataproc.py: delete `ETLJobBuilder`.
- models/step.py: drop `UnifiedPipelineStage.ETL`.
- operators/diff.py: refresh docstring examples that referenced the
  removed `etl_stage` task IDs.

* chore: enable pts_association_timeseries_view for non-PPP runs

* refactor: delete etl config

* chore(pts): propagating changes evidence_clinical_precedence config

* fix: revert testing output

* chore(uv): update lockfile

* fix(pts): literature config added

* fix(ot_crispr): study table is now exported in csv

* feat(pts): wire aact_extraction_batch_results into chembl_molecule

Mirror the PTS config change (opentargets/pts#142): the chembl_molecule step
now reads input/clinical_report/aact_extraction_batch_results to mine
clinical-trial (AACT) synonyms. That input is staged by pis_clinical_report, so
pts_chembl_molecule now also depends on it in the unified pipeline (otherwise
the DAG could run chembl_molecule before the AACT batch is present).

* refactor(pis): move aact batch download to drug step, drop the chembl_molecule dep

Mirror opentargets/pis#202: download aact_extraction_batch_results in the PIS
drug step (clinical_report glob split into top-level / aact / chembl subtrees to
exclude it). Since pts_chembl_molecule already depends on pis_drug, revert the
earlier pts_chembl_molecule -> pis_clinical_report edge — the DAG dependencies
stay as they were.

* refactor(pis): point aact glob at standalone source, drop clinical_report split

* refactor(pis): use copy_many for the aact batch download

* chore: bump pis_version to 26.06.0-dev.2 and pts_version to 26.06.0-dev.4

* chore: remove unnecessary flag

* chore: configuration updates

* chore(pts): migrate partition_count configs from pts repo

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: update for new run

* fix(colocalisation): fixing gentropy tag (v3.3.0-dev.56) for cluster

* fix(clinical_target): remove 'UNVALIDATED_INDICATION' flag

Removed 'UNVALIDATED_INDICATION' from invalid clinical report QC settings.

* fix(credible_set): add `pts_target` as dependency for `isTransQtl` 

@DSuveges This change was uncommitted in Airflow. Can you confirm this is correct?

* chore(metrics): add pts_clinical_target as dependency

---------

Co-authored-by: Irene Lopez <irene.lopezs@protonmail.com>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
Co-authored-by: root <root@inst-builder-debian-11-build-build-8rm9w.europe-west4-b.c.gce-image-builder.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Irene López Santiago <45119610+ireneisdoomed@users.noreply.github.com>
