feat(chembl_molecule): labelled synonyms/tradeNames + AACT clinical-trial synonym feed #142
_prepare_drug_list now projects synonyms.label / tradeNames.label so all inputs to the array() flatten are array<string>, matching the updated chembl_molecule/drug_molecule struct schema. Test fixtures updated to the struct schema to catch this class of regression in future.
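The projection described above can be illustrated with a plain-Python analogue (not the PySpark code from this PR). The `LabelledName` type and `flatten_labels` helper are hypothetical names introduced only for this sketch; they mirror the `struct<label, source>` element and the "project `.label`, then flatten" step.

```python
from typing import Optional, TypedDict


class LabelledName(TypedDict):
    """Hypothetical in-memory mirror of one struct<label, source> element."""

    label: str
    source: str


def flatten_labels(*columns: Optional[list[LabelledName]]) -> list[str]:
    """Project .label out of each struct array, then flatten into one list of strings.

    Null (None) columns are skipped, so every input to the flatten is a list of strings.
    """
    return [item["label"] for column in columns if column for item in column]


synonyms = [
    {"label": "aspirin", "source": "ChEMBL"},
    {"label": "asa", "source": "AACT"},
]
trade_names = [{"label": "Bayer Aspirin", "source": "ChEMBL"}]

# A missing column (None) is tolerated alongside populated ones.
names = flatten_labels(synonyms, trade_names, None)
```

Projecting the label first keeps downstream consumers unchanged: they still see a flat list of strings, regardless of the provenance carried in `source`.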
…nyms module
Move the clinical-trial synonym mining out of chembl_molecule.py into a
dedicated aact_synonyms.py, with the shared {label, source} struct primitives
in common/labels.py (avoids an import cycle). chembl_molecule.py drops from
731 to 362 lines; the mining tests move to test_aact_synonyms.py.
Also fixes the entry-id determinism in _anchor_candidates: the anchoring used
f.monotonically_increasing_id(), which is nondeterministic across re-evaluations
and could let the poisoned/anchors branches see inconsistent ids under Spark
adaptive re-planning. Replaced with a deterministic sha2 key over
(nct_id + sorted member set). No behaviour change in tests; matters at scale.
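A minimal sketch of the deterministic-key idea, written in plain Python rather than Spark. The function name `entry_id`, the SHA-256 digest size, and the unit-separator delimiter are assumptions made for illustration; the commit only states that the key is a sha2 hash over the trial id plus the sorted member set.

```python
import hashlib


def entry_id(nct_id: str, members: set[str]) -> str:
    """Deterministic key: SHA-256 over the trial id plus the sorted member set."""
    # Sorting makes the key independent of set iteration order. The separator
    # (an assumption of this sketch) prevents ("ab", "c") and ("a", "bc")
    # from producing the same payload.
    payload = "\x1f".join([nct_id, *sorted(members)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same inputs always give the same id, however often they are re-evaluated.
first = entry_id("NCT00000001", {"drug b", "drug a"})
second = entry_id("NCT00000001", {"drug a", "drug b"})
```

Unlike a monotonically increasing id, a content-derived key depends only on the row's values, so two branches of a plan that recompute it independently still agree.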
Move labels.py and aact_synonyms.py into src/pts/pyspark/drug_utils/ (out of the pyspark root and common/) so the drug-specific helpers live together. Imports in chembl_molecule and the tests updated accordingly. No behaviour change.
✅ Validated end-to-end on Dataproc (run-001)
Ran the …
Schema: both columns now …
The "many molecules with synonyms from two sources" goal is met (2,348). The 8,038 pairs are the same order of magnitude as the experiment's design target (~few thousand at …
Spot-check of mined synonyms (all correct):
Operational notes:
Codex review caught a real gap: rule #8 rewrote a descriptor phrase (e.g. "akt inhibitor mk2206") to its bare code AFTER anchoring/status classification, so the rewritten code carried a stale status. A code that is actually a parent/child of the anchor (canonical vs salt) or already on the anchor slipped through as NOVEL and could be added. Move code extraction into a dedicated _rewrite_and_reclassify_codes stage (between anchoring and cleanup) that re-resolves the rewritten code against the ChEMBL name index: drops it if it is now on the anchor or newly over-ambiguous, and recomputes NOVEL/PARENT_CHILD/CONFLICT so parent/child codes are dropped downstream. _apply_cleanup_rules no longer does extraction. CONFLICT is still kept (per the approved spec). New tests cover the redundant/parent-child/conflict reclassification paths.
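The reclassification logic described above could look roughly like the following plain-Python sketch. Everything here is an assumption for illustration: the `reclassify_code` name, the shape of `name_index` and `family`, the `max_ambiguity` threshold, and the reading of NOVEL as "the code resolves to no known molecule". The actual `_rewrite_and_reclassify_codes` stage operates on Spark DataFrames and may differ in detail.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    NOVEL = "NOVEL"
    PARENT_CHILD = "PARENT_CHILD"
    CONFLICT = "CONFLICT"


def reclassify_code(
    code: str,
    anchor_id: str,
    name_index: dict[str, set[str]],  # hypothetical: lower-cased name -> molecule ids
    family: dict[str, set[str]],  # hypothetical: molecule id -> parent/child ids
    max_ambiguity: int = 3,  # hypothetical over-ambiguity threshold
) -> Optional[Status]:
    """Re-resolve a rewritten code; return its new status, or None to drop it."""
    hits = name_index.get(code.lower(), set())
    if anchor_id in hits:
        return None  # already a name of the anchor: redundant
    if len(hits) > max_ambiguity:
        return None  # newly over-ambiguous
    if not hits:
        return Status.NOVEL  # unknown to the index
    if hits & family.get(anchor_id, set()):
        return Status.PARENT_CHILD  # e.g. canonical vs salt; dropped downstream
    return Status.CONFLICT  # resolves to an unrelated molecule; kept per the spec


# Toy index with made-up molecule ids M1 (anchor), M2 (its salt form), M9 (unrelated).
name_index = {"mk2206": {"M2"}, "aspirin": {"M1"}, "othercode": {"M9"}}
family = {"M1": {"M2"}}
```

The point of running this after the rewrite is that the status always describes the string that will actually be added, not the descriptor phrase it was extracted from.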
Codex review (gpt-5.5, high effort): outcome
Ran a Codex review of the branch. No Critical findings; it confirmed no remaining …
Fixed: descriptor-code reclassification
Codex correctly flagged that rule #8 rewrote a descriptor phrase to its bare R&D code ( …
Fixed in …
Investigated: keeping CONFLICT (no change)
Codex questioned keeping …
The CONFLICT-like cases are overwhelmingly same-drug-family variants ChEMBL models as separate molecules without a …
Decision: keep CONFLICT as-is. A higher …
DSuveges
left a comment
Everything seems sensible to me.
The AACT clinical-trial batch extraction feeds chembl_molecule (opentargets/pts#142, tracking opentargets/issues#4414). It lives in its own standalone source (gs://ot-team/irene/clinical_mining/aact_extraction_batch_result/output), so the drug step downloads it to input/clinical_report/aact_extraction_batch_results. chembl_molecule already depends on pis_drug, so this needs no pts_chembl_molecule -> pis_clinical_report DAG edge. The clinical_report step is unchanged (the batch is no longer nested under its input tree).
The AACT clinical-trial batch extraction feeds chembl_molecule (opentargets/pts#142, tracking opentargets/issues#4414). It lives in its own standalone source (gs://ot-team/irene/clinical_mining/aact_extraction_batch_result/output), so the drug step copies it (via copy_many) to input/clinical_report/aact_extraction_batch_results. chembl_molecule already depends on pis_drug, so this needs no pts_chembl_molecule -> pis_clinical_report DAG edge. The clinical_report step is unchanged (the batch is no longer nested under its input tree).
…ields (#47)
drug_molecule's `synonyms` and `tradeNames` changed from array<string> to array<struct<label, source>> (opentargets/pts#142, tracking opentargets/issues#4414). The Parquet introspection picks up the new subfields automatically, but they had no curation, so they would publish with PLACEHOLDER descriptions. Add curation entries for synonyms/label, synonyms/source, tradeNames/label and tradeNames/source, mirroring the existing crossReferences/ids + /source pattern.
Mirror the PTS config change (opentargets/pts#142): the chembl_molecule step now reads input/clinical_report/aact_extraction_batch_results to mine clinical-trial (AACT) synonyms. That input is staged by pis_clinical_report, so pts_chembl_molecule now also depends on it in the unified pipeline (otherwise the DAG could run chembl_molecule before the AACT batch is present).
ireneisdoomed
left a comment
Impact on latest clinical mining data: + 3048 drug/disease pairs fully mapped (240 new drugs, 943 diseases)
Great addition, thanks :)
…202)
* feat(drug): download aact_extraction_batch_results in the drug step
  The AACT clinical-trial batch extraction feeds chembl_molecule (opentargets/pts#142, tracking opentargets/issues#4414). It lives in its own standalone source (gs://ot-team/irene/clinical_mining/aact_extraction_batch_result/output), so the drug step copies it (via copy_many) to input/clinical_report/aact_extraction_batch_results. chembl_molecule already depends on pis_drug, so this needs no pts_chembl_molecule -> pis_clinical_report DAG edge. The clinical_report step is unchanged (the batch is no longer nested under its input tree).
* chore: bump version to 26.06.6
…arrays
drug_molecule's `synonyms` and `tradeNames` changed from array<string> to array<struct<label, source>> (opentargets/pts#142, tracking opentargets/issues#4414). The drug_log ClickHouse table ingests output/drug_molecule directly, so its column DDL must match the parquet: loading the struct arrays into Array(String) columns would fail. Update both to Array(Tuple(label String, source String)), mirroring the existing crossReferences tuple. The final drug table (postload SELECT *) inherits the new types, so no other change is needed.
drug_molecule's `synonyms` and `tradeNames` changed from array<string> to array<struct<label, source>> (opentargets/pts#142, tracking opentargets/issues#4414), and the POS ClickHouse drug table now stores Array(Tuple(label, source)) (opentargets/pos#130). Read both as Seq[LabelAndSource] (the existing type Target already uses): the ClickHouse JSON read path parses the named tuple as a {label, source} object, the same way crossReferences (Tuple(source, ids)) already works. The Drug GraphQL type is derived, so the field type auto-updates to [LabelAndSource]; the synonyms/tradeNames field docs are updated to describe the provenance.
* chore: update data and software versions
* chore: add clinical_report llm dep
* chore: update pis paths
* fix: split ontoma into two steps to avoid circular issue
* fix: gentropy version typo
* chore: download essentiality from depmap directly
* fix: point `pts_literature_publication_match` to `pts_ontoma_literature`
* revert: essentiality task cannot pull from depmap
* fix: update openfda config
* fix: add missing target dep
* fix: add missing dep for search_facet
* fix: add string_version as a pts env variable
* fix: remove qc flags from drug_molecule
* fix: typo
* fix: update essentiality filename
* fix: add pts_target to pts_evidence_postprocess_clinical_precedence deps
* chore: update pts
* fix: split openfda subtasks into independent tasks (spark job goes idle)
* fix: baseline_expression step only triggers a single spark job
* fix: add pis_heritability to unified dag
* fix: update score expression for some sources
* chore: rename pts_ontoma_literature to run on literature cluster
* perf: improve pts cluster settings to allow parallel jobs
* perf: improve pts cluster settings to allow parallel jobs
* fix: typo
* fix(gentropy): add interactions
* fix: baseline expression path typo
* chore: avoid preemptible secondary workers in literature cluster
* revert: baseline expression path typo
* chore: uncomment metrics
* fix(epmc): evidence format is parquet
* chore: rename baseline_expression_aggregated to baseline_expression
* chore(l2g): set `train_on_full_dataset` to false
* chore: update pts to check target fix
* chore: update pts to check target fix
* chore(pis): updating PanelApp data source for 2026.05.11 release
  - The file has the same schema and identical format
  - The new release has 33k fewer lines, which might indicate we are not getting all ratings. It might not impact the number of evidence and associations at the end.
* chore: bump chembl_version to 37
* chore: retire ETL stage from unified pipeline
  The ETL stage in the unified pipeline DAG has had zero step consumers since PR #195. Remove the now-orphan configuration, loaders, DAG stage function, and supporting operator/enum entries:
  - clusters.yaml: drop the `etl` and `etl_literature` clusters plus the `step_job_properties.etl` block.
  - unified_pipeline.yaml: drop `etl_version` and the `etl_literature` step entry.
  - etl.conf: deleted (no longer loaded).
  - config/unified_pipeline.py: drop the `etl` AppConfig loader (and its PPP overlay), `etl_version`/`etl_jar_origin_uri`, the now-unused `jar_uri()` helper, and the `exts` map in `config_uri()`.
  - dags/unified_pipeline.py: drop the `etl_stage()` function, its call, and the imports that only it used (`ETLJobBuilder`, `CopyBlobOperator`, `to_hocon`).
  - operators/dataproc.py: delete `ETLJobBuilder`.
  - models/step.py: drop `UnifiedPipelineStage.ETL`.
  - operators/diff.py: refresh docstring examples that referenced the removed `etl_stage` task IDs.
* chore: enable pts_association_timeseries_view for non-PPP runs
* refactor: delete etl config
* chore(pts): propagating changes evidence_clinical_precedence config
* fix: revert testing output
* chore(uv): update lockfile
* fix(pts): literature config added
* fix(ot_crispr): study table is now exported in csv
* feat(pts): wire aact_extraction_batch_results into chembl_molecule
  Mirror the PTS config change (opentargets/pts#142): the chembl_molecule step now reads input/clinical_report/aact_extraction_batch_results to mine clinical-trial (AACT) synonyms. That input is staged by pis_clinical_report, so pts_chembl_molecule now also depends on it in the unified pipeline (otherwise the DAG could run chembl_molecule before the AACT batch is present).
* refactor(pis): move aact batch download to drug step, drop the chembl_molecule dep
  Mirror opentargets/pis#202: download aact_extraction_batch_results in the PIS drug step (clinical_report glob split into top-level / aact / chembl subtrees to exclude it). Since pts_chembl_molecule already depends on pis_drug, revert the earlier pts_chembl_molecule -> pis_clinical_report edge; the DAG dependencies stay as they were.
* refactor(pis): point aact glob at standalone source, drop clinical_report split
* refactor(pis): use copy_many for the aact batch download
* chore: bump pis_version to 26.06.0-dev.2 and pts_version to 26.06.0-dev.4
* chore: remove unnecessary flag
* chore: configuration updates
* chore(pts): migrate partition_count configs from pts repo
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: update for new run
* fix(colocalisation): fixing gentropy tag (v3.3.0-dev.56) for cluster
* fix(clinical_target): remove 'UNVALIDATED_INDICATION' flag
  Removed 'UNVALIDATED_INDICATION' from invalid clinical report QC settings.
* fix(credible_set): add `pts_target` as dependency for `isTransQtl`
  @DSuveges This change was uncommitted in Airflow. Can you confirm this is correct?
* chore(metrics): add pts_clinical_target as dependency

---------
Co-authored-by: Irene Lopez <irene.lopezs@protonmail.com>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
Co-authored-by: root <root@inst-builder-debian-11-build-build-8rm9w.europe-west4-b.c.gce-image-builder.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Irene López Santiago <45119610+ireneisdoomed@users.noreply.github.com>
Tracking issue: opentargets/issues#4414
Summary
Adds source provenance to drug synonyms and a second synonym source mined from clinical trials.
- `chembl_molecule`'s `synonyms` and `tradeNames` change from `array<string>` to `array<struct<label, source>>` (mirroring `output/target`). Every existing ChEMBL name is tagged `source: "ChEMBL"`. The struct propagates through `drug_molecule` (pure passthrough) to the final `output/drug_molecule`.
- … `chembl_molecule.py`) of the prior `work/clinical_pairs/` experiment, integrated as a real pipeline source:
- … `aact_extraction_batch_results`) → normalized drug member sets per trial
- … nystatin/cellcept survive, strength/%, single-char, regimen suppression, descriptor-code extraction, plural suppression) and `NOVEL`/`PARENT_CHILD`/`CONFLICT` status
- … `{label, source: "AACT"}`, deduped case-insensitively against existing ChEMBL labels
- … `name` fallback only ever picks a ChEMBL-source synonym, so an AACT label never becomes a molecule's display name
- … `search.py` flattens `synonyms.label` / `tradeNames.label`; `openfda.py` updated to extract `.label` before flattening (it consumes `drug_molecule` and would otherwise break on the struct schema; its stale `array<string>` test fixtures were also updated).
- … `aact_extraction_batch_results: input/clinical_report/aact_extraction_batch_results` wired into the `chembl_molecule` step (the same PIS-provided input `clinical_report` already uses; upstream of everything, no dependency cycle).

End state: many molecules carry synonyms from two sources (`ChEMBL` + `AACT`).

Design & plan live in the Obsidian vault: `Areas/Work/Projects/Project-PTS/specs/2026-06-10-chembl-molecule-labelled-synonyms-aact-design.md` (+ `-plan.md`).

Test Plan
- `make test`: full suite green (396 passed)
- `uv run ruff check`: clean
- … `n_trials` gate, case-insensitive merge dedup, end-to-end two-source molecule, openfda struct fix
- … `il/26.06.0-dev0`, ~5 min): 2,348 molecules with both ChEMBL + AACT synonyms, 8,038 `(id, label)` pairs at `n_trials ≥ 2`, 0 non-ChEMBL tradeNames; see the validation comment below for the full table

Notes
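As a rough illustration of the merge behaviour listed in the Summary and Test Plan (the `n_trials` gate, the case-insensitive dedup against ChEMBL labels, and the ChEMBL-only name fallback), here is a plain-Python sketch. The function names `merge_synonyms` and `fallback_name` and the list/tuple input shapes are hypothetical; only the `n_trials ≥ 2` threshold and the `{label, source}` shape come from the PR text, and the real step runs in PySpark.

```python
from typing import Optional


def merge_synonyms(
    chembl_labels: list[str],
    mined: list[tuple[str, int]],  # hypothetical shape: (label, n_trials)
    min_trials: int = 2,
) -> list[dict[str, str]]:
    """Tag ChEMBL names, then append mined AACT labels that pass the gate."""
    merged = [{"label": label, "source": "ChEMBL"} for label in chembl_labels]
    seen = {label.casefold() for label in chembl_labels}
    for label, n_trials in mined:
        key = label.casefold()
        # Gate on trial support, and skip anything already present in any casing.
        if n_trials >= min_trials and key not in seen:
            seen.add(key)
            merged.append({"label": label, "source": "AACT"})
    return merged


def fallback_name(synonyms: list[dict[str, str]]) -> Optional[str]:
    """Pick a display-name fallback from ChEMBL-source synonyms only."""
    for synonym in synonyms:
        if synonym["source"] == "ChEMBL":
            return synonym["label"]
    return None


merged = merge_synonyms(
    ["Aspirin"],
    [("ASPIRIN", 5), ("asa", 2), ("rare name", 1), ("ASA", 3)],
)
```

Tracking the seen keys while appending means a mined label cannot duplicate either a ChEMBL name or an earlier mined label that differs only in casing.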