
feat(chembl_molecule): labelled synonyms/tradeNames + AACT clinical-trial synonym feed#142

Merged
d0choa merged 18 commits into main from chembl-labelled-synonyms-aact
Jun 11, 2026


@d0choa d0choa commented Jun 10, 2026

Tracking issue: opentargets/issues#4414

Summary

Adds source provenance to drug synonyms and a second synonym source mined from clinical trials.

  • Schema change: chembl_molecule's synonyms and tradeNames change from array<string> to array<struct<label, source>> (mirroring output/target). Every existing ChEMBL name is tagged source: "ChEMBL". The struct propagates through drug_molecule (pure passthrough) to the final output/drug_molecule.
  • AACT clinical-trial synonym feed — a PySpark port (inline in chembl_molecule.py) of the prior work/clinical_pairs/ experiment, integrated as a real pipeline source:
    • parse the OpenAI batch JSONL (aact_extraction_batch_results) → normalized drug member sets per trial
    • anchor members against an in-step ChEMBL name index (with a ≤10 ambiguity cap)
    • apply the experiment's 11 cleanup rules (control noise, drug-class/cell-therapy via word-boundary matching so real drugs like nystatin/cellcept survive, strength/%, single-char, regimen suppression, descriptor-code extraction, plural suppression) and NOVEL/PARENT_CHILD/CONFLICT status
    • keep candidates seen in ≥ 2 distinct trials, append as {label, source: "AACT"}, deduped case-insensitively against existing ChEMBL labels
    • the name fallback only ever picks a ChEMBL-source synonym, so an AACT label never becomes a molecule's display name
  • Downstream: search.py flattens synonyms.label/tradeNames.label; openfda.py is updated to extract .label before flattening (it consumes drug_molecule and would otherwise break on the struct schema; its stale array<string> test fixtures were also updated).
  • Config: aact_extraction_batch_results: input/clinical_report/aact_extraction_batch_results is wired into the chembl_molecule step (the same PIS-provided input clinical_report already uses; upstream of everything, no dependency cycle).
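
For readers who want the gate-and-merge step from the bullets above in concrete form, here is a minimal plain-Python sketch. It is not the PySpark code in this PR. The function names, the input shapes, and the use of `casefold()` as a stand-in for the real label normalization are all assumptions made for illustration.

```python
from collections import defaultdict

MIN_TRIALS = 2  # the ">= 2 distinct trials" gate described above


def labelled(text, source):
    """Build one {label, source} struct (hypothetical helper name)."""
    return {"label": text, "source": source}


def merge_aact_synonyms(chembl_synonyms, aact_hits, min_trials=MIN_TRIALS):
    """Tag ChEMBL names, then append AACT candidates that pass the trial gate.

    chembl_synonyms: list of plain strings already on the molecule.
    aact_hits: iterable of (nct_id, candidate_label) pairs (assumed shape).
    """
    merged = [labelled(s, "ChEMBL") for s in chembl_synonyms]
    # Case-insensitive view of what the molecule already has.
    seen = {s.casefold() for s in chembl_synonyms}

    # Count DISTINCT trials per normalized candidate (assumption: casefold
    # stands in for the pipeline's real normalization).
    trials = defaultdict(set)
    for nct_id, candidate in aact_hits:
        trials[candidate.casefold()].add(nct_id)

    for candidate in sorted(trials):
        if len(trials[candidate]) >= min_trials and candidate not in seen:
            merged.append(labelled(candidate, "AACT"))
            seen.add(candidate)
    return merged
```

In this sketch a candidate seen twice in the same trial still counts as one trial, and a candidate that only differs in case from an existing ChEMBL name is never appended.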

End state: molecules can carry synonyms from two sources (ChEMBL + AACT); the run-001 validation below found 2,348 molecules that do.

Design & plan live in the Obsidian vault: Areas/Work/Projects/Project-PTS/specs/2026-06-10-chembl-molecule-labelled-synonyms-aact-design.md (+ -plan.md).

Test Plan

  • make test — full suite green (396 passed)
  • uv run ruff check — clean
  • Unit coverage: struct schema, batch parse (incl. malformed-line drop), anchor indexes, anchoring + status, all cleanup rules (incl. word-boundary guard), n_trials gate, case-insensitive merge dedup, end-to-end two-source molecule, openfda struct fix
  • End-to-end Dataproc run (run-001, il/26.06.0-dev0, ~5 min): 2,348 molecules with both ChEMBL + AACT synonyms, 8,038 (id, label) pairs at n_trials ≥ 2, 0 non-ChEMBL tradeNames — see the validation comment below for the full table
  • Scale check: anchoring ran fine on a small 2-worker cluster; embedded-newline AACT object names were a non-issue for Spark's GCS connector
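
As an illustration of the word-boundary guard listed in the unit coverage above, the following sketch shows why substring matching would wrongly drop real drugs and how `\b` anchors avoid that. The class terms `statin` and `cell` are hypothetical examples chosen to match the nystatin/cellcept cases named in the summary; the actual term list and regex in aact_synonyms.py may differ.

```python
import re

# Hypothetical drug-class / cell-therapy terms, for illustration only.
CLASS_TERMS = ("statin", "cell")

# Match a term only as a whole word, with an optional plural "s".
_CLASS_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in CLASS_TERMS) + r")s?\b",
    re.IGNORECASE,
)


def is_class_noise(label):
    """True when the label contains a class term as a standalone word."""
    return _CLASS_RE.search(label) is not None


def is_class_noise_naive(label):
    """Substring version, shown only to contrast with the guarded one."""
    lowered = label.lower()
    return any(term in lowered for term in CLASS_TERMS)
```

The naive version flags nystatin and cellcept because they contain the class terms as substrings; the guarded version only flags labels where the term stands alone.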

Notes

  • This integrates the clinical-trial synonym feed only. The earlier Probes&Drugs synonym-expansion effort is intentionally out of scope.
  • Stored AACT labels are the normalized form in v1; a "most-frequent surface form" refinement is a noted follow-up.

d0choa added 14 commits June 10, 2026 13:46
_prepare_drug_list now projects synonyms.label / tradeNames.label so all
inputs to the array() flatten are array<string>, matching the updated
chembl_molecule/drug_molecule struct schema. Test fixtures updated to the
struct schema to catch this class of regression in future.
…nyms module

Move the clinical-trial synonym mining out of chembl_molecule.py into a
dedicated aact_synonyms.py, with the shared {label, source} struct primitives
in common/labels.py (avoids an import cycle). chembl_molecule.py drops from
731 to 362 lines; the mining tests move to test_aact_synonyms.py.

Also fixes the entry-id determinism in _anchor_candidates: the anchoring used
f.monotonically_increasing_id(), which is nondeterministic across re-evaluations
and could let the poisoned/anchors branches see inconsistent ids under Spark
adaptive re-planning. Replaced with a deterministic sha2 key over
(nct_id + sorted member set). No behaviour change in tests; matters at scale.
Move labels.py and aact_synonyms.py into src/pts/pyspark/drug_utils/ (out of the
pyspark root and common/) so the drug-specific helpers live together. Imports in
chembl_molecule and the tests updated accordingly. No behaviour change.

d0choa commented Jun 10, 2026


✅ Validated end-to-end on Dataproc (run-001)

Ran the chembl_molecule step on real il/26.06.0-dev0 inputs (drug JSONL + drugbank + the AACT clinical-trial batch extraction). Single step, 1× n1-standard-4 master + 2× n1-highmem-8 workers, ~5 min job. Output: gs://ot-team/ochoa/chembl_molecule_runs/run-001/intermediate/chembl_molecule.

Schema — both columns now array<struct<label:string, source:string>>.

| metric | value |
| --- | --- |
| total molecules | 2,878,135 |
| molecules with an AACT synonym | 2,393 |
| molecules with BOTH ChEMBL + AACT | 2,348 |
| distinct (id, AACT label) pairs (n_trials ≥ 2) | 8,038 |
| tradeNames with a non-ChEMBL source | 0 |

The "many molecules with synonyms from two sources" goal is met (2,348). The 8,038 pairs are the same order of magnitude as the experiment's design target (~few thousand at n_trials ≥ 2), slightly higher — consistent with CONFLICT cross-pollination (an abbreviation anchoring to more than one molecule) and a different release than the original experiment.

Spot-check of mined synonyms (all correct):

  • ATAZANAVIR → atv, atz, boosted atazanavir, atazanavir/ritonavir
  • TANDUTINIB → mln518 · SABARUBICIN → bms-195615, men-10755 (R&D codes — descriptor-code extraction working)
  • FOLINIC ACID → leucovorin, lv, calcium folinate, folfiri
  • CARBIDOPA → carbidopa/levodopa, sinemet cr, lcig · AVIBACTAM → ceftazidime-avibactam, caz-avi

Operational notes:

  • The embedded-newline AACT batch object names were a non-issue for Spark's GCS connector.
  • The deterministic sha2 entry key (replacing monotonically_increasing_id) ran in production without issue.
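
A minimal sketch of a deterministic entry key of the kind described above, written in plain Python rather than Spark. The commit message only states that the key is a sha2 hash over (nct_id + sorted member set); the SHA-256 width, the unit-separator delimiter, and the function name here are assumptions.

```python
import hashlib


def entry_key(nct_id, members):
    """Derive a stable id from a trial id and its drug member set.

    Sorting (and de-duplicating) the members makes the key independent of
    input order, so re-evaluating the same row always yields the same id.
    """
    # "\x1f" (unit separator) is an assumed delimiter; it keeps
    # ("NCT1", ["ab"]) distinct from ("NCT1", ["a", "b"]).
    canonical = "\x1f".join([nct_id, *sorted(set(members))])
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Unlike a generated row id, this key depends only on the row's content, which is what lets separate branches of a plan agree on it after re-planning.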

Codex review caught a real gap: rule #8 rewrote a descriptor phrase (e.g.
"akt inhibitor mk2206") to its bare code AFTER anchoring/status classification,
so the rewritten code carried a stale status. A code that is actually a
parent/child of the anchor (canonical vs salt) or already on the anchor slipped
through as NOVEL and could be added.

Move code extraction into a dedicated _rewrite_and_reclassify_codes stage
(between anchoring and cleanup) that re-resolves the rewritten code against the
ChEMBL name index: drops it if it is now on the anchor or newly over-ambiguous,
and recomputes NOVEL/PARENT_CHILD/CONFLICT so parent/child codes are dropped
downstream. _apply_cleanup_rules no longer does extraction. CONFLICT is still
kept (per the approved spec). New tests cover the redundant/parent-child/conflict
reclassification paths.

d0choa commented Jun 10, 2026


Codex review (gpt-5.5, high effort) — outcome

Ran a Codex review of the branch. No Critical findings; it confirmed no remaining array<string> consumer of these fields. Two substantive items, both addressed:

Fixed — descriptor-code reclassification

Codex correctly flagged that rule #8 rewrote a descriptor phrase to its bare R&D code (akt inhibitor mk2206 → mk2206) after anchoring/status classification, leaving a stale status, so a code that's actually a parent/child of the anchor (canonical vs salt) or already on the anchor could slip through as NOVEL.

Fixed in fix(aact): re-resolve descriptor-extracted codes before the trial gate: code extraction now lives in a dedicated _rewrite_and_reclassify_codes stage between anchoring and cleanup that re-resolves the rewritten code, drops it if it's now on the anchor or newly over-ambiguous, and recomputes NOVEL/PARENT_CHILD/CONFLICT. New unit tests cover the redundant / parent-child / conflict paths. On the real data (run-002) this changed no output counts — it's a correctness safeguard that costs no coverage.
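
To make the re-resolution order concrete, here is a rough plain-Python sketch of the idea: extract the code first, then classify the extracted code rather than the original phrase. Everything below is illustrative. The regex for an R&D code, the `name_index` and `parent_of` shapes, and the function names are assumptions, not the real `_rewrite_and_reclassify_codes` implementation.

```python
import re

AMBIGUITY_CAP = 10  # mirrors the "<= 10 ambiguity cap" from the summary

# Hypothetical shape of an R&D code: 2-5 letters, optional hyphen, 3+ digits.
CODE_RE = re.compile(r"\b[a-z]{2,5}-?\d{3,}\b")


def extract_code(phrase):
    """Return the bare code inside a descriptor phrase, or None."""
    match = CODE_RE.search(phrase.lower())
    return match.group(0) if match else None


def reclassify(code, anchor_id, name_index, parent_of):
    """Re-resolve an extracted code against the name index.

    name_index: dict of normalized name -> set of molecule ids (assumed).
    parent_of: dict of molecule id -> parent molecule id (assumed).
    Returns None when the code should be dropped, else a status string.
    """
    owners = name_index.get(code, set())
    if anchor_id in owners:
        return None  # already a name of the anchor: redundant
    if len(owners) > AMBIGUITY_CAP:
        return None  # newly over-ambiguous
    if not owners:
        return "NOVEL"
    anchor_parent = parent_of.get(anchor_id)
    for owner in owners:
        if owner == anchor_parent or parent_of.get(owner) == anchor_id:
            return "PARENT_CHILD"  # dropped by a later stage, per the PR
    return "CONFLICT"  # kept, per the approved spec
```

The point of the ordering is that the status is computed for the string that would actually be stored, so a code owned by the anchor's salt or parent form can no longer pass as NOVEL.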

Investigated — keeping CONFLICT (no change)

Codex questioned keeping CONFLICT candidates (cross-molecule labels). The spec deliberately keeps them; I quantified it against run-002:

| | count | share |
| --- | --- | --- |
| total AACT (id, label) pairs | 8,038 | |
| NOVEL-like (label nowhere else in ChEMBL) | 7,040 | 87.6% |
| CONFLICT-like (label is a ChEMBL name elsewhere) | 998 | 12.4% |
| …combination-looking (/, "with", "plus") | 3 | 0.04% |

The CONFLICT-like cases are overwhelmingly same-drug-family variants that ChEMBL models as separate molecules without a parentId link: abiraterone vs abiraterone acetate, adefovir vs adefovir dipivoxil, adalimumab across biosimilar entries, 5-fu under a 5-FU prodrug, stereoisomers. These are real synonyms; genuine cross-drug pollution is rare (3 obvious co-administration cases out of 8,038).

Decision: keep CONFLICT as-is. A higher MIN_CONFLICT_TRIALS would drop legitimate salt/prodrug/biosimilar synonyms (the bulk of the 998) to remove a handful of junk — net-negative. The data validates the spec's original "considered-and-rejected" reasoning.

Comment thread src/pts/pyspark/drug_utils/aact_synonyms.py
DSuveges previously approved these changes Jun 11, 2026


Everything seems sensible to me.

d0choa added a commit to opentargets/pis that referenced this pull request Jun 11, 2026
The AACT clinical-trial batch extraction feeds chembl_molecule (opentargets/pts#142,
tracking opentargets/issues#4414). It lives in its own standalone source
(gs://ot-team/irene/clinical_mining/aact_extraction_batch_result/output), so the
drug step downloads it to input/clinical_report/aact_extraction_batch_results.

chembl_molecule already depends on pis_drug, so this needs no
pts_chembl_molecule -> pis_clinical_report DAG edge. The clinical_report step is
unchanged (the batch is no longer nested under its input tree).
d0choa added a commit to opentargets/pis that referenced this pull request Jun 11, 2026
The AACT clinical-trial batch extraction feeds chembl_molecule (opentargets/pts#142,
tracking opentargets/issues#4414). It lives in its own standalone source
(gs://ot-team/irene/clinical_mining/aact_extraction_batch_result/output), so the
drug step copies it (via copy_many) to input/clinical_report/aact_extraction_batch_results.

chembl_molecule already depends on pis_drug, so this needs no
pts_chembl_molecule -> pis_clinical_report DAG edge. The clinical_report step is
unchanged (the batch is no longer nested under its input tree).
d0choa added a commit to opentargets/ot_croissant that referenced this pull request Jun 11, 2026
…ields (#47)

drug_molecule's `synonyms` and `tradeNames` changed from array<string> to
array<struct<label, source>> (opentargets/pts#142, tracking opentargets/issues#4414).
The Parquet introspection picks up the new subfields automatically, but they had
no curation, so they would publish with PLACEHOLDER descriptions.

Add curation entries for synonyms/label, synonyms/source, tradeNames/label and
tradeNames/source, mirroring the existing crossReferences/ids + /source pattern.
DSuveges pushed a commit to opentargets/orchestration that referenced this pull request Jun 11, 2026
Mirror the PTS config change (opentargets/pts#142): the chembl_molecule step
now reads input/clinical_report/aact_extraction_batch_results to mine
clinical-trial (AACT) synonyms. That input is staged by pis_clinical_report, so
pts_chembl_molecule now also depends on it in the unified pipeline (otherwise
the DAG could run chembl_molecule before the AACT batch is present).
ireneisdoomed previously approved these changes Jun 11, 2026


Impact on latest clinical mining data: + 3048 drug/disease pairs fully mapped (240 new drugs, 943 diseases)

Great addition, thanks :)

@d0choa d0choa merged commit d8280c3 into main Jun 11, 2026
2 checks passed
@d0choa d0choa deleted the chembl-labelled-synonyms-aact branch June 11, 2026 16:41
d0choa added a commit to opentargets/pis that referenced this pull request Jun 11, 2026
…202)

* feat(drug): download aact_extraction_batch_results in the drug step

The AACT clinical-trial batch extraction feeds chembl_molecule (opentargets/pts#142,
tracking opentargets/issues#4414). It lives in its own standalone source
(gs://ot-team/irene/clinical_mining/aact_extraction_batch_result/output), so the
drug step copies it (via copy_many) to input/clinical_report/aact_extraction_batch_results.

chembl_molecule already depends on pis_drug, so this needs no
pts_chembl_molecule -> pis_clinical_report DAG edge. The clinical_report step is
unchanged (the batch is no longer nested under its input tree).

* chore: bump version to 26.06.6
DSuveges pushed a commit to opentargets/orchestration that referenced this pull request Jun 17, 2026
Mirror the PTS config change (opentargets/pts#142): the chembl_molecule step
now reads input/clinical_report/aact_extraction_batch_results to mine
clinical-trial (AACT) synonyms. That input is staged by pis_clinical_report, so
pts_chembl_molecule now also depends on it in the unified pipeline (otherwise
the DAG could run chembl_molecule before the AACT batch is present).
remo87 pushed a commit to opentargets/pos that referenced this pull request Jun 17, 2026
…arrays

drug_molecule's `synonyms` and `tradeNames` changed from array<string> to
array<struct<label, source>> (opentargets/pts#142, tracking opentargets/issues#4414).
The drug_log ClickHouse table ingests output/drug_molecule directly, so its
column DDL must match the parquet — loading the struct arrays into Array(String)
columns would fail.

Update both to Array(Tuple(label String, source String)), mirroring the existing
crossReferences tuple. The final drug table (postload SELECT *) inherits the new
types, so no other change is needed.
remo87 pushed a commit to opentargets/platform-api that referenced this pull request Jun 17, 2026
drug_molecule's `synonyms` and `tradeNames` changed from array<string> to
array<struct<label, source>> (opentargets/pts#142, tracking opentargets/issues#4414),
and the POS ClickHouse drug table now stores Array(Tuple(label, source))
(opentargets/pos#130).

Read both as Seq[LabelAndSource] (the existing type Target already uses) — the
ClickHouse JSON read path parses the named tuple as a {label, source} object, the
same way crossReferences (Tuple(source, ids)) already works. The Drug GraphQL type
is derived, so the field type auto-updates to [LabelAndSource]; the synonyms/
tradeNames field docs are updated to describe the provenance.
project-defiant pushed a commit to opentargets/orchestration that referenced this pull request Jun 17, 2026
* chore: update data and software versions

* chore: add clinical_report llm dep

* chore: update pis paths

* fix: split ontoma into two steps to avoid circular issue

* fix: gentropy version typo

* chore: download essentiality from depmap directly

* fix: point `pts_literature_publication_match` to `pts_ontoma_literature`

* revert: essentiality task cannot pull from depmap

* fix: update openfda config

* fix: add missing target dep

* fix: add missing dep for search_facet

* fix: add string_version as a pts env variable

* fix: remove qc flags from drug_molecule

* fix: typo

* fix: update essentiality filename

* fix: add pts_target to pts_evidence_postprocess_clinical_precedence deps

* chore: update pts

* fix: split openfda subtasks into independent tasks (spark job goes idle)

* fix: baseline_expression step only triggers a single spark job

* fix: add pis_heritability to unified dag

* fix: update score expression for some sources

* chore: rename pts_ontoma_literature to run on literature cluster

* perf: improve pts cluster settings to allow parallel jobs

* perf: improve pts cluster settings to allow parallel jobs

* fix: typo

* fix(gentropy): add interactions

* fix: baseline expression path typo

* chore: avoid preemptible secondary workers in literature cluster

* revert: baseline expression path typo

* chore: uncomment metrics

* fix(epmc): evidence format is parquet

* chore: rename baseline_expression_aggregated to baseline_expression

* chore(l2g): set `train_on_full_dataset` to false

* chore: update pts to check target fix

* chore: update pts to check target fix

* chore(pis): updating PanelApp data source for 2026.05.11 release

- The file has the same schema and identical format
- The new release has 33k fewer lines, which might indicate we are not getting all ratings. It might not impact the number of evidence and associations at the end.

* chore: bump chembl_version to 37

* chore: retire ETL stage from unified pipeline

The ETL stage in the unified pipeline DAG has had zero step consumers
since PR #195. Remove the now-orphan configuration, loaders, DAG stage
function, and supporting operator/enum entries:

- clusters.yaml: drop the `etl` and `etl_literature` clusters plus the
  `step_job_properties.etl` block.
- unified_pipeline.yaml: drop `etl_version` and the `etl_literature`
  step entry.
- etl.conf: deleted (no longer loaded).
- config/unified_pipeline.py: drop the `etl` AppConfig loader (and its
  PPP overlay), `etl_version`/`etl_jar_origin_uri`, the now-unused
  `jar_uri()` helper, and the `exts` map in `config_uri()`.
- dags/unified_pipeline.py: drop the `etl_stage()` function, its call,
  and the imports that only it used (`ETLJobBuilder`, `CopyBlobOperator`,
  `to_hocon`).
- operators/dataproc.py: delete `ETLJobBuilder`.
- models/step.py: drop `UnifiedPipelineStage.ETL`.
- operators/diff.py: refresh docstring examples that referenced the
  removed `etl_stage` task IDs.

* chore: enable pts_association_timeseries_view for non-PPP runs

* refactor: delete etl config

* chore(pts): propagating changes evidence_clinical_precedence config

* fix: revert testing output

* chore(uv): update lockfile

* fix(pts): literature config added

* fix(ot_crispr): study table is now exported in csv

* feat(pts): wire aact_extraction_batch_results into chembl_molecule

Mirror the PTS config change (opentargets/pts#142): the chembl_molecule step
now reads input/clinical_report/aact_extraction_batch_results to mine
clinical-trial (AACT) synonyms. That input is staged by pis_clinical_report, so
pts_chembl_molecule now also depends on it in the unified pipeline (otherwise
the DAG could run chembl_molecule before the AACT batch is present).

* refactor(pis): move aact batch download to drug step, drop the chembl_molecule dep

Mirror opentargets/pis#202: download aact_extraction_batch_results in the PIS
drug step (clinical_report glob split into top-level / aact / chembl subtrees to
exclude it). Since pts_chembl_molecule already depends on pis_drug, revert the
earlier pts_chembl_molecule -> pis_clinical_report edge — the DAG dependencies
stay as they were.

* refactor(pis): point aact glob at standalone source, drop clinical_report split

* refactor(pis): use copy_many for the aact batch download

* chore: bump pis_version to 26.06.0-dev.2 and pts_version to 26.06.0-dev.4

* chore: remove unnecessary flag

* chore: configuration updates

* chore(pts): migrate partition_count configs from pts repo

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: update for new run

* fix(colocalisation): fixing gentropy tag (v3.3.0-dev.56) for cluster

* fix(clinical_target): remove 'UNVALIDATED_INDICATION' flag

Removed 'UNVALIDATED_INDICATION' from invalid clinical report QC settings.

* fix(credible_set): add `pts_target` as dependency for `isTransQtl` 

@DSuveges This change was uncommitted in Airflow. Can you confirm this is correct?

* chore(metrics): add pts_clinical_target as dependency

---------

Co-authored-by: Irene Lopez <irene.lopezs@protonmail.com>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
Co-authored-by: root <root@inst-builder-debian-11-build-build-8rm9w.europe-west4-b.c.gce-image-builder.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Irene López Santiago <45119610+ireneisdoomed@users.noreply.github.com>