feat(chembl_molecule): labelled synonyms/tradeNames + AACT clinical-trial synonym feed by d0choa · Pull Request #142 · opentargets/pts

d0choa · 2026-06-10T14:58:43Z

Tracking issue: opentargets/issues#4414

Summary

Adds source provenance to drug synonyms and a second synonym source mined from clinical trials.

- Schema change: chembl_molecule's synonyms and tradeNames change from array<string> to array<struct<label, source>> (mirroring output/target). Every existing ChEMBL name is tagged source: "ChEMBL". The struct propagates through drug_molecule (pure passthrough) to the final output/drug_molecule.
- AACT clinical-trial synonym feed: a PySpark port (inline in chembl_molecule.py) of the prior work/clinical_pairs/ experiment, integrated as a real pipeline source:
  - parse the OpenAI batch JSONL (aact_extraction_batch_results) → normalized drug member sets per trial
  - anchor members against an in-step ChEMBL name index (with a ≤10 ambiguity cap)
  - apply the experiment's 11 cleanup rules (control noise, drug-class/cell-therapy via word-boundary matching so real drugs like nystatin/cellcept survive, strength/%, single-char, regimen suppression, descriptor-code extraction, plural suppression) and NOVEL/PARENT_CHILD/CONFLICT status
  - keep candidates seen in ≥ 2 distinct trials, append as {label, source: "AACT"}, deduped case-insensitively against existing ChEMBL labels
  - the name fallback only ever picks a ChEMBL-source synonym, so an AACT label never becomes a molecule's display name
- Downstream: search.py flattens synonyms.label/tradeNames.label; openfda.py updated to extract .label before flattening (it consumes drug_molecule and would otherwise break on the struct schema; its stale array<string> test fixtures were also updated).
- Config: aact_extraction_batch_results: input/clinical_report/aact_extraction_batch_results wired into the chembl_molecule step (the same PIS-provided input clinical_report already uses; upstream of everything, no dependency cycle).
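The anchoring step in the list above can be pictured with a small plain-Python sketch (the real step is PySpark). It assumes the ambiguity cap means "a name that matches more than 10 molecules is not used as an anchor"; `MAX_AMBIGUITY`, `build_name_index`, `anchor_members`, and the molecule ids are illustrative names, not identifiers from this PR.

```python
from collections import defaultdict

MAX_AMBIGUITY = 10  # assumed reading of the "<=10 ambiguity cap"


def build_name_index(molecules):
    """Map each lower-cased name to the set of molecule ids that carry it."""
    index = defaultdict(set)
    for mol_id, names in molecules.items():
        for name in names:
            index[name.strip().lower()].add(mol_id)
    return index


def anchor_members(members, index, max_ambiguity=MAX_AMBIGUITY):
    """Return (anchor_ids, candidates) for one trial's normalized member set."""
    anchor_ids, candidates = set(), set()
    for member in members:
        hits = index.get(member, set())
        if not hits:
            candidates.add(member)   # unknown name: a potential new synonym
        elif len(hits) <= max_ambiguity:
            anchor_ids |= hits       # known and specific enough to anchor
        # names matching more than max_ambiguity molecules are ignored
    return anchor_ids, candidates


# Made-up ids; the names are taken from the spot-check in this PR.
molecules = {"MOL_1": ["ATAZANAVIR"], "MOL_2": ["RITONAVIR"]}
index = build_name_index(molecules)
anchors, cands = anchor_members({"atazanavir", "atv"}, index)
```

Here `atazanavir` anchors the set to `MOL_1` and `atv` is left as a candidate synonym for it.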

End state: many molecules carry synonyms from two sources (ChEMBL + AACT).
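A minimal plain-Python sketch of the merge and the name fallback described in the summary, assuming each synonym is a `{label, source}` record. The helper names (`label_all`, `merge_synonyms`, `fallback_name`) and the example labels are hypothetical; the pipeline itself does this on Spark columns.

```python
def label_all(names, source):
    """Wrap plain names as {label, source} records."""
    return [{"label": n, "source": source} for n in names]


def merge_synonyms(chembl, aact):
    """Append AACT records whose label is not already present, ignoring case."""
    seen = {s["label"].lower() for s in chembl}
    merged = list(chembl)
    for s in aact:
        key = s["label"].lower()
        if key not in seen:
            seen.add(key)
            merged.append(s)
    return merged


def fallback_name(synonyms):
    """Pick a display name from ChEMBL-sourced synonyms only; None if there is none."""
    for s in synonyms:
        if s["source"] == "ChEMBL":
            return s["label"]
    return None


# Made-up example data.
chembl = label_all(["Aspirin", "Acetylsalicylic acid"], "ChEMBL")
aact = label_all(["aspirin", "asa"], "AACT")
merged = merge_synonyms(chembl, aact)
```

The AACT `aspirin` is dropped as a case-insensitive duplicate, `asa` is appended with source `AACT`, and `fallback_name` can never return it.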

Design & plan live in the Obsidian vault: Areas/Work/Projects/Project-PTS/specs/2026-06-10-chembl-molecule-labelled-synonyms-aact-design.md (+ -plan.md).

Test Plan

make test — full suite green (396 passed)
uv run ruff check — clean
Unit coverage: struct schema, batch parse (incl. malformed-line drop), anchor indexes, anchoring + status, all cleanup rules (incl. word-boundary guard), n_trials gate, case-insensitive merge dedup, end-to-end two-source molecule, openfda struct fix
End-to-end Dataproc run (run-001, il/26.06.0-dev0, ~5 min): 2,348 molecules with both ChEMBL + AACT synonyms, 8,038 (id, label) pairs at n_trials ≥ 2, 0 non-ChEMBL tradeNames — see the validation comment below for the full table
Scale check: anchoring ran fine on a small 2-worker cluster; embedded-newline AACT object names were a non-issue for Spark's GCS connector
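The word-boundary guard covered by the unit tests can be illustrated as follows. This is a sketch under assumptions: the term list and the function name `is_class_noise` are hypothetical, and only the `\b` technique is taken from the PR description.

```python
import re

CLASS_TERMS = ["statin", "cell"]  # hypothetical drug-class / cell-therapy terms

# \b on both sides: a term only matches as a whole word,
# never as a fragment inside a longer drug name.
CLASS_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, CLASS_TERMS)) + r")\b")


def is_class_noise(label):
    """True when the label contains a class term as a standalone word."""
    return CLASS_RE.search(label.lower()) is not None


survivors = [name for name in ["nystatin", "cellcept", "high-dose statin"]
             if not is_class_noise(name)]
```

A plain substring test would discard `nystatin` and `cellcept`; with word boundaries only `high-dose statin` is treated as noise.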

Notes

This integrates the clinical-trial synonym feed only. The earlier Probes&Drugs synonym-expansion effort is intentionally out of scope.
Stored AACT labels are the normalized form in v1; a "most-frequent surface form" refinement is a noted follow-up.

…urce

_prepare_drug_list now projects synonyms.label / tradeNames.label so all inputs to the array() flatten are array<string>, matching the updated chembl_molecule/drug_molecule struct schema. Test fixtures updated to the struct schema to catch this class of regression in future.
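In plain-Python terms, the projection amounts to taking the `label` of each record before concatenating. The sketch below is an analogy, not the openfda code: `project_labels`, `flatten_names`, and the row layout are assumptions made for illustration.

```python
def project_labels(structs):
    """array<struct<label, source>> -> array<string>, tolerating a missing column."""
    return [s["label"] for s in (structs or [])]


def flatten_names(row):
    """Concatenate the name, synonym labels and trade-name labels into one list of strings."""
    return [
        row["name"],
        *project_labels(row.get("synonyms")),
        *project_labels(row.get("tradeNames")),
    ]


# Made-up row in the new struct schema.
row = {
    "name": "DRUG X",
    "synonyms": [{"label": "dx", "source": "AACT"}],
    "tradeNames": [{"label": "Xtrade", "source": "ChEMBL"}],
}
names = flatten_names(row)
```

Every element of `names` is a string, which is the property the flatten step relies on.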

…nyms module

Move the clinical-trial synonym mining out of chembl_molecule.py into a dedicated aact_synonyms.py, with the shared {label, source} struct primitives in common/labels.py (avoids an import cycle). chembl_molecule.py drops from 731 to 362 lines; the mining tests move to test_aact_synonyms.py.

Also fixes the entry-id determinism in _anchor_candidates: the anchoring used f.monotonically_increasing_id(), which is nondeterministic across re-evaluations and could let the poisoned/anchors branches see inconsistent ids under Spark adaptive re-planning. Replaced with a deterministic sha2 key over (nct_id + sorted member set). No behaviour change in tests; matters at scale.
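The idea behind the deterministic key can be shown with `hashlib`. This is a sketch, not the Spark expression used in the PR: the `|` separator, the choice of SHA-256, the function name `entry_key`, and the trial id are assumptions.

```python
import hashlib


def entry_key(nct_id, members):
    """Content-derived key: same trial and same member set always give the same id."""
    # Sorting makes the key independent of the order in which members arrive.
    payload = "|".join([nct_id, *sorted(members)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Made-up trial id; the key does not depend on evaluation order or partitioning.
k1 = entry_key("NCT00000001", {"atv", "atazanavir"})
k2 = entry_key("NCT00000001", ["atazanavir", "atv"])
```

Because the key is a pure function of the row content, two branches that recompute it independently agree on it, which a generated row id cannot guarantee.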

Move labels.py and aact_synonyms.py into src/pts/pyspark/drug_utils/ (out of the pyspark root and common/) so the drug-specific helpers live together. Imports in chembl_molecule and the tests updated accordingly. No behaviour change.

d0choa · 2026-06-10T17:45:30Z

✅ Validated end-to-end on Dataproc (run-001)

Ran the chembl_molecule step on real il/26.06.0-dev0 inputs (drug JSONL + drugbank + the AACT clinical-trial batch extraction). Single step, 1× n1-standard-4 master + 2× n1-highmem-8 workers, ~5 min job. Output: gs://ot-team/ochoa/chembl_molecule_runs/run-001/intermediate/chembl_molecule.

Schema — both columns now array<struct<label:string, source:string>>.

| metric | value |
| --- | --- |
| total molecules | 2,878,135 |
| molecules with an AACT synonym | 2,393 |
| molecules with BOTH ChEMBL + AACT | 2,348 |
| distinct `(id, AACT label)` pairs (`n_trials ≥ 2`) | 8,038 |
| tradeNames with a non-ChEMBL source | 0 ✓ |

The "many molecules with synonyms from two sources" goal is met (2,348). The 8,038 pairs are the same order of magnitude as the experiment's design target (~few thousand at n_trials ≥ 2), slightly higher — consistent with CONFLICT cross-pollination (an abbreviation anchoring to more than one molecule) and a different release than the original experiment.
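For readers unfamiliar with the gate, here is a minimal sketch of counting distinct trials per `(id, label)` pair. The function name `gate_by_trials`, the tuple layout, and all ids are hypothetical. It also shows why a label that anchors to two molecules contributes two pairs.

```python
from collections import defaultdict

MIN_TRIALS = 2  # the n_trials >= 2 gate from the PR description


def gate_by_trials(observations, min_trials=MIN_TRIALS):
    """Keep (molecule_id, label) pairs seen in at least min_trials distinct trials."""
    trials = defaultdict(set)
    for mol_id, label, nct_id in observations:
        trials[(mol_id, label)].add(nct_id)
    return {pair: len(ncts) for pair, ncts in trials.items() if len(ncts) >= min_trials}


# Made-up observations.
observations = [
    ("MOL_1", "atv", "NCT_A"),
    ("MOL_1", "atv", "NCT_A"),  # same trial repeated: counted once
    ("MOL_1", "atv", "NCT_B"),
    ("MOL_1", "atz", "NCT_A"),  # only one trial: dropped
    ("MOL_2", "atv", "NCT_A"),
    ("MOL_2", "atv", "NCT_C"),  # same label under a second molecule: a separate pair
]
kept = gate_by_trials(observations)
```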

Spot-check of mined synonyms (all correct):

ATAZANAVIR → atv, atz, boosted atazanavir, atazanavir/ritonavir
TANDUTINIB → mln518 · SABARUBICIN → bms-195615, men-10755 (R&D codes — descriptor-code extraction working)
FOLINIC ACID → leucovorin, lv, calcium folinate, folfiri
CARBIDOPA → carbidopa/levodopa, sinemet cr, lcig · AVIBACTAM → ceftazidime-avibactam, caz-avi

Operational notes:

The embedded-newline AACT batch object names were a non-issue for Spark's GCS connector.
The deterministic sha2 entry key (replacing monotonically_increasing_id) ran in production without issue.

Codex review caught a real gap: rule #8 rewrote a descriptor phrase (e.g. "akt inhibitor mk2206") to its bare code AFTER anchoring/status classification, so the rewritten code carried a stale status. A code that is actually a parent/child of the anchor (canonical vs salt) or already on the anchor slipped through as NOVEL and could be added. Move code extraction into a dedicated _rewrite_and_reclassify_codes stage (between anchoring and cleanup) that re-resolves the rewritten code against the ChEMBL name index: drops it if it is now on the anchor or newly over-ambiguous, and recomputes NOVEL/PARENT_CHILD/CONFLICT so parent/child codes are dropped downstream. _apply_cleanup_rules no longer does extraction. CONFLICT is still kept (per the approved spec). New tests cover the redundant/parent-child/conflict reclassification paths.

d0choa · 2026-06-10T19:32:37Z

Codex review (gpt-5.5, high effort) — outcome

Ran a Codex review of the branch. No Critical findings; it confirmed no remaining array<string> consumer of these fields. Two substantive items, both addressed:

Fixed — descriptor-code reclassification

Codex correctly flagged that rule #8 rewrote a descriptor phrase to its bare R&D code (akt inhibitor mk2206 → mk2206) after anchoring/status classification, leaving a stale status — so a code that's actually a parent/child of the anchor (canonical vs salt) or already on the anchor could slip through as NOVEL.

Fixed in fix(aact): re-resolve descriptor-extracted codes before the trial gate: code extraction now lives in a dedicated _rewrite_and_reclassify_codes stage between anchoring and cleanup that re-resolves the rewritten code, drops it if it's now on the anchor or newly over-ambiguous, and recomputes NOVEL/PARENT_CHILD/CONFLICT. New unit tests cover the redundant / parent-child / conflict paths. On the real data (run-002) this changed no output counts — it's a correctness safeguard that costs no coverage.
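A rough sketch of the rewrite-then-reclassify order follows. It is illustrative only: the code pattern `CODE_RE`, the helper names, the `family` mapping (anchor id to its parent/child ids), and the status rules are assumptions inferred from the description above, not the contents of `_rewrite_and_reclassify_codes`.

```python
import re

MAX_AMBIGUITY = 10
CODE_RE = re.compile(r"\b[a-z]{2,5}-?\d{3,}\b")  # hypothetical R&D-code pattern


def extract_code(label):
    """Return the bare code embedded in a descriptor phrase, if any."""
    match = CODE_RE.search(label)
    return match.group(0) if match else None


def reclassify(code, anchor_id, name_index, family):
    """Re-resolve a rewritten code; None means the candidate is dropped here."""
    owners = name_index.get(code, set())
    if anchor_id in owners:
        return None               # already a name of the anchor: redundant
    if len(owners) > MAX_AMBIGUITY:
        return None               # newly over-ambiguous
    if not owners:
        return "NOVEL"
    if owners & family.get(anchor_id, set()):
        return "PARENT_CHILD"     # dropped later by the downstream cleanup
    return "CONFLICT"             # kept, per the spec


# Made-up index: the code already belongs to the salt form of the anchor.
name_index = {"mk2206": {"MOL_SALT"}}
family = {"MOL_PARENT": {"MOL_SALT"}}
code = extract_code("akt inhibitor mk2206")
status = reclassify(code, "MOL_PARENT", name_index, family)
```

Classifying before the rewrite would have looked up the full phrase, found nothing, and labelled it NOVEL; classifying after the rewrite yields PARENT_CHILD.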

Investigated — keeping CONFLICT (no change)

Codex questioned keeping CONFLICT candidates (cross-molecule labels). The spec deliberately keeps them; I quantified it against run-002:

| | count | share |
| --- | --- | --- |
| total AACT `(id, label)` pairs | 8,038 | |
| NOVEL-like (label nowhere else in ChEMBL) | 7,040 | 87.6% |
| CONFLICT-like (label is a ChEMBL name elsewhere) | 998 | 12.4% |
| …combination-looking (`/`, "with", "plus") | 3 | 0.04% |

The CONFLICT-like cases are overwhelmingly same-drug-family variants ChEMBL models as separate molecules without a parentId link — abiraterone ↔ abiraterone acetate, adefovir ↔ adefovir dipivoxil, adalimumab across biosimilar entries, 5-fu under a 5-FU prodrug, stereoisomers. These are real synonyms; genuine cross-drug pollution is rare (3 obvious co-administration cases out of 8,038).
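The shares quoted above follow directly from the counts; a quick check in Python, using only the numbers reported in this comment:

```python
total = 8_038
counts = {"NOVEL-like": 7_040, "CONFLICT-like": 998, "combination-looking": 3}

# NOVEL-like and CONFLICT-like partition the full set of pairs.
partition_ok = counts["NOVEL-like"] + counts["CONFLICT-like"] == total

# Percentage of all AACT (id, label) pairs, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}
```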

Decision: keep CONFLICT as-is. A higher MIN_CONFLICT_TRIALS would drop legitimate salt/prodrug/biosimilar synonyms (the bulk of the 998) to remove a handful of junk — net-negative. The data validates the spec's original "considered-and-rejected" reasoning.

DSuveges

Everything seems sensible to me.

The AACT clinical-trial batch extraction feeds chembl_molecule (opentargets/pts#142, tracking opentargets/issues#4414). It lives in its own standalone source (gs://ot-team/irene/clinical_mining/aact_extraction_batch_result/output), so the drug step downloads it to input/clinical_report/aact_extraction_batch_results. chembl_molecule already depends on pis_drug, so this needs no pts_chembl_molecule -> pis_clinical_report DAG edge. The clinical_report step is unchanged (the batch is no longer nested under its input tree).

…ields (#47)

drug_molecule's `synonyms` and `tradeNames` changed from array<string> to array<struct<label, source>> (opentargets/pts#142, tracking opentargets/issues#4414). The Parquet introspection picks up the new subfields automatically, but they had no curation, so they would publish with PLACEHOLDER descriptions. Add curation entries for synonyms/label, synonyms/source, tradeNames/label and tradeNames/source, mirroring the existing crossReferences/ids + /source pattern.

Mirror the PTS config change (opentargets/pts#142): the chembl_molecule step now reads input/clinical_report/aact_extraction_batch_results to mine clinical-trial (AACT) synonyms. That input is staged by pis_clinical_report, so pts_chembl_molecule now also depends on it in the unified pipeline (otherwise the DAG could run chembl_molecule before the AACT batch is present).

ireneisdoomed

Impact on latest clinical mining data: + 3048 drug/disease pairs fully mapped (240 new drugs, 943 diseases)

Great addition, thanks :)

…202)

* feat(drug): download aact_extraction_batch_results in the drug step

  The AACT clinical-trial batch extraction feeds chembl_molecule (opentargets/pts#142, tracking opentargets/issues#4414). It lives in its own standalone source (gs://ot-team/irene/clinical_mining/aact_extraction_batch_result/output), so the drug step copies it (via copy_many) to input/clinical_report/aact_extraction_batch_results. chembl_molecule already depends on pis_drug, so this needs no pts_chembl_molecule -> pis_clinical_report DAG edge. The clinical_report step is unchanged (the batch is no longer nested under its input tree).

* chore: bump version to 26.06.6

…arrays

drug_molecule's `synonyms` and `tradeNames` changed from array<string> to array<struct<label, source>> (opentargets/pts#142, tracking opentargets/issues#4414). The drug_log ClickHouse table ingests output/drug_molecule directly, so its column DDL must match the parquet: loading the struct arrays into Array(String) columns would fail. Update both to Array(Tuple(label String, source String)), mirroring the existing crossReferences tuple. The final drug table (postload SELECT *) inherits the new types, so no other change is needed.

drug_molecule's `synonyms` and `tradeNames` changed from array<string> to array<struct<label, source>> (opentargets/pts#142, tracking opentargets/issues#4414), and the POS ClickHouse drug table now stores Array(Tuple(label, source)) (opentargets/pos#130). Read both as Seq[LabelAndSource] (the existing type Target already uses) — the ClickHouse JSON read path parses the named tuple as a {label, source} object, the same way crossReferences (Tuple(source, ids)) already works. The Drug GraphQL type is derived, so the field type auto-updates to [LabelAndSource]; the synonyms/ tradeNames field docs are updated to describe the provenance.

@DSuveges

* chore: update data and software versions
* chore: add clinical_report llm dep
* chore: update pis paths
* fix: split ontoma into two steps to avoid circular issue
* fix: gentropy version typo
* chore: download essentiality from depmap directly
* fix: point `pts_literature_publication_match` to `pts_ontoma_literature`
* revert: essentiality task cannot pull from depmap
* fix: update openfda config
* fix: add missing target dep
* fix: add missing dep for search_facet
* fix: add string_version as a pts env variable
* fix: remove qc flags from drug_molecule
* fix: typo
* fix: update essentiality filename
* fix: add pts_target to pts_evidence_postprocess_clinical_precedence deps
* chore: update pts
* fix: split openfda subtasks into independent tasks (spark job goes idle)
* fix: baseline_expression step only triggers a single spark job
* fix: add pis_heritability to unified dag
* fix: update score expression for some sources
* chore: rename pts_ontoma_literature to run on literature cluster
* perf: improve pts cluster settings to allow parallel jobs
* perf: improve pts cluster settings to allow parallel jobs
* fix: typo
* fix(gentropy): add interactions
* fix: baseline expression path typo
* chore: avoid preemptible secondary workers in literature cluster
* revert: baseline expression path typo
* chore: uncomment metrics
* fix(epmc): evidence format is parquet
* chore: rename baseline_expression_aggregated to baseline_expression
* chore(l2g): set `train_on_full_dataset` to false
* chore: update pts to check target fix
* chore: update pts to check target fix
* chore(pis): updating PanelApp data source for 2026.05.11 release
  - The file has the same schema and identical format
  - The new release has 33k fewer lines, which might indicate we are not getting all ratings. It might not impact the number of evidence and associations at the end.
* chore: bump chembl_version to 37
* chore: retire ETL stage from unified pipeline

  The ETL stage in the unified pipeline DAG has had zero step consumers since PR #195. Remove the now-orphan configuration, loaders, DAG stage function, and supporting operator/enum entries:
  - clusters.yaml: drop the `etl` and `etl_literature` clusters plus the `step_job_properties.etl` block.
  - unified_pipeline.yaml: drop `etl_version` and the `etl_literature` step entry.
  - etl.conf: deleted (no longer loaded).
  - config/unified_pipeline.py: drop the `etl` AppConfig loader (and its PPP overlay), `etl_version`/`etl_jar_origin_uri`, the now-unused `jar_uri()` helper, and the `exts` map in `config_uri()`.
  - dags/unified_pipeline.py: drop the `etl_stage()` function, its call, and the imports that only it used (`ETLJobBuilder`, `CopyBlobOperator`, `to_hocon`).
  - operators/dataproc.py: delete `ETLJobBuilder`.
  - models/step.py: drop `UnifiedPipelineStage.ETL`.
  - operators/diff.py: refresh docstring examples that referenced the removed `etl_stage` task IDs.
* chore: enable pts_association_timeseries_view for non-PPP runs
* refactor: delete etl config
* chore(pts): propagating changes evidence_clinical_precedence config
* fix: revert testing output
* chore(uv): update lockfile
* fix(pts): literature config added
* fix(ot_crispr): study table is now exported in csv
* feat(pts): wire aact_extraction_batch_results into chembl_molecule

  Mirror the PTS config change (opentargets/pts#142): the chembl_molecule step now reads input/clinical_report/aact_extraction_batch_results to mine clinical-trial (AACT) synonyms. That input is staged by pis_clinical_report, so pts_chembl_molecule now also depends on it in the unified pipeline (otherwise the DAG could run chembl_molecule before the AACT batch is present).
* refactor(pis): move aact batch download to drug step, drop the chembl_molecule dep

  Mirror opentargets/pis#202: download aact_extraction_batch_results in the PIS drug step (clinical_report glob split into top-level / aact / chembl subtrees to exclude it). Since pts_chembl_molecule already depends on pis_drug, revert the earlier pts_chembl_molecule -> pis_clinical_report edge; the DAG dependencies stay as they were.
* refactor(pis): point aact glob at standalone source, drop clinical_report split
* refactor(pis): use copy_many for the aact batch download
* chore: bump pis_version to 26.06.0-dev.2 and pts_version to 26.06.0-dev.4
* chore: remove unnecessary flag
* chore: configuration updates
* chore(pts): migrate partition_count configs from pts repo

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: update for new run
* fix(colocalisation): fixing gentropy tag (v3.3.0-dev.56) for cluster
* fix(clinical_target): remove 'UNVALIDATED_INDICATION' flag

  Removed 'UNVALIDATED_INDICATION' from invalid clinical report QC settings.
* fix(credible_set): add `pts_target` as dependency for `isTransQtl`

  @DSuveges This change was uncommitted in Airflow. Can you confirm this is correct?
* chore(metrics): add pts_clinical_target as dependency

---------

Co-authored-by: Irene Lopez <irene.lopezs@protonmail.com>
Co-authored-by: David Ochoa <ochoa@ebi.ac.uk>
Co-authored-by: root <root@inst-builder-debian-11-build-build-8rm9w.europe-west4-b.c.gce-image-builder.internal>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Irene López Santiago <45119610+ireneisdoomed@users.noreply.github.com>

d0choa added 14 commits June 10, 2026 13:46

feat(chembl_molecule): label/source structs for synonyms and tradeNames

0d5aa93

feat(search): flatten synonyms/tradeNames .label from struct schema

5b534bd

test(drug_molecule): struct schema for tradeNames/synonyms fixtures

4d4ef00

feat(chembl_molecule): name normalization helper for AACT mining

3293977

feat(chembl_molecule): parse AACT batch JSONL into drug member sets

6e23432

feat(chembl_molecule): ChEMBL name/regimen/parent-child anchor indexes

0ccfcb3

feat(chembl_molecule): anchor AACT candidates with status classification

869fa30

feat(chembl_molecule): AACT candidate cleanup rules (#5-#11)

3d0965d

feat(chembl_molecule): mine AACT synonyms with n_trials>=2 gate

e02a649

feat(chembl_molecule): merge AACT synonyms into output, wire batch so…

c6cef04

…urce

feat(chembl_molecule): wire AACT extraction batch source

c95f64c

d0choa mentioned this pull request Jun 10, 2026

Enhance chembl_molecule synonyms with clinical-trial (AACT) extracted drug names opentargets/issues#4414

Closed

d0choa marked this pull request as ready for review June 10, 2026 19:39

d0choa requested a review from ireneisdoomed June 10, 2026 19:39

This was referenced Jun 10, 2026

feat(drug): {label,source} struct schema + ChEMBL-over-AACT synonym prioritisation opentargets/OnToma#56

Merged

feat(pts): wire aact_extraction_batch_results into chembl_molecule (via pis_drug) opentargets/orchestration#207

Merged

DSuveges reviewed Jun 11, 2026

Comment thread src/pts/pyspark/drug_utils/aact_synonyms.py

DSuveges previously approved these changes Jun 11, 2026

d0choa mentioned this pull request Jun 11, 2026

feat(drug): download aact_extraction_batch_results in the drug step opentargets/pis#202

Merged

d0choa mentioned this pull request Jun 11, 2026

feat: document drug_molecule synonyms/tradeNames {label, source} subfields opentargets/ot_croissant#47

Merged

1 task

chore(deps): bump ontoma to >=2.5.0 for the {label,source} drug schema

6b03a31

d0choa dismissed DSuveges’s stale review via 6b03a31 June 11, 2026 10:49

d0choa mentioned this pull request Jun 11, 2026

fix(clickhouse): drug synonyms/tradeNames are {label, source} struct arrays opentargets/pos#130

Merged

Merge branch 'main' into chembl-labelled-synonyms-aact

71dd5eb

d0choa mentioned this pull request Jun 11, 2026

feat(drug): expose synonyms/tradeNames as {label, source} opentargets/platform-api#349

Merged

ireneisdoomed previously approved these changes Jun 11, 2026

chore(deps): bump ontoma to >=2.5.1

c13cccc

d0choa dismissed ireneisdoomed’s stale review via c13cccc June 11, 2026 16:30

d0choa merged commit d8280c3 into main Jun 11, 2026
2 checks passed

d0choa deleted the chembl-labelled-synonyms-aact branch June 11, 2026 16:41

d0choa merged 18 commits into main from chembl-labelled-synonyms-aact
