
fix(clickhouse): drug synonyms/tradeNames are {label, source} struct arrays #130

Merged
remo87 merged 1 commit into main from drug-molecule-synonyms-struct on Jun 17, 2026
@d0choa d0choa commented Jun 11, 2026

Contributor

Summary

drug_molecule's synonyms and tradeNames fields changed from array<string> to array<struct<label, source>> (opentargets/pts#142, tracking issue opentargets/issues#4414) — each name now carries provenance (ChEMBL, or AACT for names mined from clinical trials).

POS ingests output/drug_molecule directly into the drug_log ClickHouse table via config/clickhouse/schema/drug.sql, so the column DDL must match the parquet. Loading the new struct arrays into Array(String) columns would fail.

What changed

config/clickhouse/schema/drug.sql: synonyms and tradeNames change from Array(String) to Array(Tuple(label String, source String)), mirroring the existing crossReferences tuple. The final drug table (scripts/drug.sql, SELECT *) inherits the new types, so nothing else changes.
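
To make the type change concrete, here is a minimal Python sketch of the row shape before and after. It is not code from POS: the helper `conforms_to_label_source` and the sample values are hypothetical, and the check only mimics what a column typed Array(Tuple(label String, source String)) expects of each element.

```python
# Hypothetical illustration; not code from POS or ClickHouse.
NEW_COLUMN_TYPE = "Array(Tuple(label String, source String))"


def conforms_to_label_source(values):
    """Return True if every element is a {label, source} struct of strings."""
    return all(
        isinstance(v, dict)
        and list(v.keys()) == ["label", "source"]
        and all(isinstance(x, str) for x in v.values())
        for v in values
    )


# Old shape: array<string> (sample value is made up).
old_synonyms = ["aspirin"]
# New shape: array<struct<label, source>> (sample value is made up).
new_synonyms = [{"label": "aspirin", "source": "ChEMBL"}]

print(conforms_to_label_source(old_synonyms))  # False
print(conforms_to_label_source(new_synonyms))  # True
```

The old shape fails the check for the same reason the reverse mismatch matters here: the column type and the parquet element type have to agree.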

Scope (verified)

This is the only POS consequence:

  • OpenSearch — there's no drug_molecule index; the drug dataset goes only to ClickHouse. search_drug is built from the flattened search dataset (.label strings, schema unchanged); indication doesn't carry these fields.
  • BigQuery / release — parquet is copied / auto-detected; no explicit drug schema references these fields.
  • No src/ Python or other ClickHouse script references drug synonyms/tradeNames.

ClickHouse matches Tuple elements to the parquet struct positionally; the parquet order is {label, source} (label first), which the DDL matches — consistent with the existing crossReferences tuple.
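
Why the field order matters under positional matching can be sketched in a few lines of Python. This is an illustration of the idea as described above, not ClickHouse internals; `struct_to_tuple_positionally` and the sample values are hypothetical.

```python
# Hypothetical illustration of positional matching; not ClickHouse internals.
TUPLE_ELEMENT_NAMES = ("label", "source")  # order declared in the DDL


def struct_to_tuple_positionally(struct_fields):
    """Map an ordered list of (field_name, value) pairs onto the tuple by position.

    The incoming field names are deliberately ignored: only the order counts.
    """
    values = [value for _name, value in struct_fields]
    if len(values) != len(TUPLE_ELEMENT_NAMES):
        raise ValueError("struct and tuple have different arity")
    return dict(zip(TUPLE_ELEMENT_NAMES, values))


# Parquet order {label, source}: values land under the right names.
ok = struct_to_tuple_positionally([("label", "aspirin"), ("source", "ChEMBL")])

# Under purely positional matching, a swapped order would put the values
# under the wrong names without raising an error.
swapped = struct_to_tuple_positionally([("source", "ChEMBL"), ("label", "aspirin")])

print(ok)
print(swapped)
```

In this sketch the swapped case produces a label of "ChEMBL", which is why the DDL keeps label first.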

…arrays

drug_molecule's `synonyms` and `tradeNames` changed from array<string> to
array<struct<label, source>> (opentargets/pts#142, tracking opentargets/issues#4414).
The drug_log ClickHouse table ingests output/drug_molecule directly, so its
column DDL must match the parquet — loading the struct arrays into Array(String)
columns would fail.

Update both to Array(Tuple(label String, source String)), mirroring the existing
crossReferences tuple. The final drug table (postload SELECT *) inherits the new
types, so no other change is needed.
@d0choa d0choa requested a review from remo87 June 11, 2026 11:14
@remo87 remo87 merged commit 49c599f into main Jun 17, 2026
remo87 pushed a commit to opentargets/platform-api that referenced this pull request Jun 17, 2026
drug_molecule's `synonyms` and `tradeNames` changed from array<string> to
array<struct<label, source>> (opentargets/pts#142, tracking opentargets/issues#4414),
and the POS ClickHouse drug table now stores Array(Tuple(label, source))
(opentargets/pos#130).

Read both as Seq[LabelAndSource] (the existing type Target already uses) — the
ClickHouse JSON read path parses the named tuple as a {label, source} object, the
same way crossReferences (Tuple(source, ids)) already works. The Drug GraphQL type
is derived, so the field type auto-updates to [LabelAndSource]; the synonyms/
tradeNames field docs are updated to describe the provenance.
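
For readers who want to see the {label, source} object shape on the consumer side, here is a minimal Python sketch. The platform-api code itself is Scala and is not reproduced here; the `LabelAndSource` dataclass below only borrows the name, `parse_names` is a hypothetical helper, and the sample JSON is made up.

```python
import json
from dataclasses import dataclass


# Hypothetical stand-in for the Scala LabelAndSource type; illustration only.
@dataclass(frozen=True)
class LabelAndSource:
    label: str
    source: str


def parse_names(column_json):
    """Parse a JSON array of {label, source} objects into LabelAndSource values."""
    return [
        LabelAndSource(label=obj["label"], source=obj["source"])
        for obj in json.loads(column_json)
    ]


# Made-up sample of a synonyms column serialized as JSON objects.
sample = (
    '[{"label": "aspirin", "source": "ChEMBL"},'
    ' {"label": "acetylsalicylic acid", "source": "AACT"}]'
)
names = parse_names(sample)
print([n.label for n in names])
```

A legacy array of plain strings would fail in `parse_names` because a string has no `label` key, which mirrors why the read path had to change together with the schema.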