Review request: new stimulus_bath shape + new concentration composite type #46

@stevevanhooser

Reviewers: andrea@walthamdatascience.com, jess@walthamdatascience.com

Companion to #45 (position_metadata / distance_metadata). The Dab corpus surfaces 1605 v1 stimulus_bath documents whose v1 shape doesn't fit the previous V_delta draft. While redesigning, V_delta gains a new named composite type concentration (the only multi-canonical SI composite — see the design note at the end). Filing this for your review before the changes settle.

Live schema on the branch:

Matching did-matlab migrator on claude/did-matlab-v2-import-Rs8AX:

  • src/did/+did2/+convert/+migrators/stimulus_bath.m
  • src/did/+did2/+schema/cache.m (validator switch accepts concentration)

stimulus_bath

Old schema (single solution name + scalar concentration)

super: [base]
depends_on:
* element_id

fields:
* solution_name        char    (REQ)   — name of the bath solution (e.g. "ACSF", "TTX")
  concentration        double          — concentration of the active compound
  concentration_units  char            — units (e.g. "mM", "uM")

v1 (Dab) actual shape

super: [base, epochid]
depends_on: [stimulus_element_id]

property block:
  location:                              { ontologyNode, name }
  mixture_table (CSV string):
     "ontologyName,name,value,ontologyUnit,unitName\n
      <chemical_curie>,<chemical_name>,<value>,<unit_curie>,<unit_name>\n
      ... (one row per chemical, 1-50 chemicals in this corpus) ..."

epochid block:
  epochid: "epoch_<...>"

Row count distribution across the 1605 docs: 1041 have 10 chemicals, 291 have 1, 142 have 2, 131 have 11, etc.

New schema (faithful to v1; uses the new concentration composite)

super: [base, epochid]                  ← epochid restored to match v1
depends_on:
* stimulus_element_id                   ← renamed from element_id to match v1

fields:
* location  (ontology_term, REQ)        — the bath
                                          (e.g. NCIm:C0179246, "Baths, Water, Laboratory")
  mixture   (structure, !mustBeScalar)  — array of per-chemical records:
                chemical (ontology_term, REQ)   — chemical species
                amount   (concentration, opt)   — concentration record
                                                  (see composite below)

Conversion rules

| did_v1 location | V_delta location | Transformation |
|---|---|---|
| stimulus_bath.location.ontologyNode | stimulus_bath.location.node | inner rename (ontologyNode → node); name passes through |
| stimulus_bath.mixture_table (CSV string) | stimulus_bath.mixture(i).chemical, .amount | parse CSV; one record per data row; skip the header row that starts with ontologyName |
| CSV ontologyName column | mixture(i).chemical.node | verbatim |
| CSV name column | mixture(i).chemical.name | verbatim |
| CSV value column | mixture(i).amount.source_value (always) + mixture(i).amount.molar (if unitName ∈ molar-family) + .grams_per_liter (if mass/volume-family) + ... | depends on unitName; see composite below |
| CSV unitName column | mixture(i).amount.source_unit | verbatim |
| CSV ontologyUnit column | (dropped) | not represented in V_delta; the unit CURIE is conveyed via source_unit text. Could be added back as amount.source_unit_node if needed. |
| stimulus_bath.solution_name / concentration / concentration_units (old V_delta draft) | (removed) | v1 has none of these; the chemicals come from the CSV instead |
| (no v1 source) | depends_on[name="stimulus_element_id"] | already present in v1 |
| v1 superclass epochid | V_delta superclass epochid | was missing in the old draft; restored |

Worked v1 → V_delta example

v1 (Dab):

"stimulus_bath": {
  "location": {
    "ontologyNode": "NCIm:C0179246",
    "name":         "Baths, Water, Laboratory"
  },
  "mixture_table":
    "ontologyName,name,value,ontologyUnit,unitName\n
     NCIm:C1098706,arginine-vasopressin,2e-07,OM:MolarVolumeUnit,Molar\n"
}

V_delta after migration:

"stimulus_bath": {
  "location": {
    "node": "NCIm:C0179246",
    "name": "Baths, Water, Laboratory"
  },
  "mixture": [
    {
      "chemical": {"node": "NCIm:C1098706", "name": "arginine-vasopressin"},
      "amount": {
        "molar":        2e-7,
        "approximate":  false,
        "source_unit":  "Molar",
        "source_value": 2e-7
      }
    }
  ]
}
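
The conversion rules and the worked example above can be sketched in Python roughly as follows. This is a minimal illustration, not the did-matlab migrator: the function names `parse_mixture_table` and `migrate_stimulus_bath` are hypothetical, the sketch assumes every data row has exactly five comma-separated columns, and it fills only the source_* sub-fields of amount (populating the canonical sub-fields and approximate is left out).

```python
import csv
import io


def parse_mixture_table(csv_text):
    """Turn a v1 mixture_table CSV string into per-chemical records (hypothetical helper)."""
    records = []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        # Skip blank lines and the header row that starts with ontologyName.
        if not row or row[0].strip() == "ontologyName":
            continue
        # Assumption: exactly five columns; the unit CURIE column is dropped.
        node, name, value, _unit_curie, unit_name = [cell.strip() for cell in row]
        records.append({
            "chemical": {"node": node, "name": name},
            "amount": {"source_unit": unit_name, "source_value": float(value)},
        })
    return records


def migrate_stimulus_bath(v1_block):
    """Apply the location rename and the CSV parse to one v1 stimulus_bath block."""
    location = v1_block["location"]
    return {
        "location": {"node": location["ontologyNode"], "name": location["name"]},
        "mixture": parse_mixture_table(v1_block["mixture_table"]),
    }


v1_example = {
    "location": {"ontologyNode": "NCIm:C0179246", "name": "Baths, Water, Laboratory"},
    "mixture_table": (
        "ontologyName,name,value,ontologyUnit,unitName\n"
        "NCIm:C1098706,arginine-vasopressin,2e-07,OM:MolarVolumeUnit,Molar\n"
    ),
}
migrated = migrate_stimulus_bath(v1_example)
```

A chemical name containing an unquoted comma would break the five-column assumption and raise a ValueError at the unpacking step; whether the Dab corpus contains such rows is not something this sketch checks.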

New composite type: concentration

The other SI composites (duration, volume, mass, length, voltage, current, frequency) each have a single canonical sub-field (e.g., length -> meters) plus approximate / source_unit / source_value. The migrator converts the source value to the canonical unit using a fixed scale factor.

Concentration breaks that pattern: you can't convert mass-per-volume to molar without molecular weight, and vice versa. Forcing a single canonical would mean every concentration that doesn't ship its MW becomes uninterpretable.

So concentration has multiple optional canonical sub-fields, and the migrator populates whichever one the source unit can be converted into:

concentration:
  molar           (double, optional)    — mol/L
  grams_per_liter (double, optional)    — mass/volume
  mass_fraction   (double, optional)    — w/w (dimensionless 0-1)
  volume_fraction (double, optional)    — v/v (dimensionless 0-1)
  approximate     (boolean)             — same as other composites
  source_unit     (char)                — verbatim source unit text
  source_value    (double)              — verbatim source value

Source-unit → canonical mapping (in the did-matlab migrator)

| source_unit family | populates | scale |
|---|---|---|
| Molar / M / mol/L | molar | ×1 |
| Millimolar / mM | molar | ×1e-3 |
| Micromolar / uM / mumolar | molar | ×1e-6 |
| Nanomolar / nM | molar | ×1e-9 |
| Picomolar / pM | molar | ×1e-12 |
| g/L / mg/mL | grams_per_liter | ×1 |
| mg/L / ug/mL | grams_per_liter | ×1e-3 |
| ug/L | grams_per_liter | ×1e-6 |
| w/w | mass_fraction | identity |
| v/v | volume_fraction | identity |
| (unknown) | (none) | source_* only |

Unknown source units leave every canonical sub-field absent but still preserve source_unit / source_value so consumers retain the raw value and can compute canonicals later when the table grows.
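
As a rough illustration of that mapping, a Python sketch might look like the following. The names `UNIT_TABLE` and `to_concentration` are hypothetical, unit text is matched exactly and case-sensitively (an assumption; the real migrator's matching rules may differ), and approximate is set to false on the assumption that a fixed scale factor never makes a value approximate.

```python
# Hypothetical lookup built from the mapping table: unit text -> (canonical field, scale).
_FAMILIES = [
    (("Molar", "M", "mol/L"), "molar", 1.0),
    (("Millimolar", "mM"), "molar", 1e-3),
    (("Micromolar", "uM", "mumolar"), "molar", 1e-6),
    (("Nanomolar", "nM"), "molar", 1e-9),
    (("Picomolar", "pM"), "molar", 1e-12),
    (("g/L", "mg/mL"), "grams_per_liter", 1.0),
    (("mg/L", "ug/mL"), "grams_per_liter", 1e-3),
    (("ug/L",), "grams_per_liter", 1e-6),
    (("w/w",), "mass_fraction", 1.0),
    (("v/v",), "volume_fraction", 1.0),
]
UNIT_TABLE = {
    unit: (field, scale)
    for units, field, scale in _FAMILIES
    for unit in units
}


def to_concentration(source_value, source_unit):
    """Build a concentration record; unknown units keep only the source_* fields."""
    record = {
        "approximate": False,  # assumption: fixed scale factors are treated as exact
        "source_unit": source_unit,
        "source_value": source_value,
    }
    entry = UNIT_TABLE.get(source_unit)
    if entry is not None:
        field, scale = entry
        record[field] = source_value * scale
    return record
```

For example, `to_concentration(2e-7, "Molar")` yields a record with molar set to 2e-7, matching the worked example, while a unit missing from the table yields a record with no canonical sub-field at all.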

Did_schema_meta.json change

  • Added "concentration" to the type enum (around line 213).
  • Extended the top-level description to call out the multi-canonical exception.

Did-matlab validator change

  • did2.schema.cache.validateDocument switch case for composite types adds 'concentration'. Same shape check as the other composites: must be a struct.

Design choices to push back on if you disagree

  1. Multi-canonical-fields vs. discriminator vs. multiple composite types. Picked multi-canonical because it makes queries like "find baths with molar concentration > X" a one-shot test on mixture[*].amount.molar IS NOT NULL AND mixture[*].amount.molar > X instead of branching on a kind field. Downside: schema is slightly busier than a discriminator-style composite.
  2. Mixture as array-of-records named mixture (not mixture_table, solutes, or bath_contents). Open to renaming.
  3. amount field name for the per-chemical concentration (not concentration, since the field type is itself concentration). Avoids amount.concentration.molar-style stutter.
  4. epochid restored as a superclass. v1 carried it; the previous V_delta draft dropped it. This means migrated v2 docs will validate against the epochid block (already present in v1).
  5. ontologyUnit CURIE column dropped during migration. v1 had both ontologyUnit (CURIE) and unitName (text). V_delta keeps only the text via source_unit. An obvious extension is amount.source_unit_node (CURIE) if the ontology classification of the unit matters downstream.
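
To make the query argument in choice 1 concrete, here is a small Python sketch of the one-shot test over migrated documents. It assumes each document is a plain dict shaped like the V_delta example; `has_molar_above` is a hypothetical helper, not part of did-matlab or the simulator.

```python
def has_molar_above(doc, threshold):
    """True if any mixture record carries a molar value greater than threshold."""
    for record in doc.get("stimulus_bath", {}).get("mixture", []):
        # amount is optional, and molar is absent for non-molar source units.
        molar = (record.get("amount") or {}).get("molar")
        if molar is not None and molar > threshold:
            return True
    return False


docs = [
    {"stimulus_bath": {"mixture": [{"amount": {"molar": 2e-7}}]}},
    {"stimulus_bath": {"mixture": [{"amount": {"grams_per_liter": 1.5}}]}},
    {"stimulus_bath": {"mixture": [{"chemical": {"name": "no amount given"}}]}},
]
hits = [doc for doc in docs if has_molar_above(doc, 1e-7)]
```

No branch on a kind field is needed: records expressed in grams_per_liter, or carrying no amount at all, simply fail the presence check.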

Corpus impact (Python simulator after this redesign + migrator):

| corpus | total | migrated | quarantined |
|---|---|---|---|
| PRED | 14 | 14 | 0 |
| 20211116 | 1220 | 1220 | 0 |
| B | 12917 | 12917 | 0 |
| JH | 78688 | 78688 | 0 |
| Dab | 27561 | 27561 | 0 (expected; was 1605 stimulus_bath quarantined) |
