Reviewers: andrea@walthamdatascience.com, jess@walthamdatascience.com
Companion to #45 (position_metadata / distance_metadata). The Dab corpus surfaces 1605 v1 stimulus_bath documents whose v1 shape doesn't fit the previous V_delta draft. While redesigning, V_delta gains a new named composite type concentration (the only multi-canonical SI composite — see the design note at the end). Filing this for your review before the changes settle.
Live schema on the branch:
stable/stimulus_bath.json
stable/did_schema_meta.json (adds concentration to the type enum)
Matching did-matlab migrator on claude/did-matlab-v2-import-Rs8AX:
src/did/+did2/+convert/+migrators/stimulus_bath.m
src/did/+did2/+schema/cache.m (validator switch accepts concentration)
stimulus_bath
Old schema (single solution name + scalar concentration)
super: [base]
depends_on:
* element_id
fields:
* solution_name char (REQ) — name of the bath solution (e.g. "ACSF", "TTX")
concentration double — concentration of the active compound
concentration_units char — units (e.g. "mM", "uM")
v1 (Dab) actual shape
super: [base, epochid]
depends_on: [stimulus_element_id]
property block:
location: { ontologyNode, name }
mixture_table (CSV string):
"ontologyName,name,value,ontologyUnit,unitName\n
<chemical_curie>,<chemical_name>,<value>,<unit_curie>,<unit_name>\n
... (one row per chemical, 1-50 chemicals in this corpus) ..."
epochid block:
epochid: "epoch_<...>"
Row count distribution across the 1605 docs: 1041 have 10 chemicals, 291 have 1, 142 have 2, 131 have 11, etc.
New schema (faithful to v1; uses the new concentration composite)
super: [base, epochid] ← epochid restored to match v1
depends_on:
* stimulus_element_id ← renamed from element_id to match v1
fields:
* location (ontology_term, REQ) — the bath
(e.g. NCIm:C0179246, "Baths, Water, Laboratory")
mixture (structure, !mustBeScalar) — array of per-chemical records:
chemical (ontology_term, REQ) — chemical species
amount (concentration, opt) — concentration record
(see composite below)
Conversion rules
| did_v1 location | V_delta location | Transformation |
| --- | --- | --- |
| stimulus_bath.location.ontologyNode | stimulus_bath.location.node | inner rename (ontologyNode → node); name passes through |
| stimulus_bath.mixture_table (CSV string) | stimulus_bath.mixture(i).chemical, .amount | parse CSV; one record per data row; skip the header row that starts with ontologyName |
| CSV ontologyName column | mixture(i).chemical.node | verbatim |
| CSV name column | mixture(i).chemical.name | verbatim |
| CSV value column | mixture(i).amount.source_value (always) + mixture(i).amount.molar (if unitName ∈ molar-family) + .grams_per_liter (if mass/volume-family) + ... | depends on unitName; see composite below |
| CSV unitName column | mixture(i).amount.source_unit | verbatim |
| CSV ontologyUnit column | (dropped) | not represented in V_delta; the unit CURIE is conveyed via source_unit text. Could be added back as amount.source_unit_node if needed. |
| stimulus_bath.solution_name / concentration / concentration_units (old V_delta draft) | (removed) | v1 has none of these; the chemicals come from the CSV instead |
| (no v1 source) | depends_on[name="stimulus_element_id"] | already present in v1 |
| v1 superclass epochid | V_delta superclass epochid | was missing in the old draft; restored |
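The rules in the table above can be sketched in Python roughly as follows. This is a minimal illustration, not the did-matlab migrator: the function name `migrate_stimulus_bath` is hypothetical, it assumes every data row has exactly five columns, and it fills only the source_unit / source_value part of amount (the canonical sub-fields and approximate are left out).

```python
import csv
import io


def migrate_stimulus_bath(v1_block):
    """Hypothetical sketch: map a v1 stimulus_bath block to the V_delta shape."""
    location = v1_block["location"]
    out = {
        # inner rename ontologyNode -> node; name passes through
        "location": {"node": location["ontologyNode"], "name": location["name"]},
        "mixture": [],
    }
    reader = csv.reader(io.StringIO(v1_block["mixture_table"]))
    for row in reader:
        # skip blank lines and the header row that starts with ontologyName
        if not row or row[0] == "ontologyName":
            continue
        # assumption: exactly five columns per data row
        node, name, value, _unit_curie, unit_name = row
        out["mixture"].append({
            "chemical": {"node": node, "name": name},
            "amount": {
                "source_unit": unit_name,      # verbatim unitName
                "source_value": float(value),  # verbatim value
                # ontologyUnit (_unit_curie) is dropped, as in the table
            },
        })
    return out
```

Using the csv module rather than splitting on commas keeps quoted chemical names with embedded commas intact, if the corpus contains any.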
Worked v1 → V_delta example
v1 (Dab):
"stimulus_bath": {
  "location": {
    "ontologyNode": "NCIm:C0179246",
    "name": "Baths, Water, Laboratory"
  },
  "mixture_table": "ontologyName,name,value,ontologyUnit,unitName\nNCIm:C1098706,arginine-vasopressin,2e-07,OM:MolarVolumeUnit,Molar\n"
}
V_delta after migration:
"stimulus_bath": {
  "location": {
    "node": "NCIm:C0179246",
    "name": "Baths, Water, Laboratory"
  },
  "mixture": [
    {
      "chemical": {"node": "NCIm:C1098706", "name": "arginine-vasopressin"},
      "amount": {
        "molar": 2e-7,
        "approximate": false,
        "source_unit": "Molar",
        "source_value": 2e-7
      }
    }
  ]
}
New composite type: concentration
The other SI composites (duration, volume, mass, length, voltage, current, frequency) each have a single canonical sub-field (e.g., length -> meters) plus approximate / source_unit / source_value. The migrator converts the source value to the canonical unit using a fixed scale factor.
Concentration breaks that pattern: you can't convert mass-per-volume to molar without molecular weight, and vice versa. Forcing a single canonical would mean every concentration that doesn't ship its MW becomes uninterpretable.
So concentration has multiple optional canonical sub-fields, and the migrator populates whichever one the source unit can be converted to with a fixed scale factor:
concentration:
molar (double, optional) — mol/L
grams_per_liter (double, optional) — mass/volume
mass_fraction (double, optional) — w/w (dimensionless 0-1)
volume_fraction (double, optional) — v/v (dimensionless 0-1)
approximate (boolean) — same as other composites
source_unit (char) — verbatim source unit text
source_value (double) — verbatim source value
Source-unit → canonical mapping (in the did-matlab migrator)
| source_unit family | populates | scale |
| --- | --- | --- |
| Molar / M / mol/L | molar | ×1 |
| Millimolar / mM | molar | ×1e-3 |
| Micromolar / uM / mumolar | molar | ×1e-6 |
| Nanomolar / nM | molar | ×1e-9 |
| Picomolar / pM | molar | ×1e-12 |
| g/L / mg/mL | grams_per_liter | ×1 |
| mg/L / ug/mL | grams_per_liter | ×1e-3 |
| ug/L | grams_per_liter | ×1e-6 |
| w/w | mass_fraction | identity |
| v/v | volume_fraction | identity |
| (unknown) | (none) | source_* only |
Unknown source units leave every canonical sub-field absent but still preserve source_unit / source_value so consumers retain the raw value and can compute canonicals later when the table grows.
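As an illustration of the mapping above, here is a minimal Python sketch of how a concentration record could be built. It is not the did-matlab code: `UNIT_TABLE` and `make_concentration` are hypothetical names, unit matching is assumed to be exact and case-sensitive (so "M" and "mM" stay distinct), and approximate is assumed to default to false as in the worked example.

```python
# (canonical sub-field, scale factor) per source unit, transcribed from the table.
UNIT_TABLE = {
    "Molar": ("molar", 1.0), "M": ("molar", 1.0), "mol/L": ("molar", 1.0),
    "Millimolar": ("molar", 1e-3), "mM": ("molar", 1e-3),
    "Micromolar": ("molar", 1e-6), "uM": ("molar", 1e-6), "mumolar": ("molar", 1e-6),
    "Nanomolar": ("molar", 1e-9), "nM": ("molar", 1e-9),
    "Picomolar": ("molar", 1e-12), "pM": ("molar", 1e-12),
    "g/L": ("grams_per_liter", 1.0), "mg/mL": ("grams_per_liter", 1.0),
    "mg/L": ("grams_per_liter", 1e-3), "ug/mL": ("grams_per_liter", 1e-3),
    "ug/L": ("grams_per_liter", 1e-6),
    "w/w": ("mass_fraction", 1.0),
    "v/v": ("volume_fraction", 1.0),
}


def make_concentration(value, unit_name, approximate=False):
    """Hypothetical sketch: build a concentration composite from a source value."""
    record = {
        "approximate": approximate,
        "source_unit": unit_name,   # always preserved verbatim
        "source_value": value,      # always preserved verbatim
    }
    entry = UNIT_TABLE.get(unit_name)  # assumption: exact, case-sensitive match
    if entry is not None:
        field, scale = entry
        record[field] = value * scale
    # Unknown units fall through with only the source_* fields set.
    return record
```

At most one canonical sub-field is ever populated per record, so no molecular weight is needed at migration time.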
did_schema_meta.json change
- Added "concentration" to the type enum (around line 213).
- Extended the top-level description to call out the multi-canonical exception.
did-matlab validator change
did2.schema.cache.validateDocument switch case for composite types adds 'concentration'. Same shape check as the other composites: must be a struct.
Design choices to push back on if you disagree
- Multi-canonical fields vs. discriminator vs. multiple composite types. Picked multi-canonical because it makes queries like "find baths with molar concentration > X" a one-shot test on mixture[*].amount.molar IS NOT NULL AND mixture[*].amount.molar > X instead of branching on a kind field. Downside: schema is slightly busier than a discriminator-style composite.
- Mixture as array-of-records named mixture (not mixture_table, solutes, or bath_contents). Open to renaming.
- amount field name for the per-chemical concentration (not concentration, since the field type is itself concentration). Avoids amount.concentration.molar-style stutter.
- epochid restored as a superclass. v1 carried it; the previous V_delta draft dropped it. This means migrated v2 docs will validate against the epochid block (already present in v1).
- ontologyUnit CURIE column dropped during migration. v1 had both ontologyUnit (CURIE) and unitName (text). V_delta keeps only the text via source_unit. An obvious extension is amount.source_unit_node (CURIE) if the ontology classification of the unit matters downstream.
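To make the first design choice concrete, the one-shot molar query could look roughly like this over migrated documents held as Python dicts. This is only a sketch: the helper names are hypothetical, and it assumes mixture[*] means "any record in the mixture" and that amount may be absent, since the schema marks it optional.

```python
def molar_of(record):
    """Return amount.molar for one mixture record, or None if not populated."""
    amount = record.get("amount") or {}
    return amount.get("molar")


def baths_with_molar_above(docs, threshold):
    """Hypothetical query: docs where any mixture record has molar > threshold."""
    hits = []
    for doc in docs:
        mixture = doc.get("stimulus_bath", {}).get("mixture", [])
        if any(
            molar_of(rec) is not None and molar_of(rec) > threshold
            for rec in mixture
        ):
            hits.append(doc)
    return hits
```

Records that only carry grams_per_liter, a fraction, or source_* fields are skipped by the presence test, with no branching on a kind field.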
Corpus impact (Python simulator after this redesign + migrator):
| corpus | total | migrated | quarantined |
| --- | --- | --- | --- |
| PRED | 14 | 14 | 0 |
| 20211116 | 1220 | 1220 | 0 |
| B | 12917 | 12917 | 0 |
| JH | 78688 | 78688 | 0 |
| Dab | 27561 | 27561 | 0 (expected; was 1605 stimulus_bath quarantined) |