Statistical Distribution Comparison for Mapping Validation #5
Replies: 1 comment
Quick reflection: this has been tried in the DataQualityDashboard to define plausible ranges for different measurements. It was not successful: the ranges were difficult to fine-tune and often produced many false positives, e.g. because different ranges were plausible in different healthcare settings. Before we pursue this further, we need to think about how we evaluate differences in distribution (when do we consider two distributions different?) and what we do when the distributions do not match (how do we determine which distribution is the correct one?). Note that tools like Achilles can already create the value distribution per OMOP concept and visualise it in e.g. Atlas.
Context
When mapping source concepts to OMOP standard concepts, current tools (Usagi, LLM-based approaches) rely primarily on textual/semantic similarity. However, validating that a mapping is clinically correct often requires domain expertise.
We propose an additional validation approach: comparing statistical distributions between source data and reference distributions for target concepts.
Approach
INDICATE stores reference distribution profiles as JSON for each General Concept. When a user maps a source concept, they can compare their source data distribution against the expected distribution.
Example: Heart Rate (Adult profile)
{
  "data_types": ["numeric"],
  "numeric_data": {
    "min": 25,
    "max": 220,
    "mean": 82.4,
    "median": 78,
    "sd": 18.6,
    "p5": 52,
    "p25": 68,
    "p75": 92,
    "p95": 118
  },
  "histogram": [
    {"x": 30, "count": 1245},
    {"x": 40, "count": 4982},
    {"x": 50, "count": 37284},
    {"x": 60, "count": 124568},
    {"x": 70, "count": 286542},
    {"x": 80, "count": 324567},
    {"x": 90, "count": 248956},
    {"x": 100, "count": 124568},
    {"x": 110, "count": 56234},
    {"x": 120, "count": 24856},
    {"x": 130, "count": 8456},
    {"x": 140, "count": 2845},
    {"x": 150, "count": 956}
  ],
  "measurement_frequency": {"typical_interval": "hourly"},
  "missing_rate": 2.1
}
Use Case
A data engineer maps a source variable "FC" to the General Concept "Heart Rate". By uploading or entering their source distribution, they can visually compare:
This helps non-experts validate mappings without deep clinical knowledge.
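To make the comparison concrete, here is a minimal Python sketch of how a source sample might be scored against the heart-rate reference histogram from the example. It is an illustration, not INDICATE's implementation: the function names are hypothetical, the choice of total variation distance is mine, and it assumes each histogram "x" is the lower edge of a 10-unit bin (the post does not say whether "x" is an edge or a centre). Values outside the reference range are clamped into the first or last bin.

```python
from collections import Counter

# Histogram copied from the heart-rate (adult) reference profile in this post.
REFERENCE_HISTOGRAM = [
    {"x": 30, "count": 1245}, {"x": 40, "count": 4982},
    {"x": 50, "count": 37284}, {"x": 60, "count": 124568},
    {"x": 70, "count": 286542}, {"x": 80, "count": 324567},
    {"x": 90, "count": 248956}, {"x": 100, "count": 124568},
    {"x": 110, "count": 56234}, {"x": 120, "count": 24856},
    {"x": 130, "count": 8456}, {"x": 140, "count": 2845},
    {"x": 150, "count": 956},
]

BIN_WIDTH = 10  # assumption: "x" is the lower edge of a 10-unit-wide bin


def normalise(counts):
    """Turn {bin: count} into {bin: proportion}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalise an empty histogram")
    return {x: c / total for x, c in counts.items()}


def bin_values(values, bin_edges, width=BIN_WIDTH):
    """Count values per bin, clamping out-of-range values to the end bins."""
    lowest, highest = min(bin_edges), max(bin_edges)
    counts = Counter()
    for v in values:
        x = int(v // width) * width
        counts[min(max(x, lowest), highest)] += 1
    return counts


def total_variation(p, q):
    """Total variation distance: 0 = identical, 1 = no overlap."""
    bins = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in bins)


def distance_to_reference(values, histogram=REFERENCE_HISTOGRAM):
    """Score a source sample against a reference histogram."""
    ref_counts = {b["x"]: b["count"] for b in histogram}
    source = normalise(bin_values(values, ref_counts.keys()))
    return total_variation(source, normalise(ref_counts))


# "FC" values that look like heart rates, and values that look like
# body temperatures in degrees Celsius (both samples are made up).
heart_rate_like = [62, 71, 75, 78, 80, 84, 88, 91, 95, 103]
temperature_like = [36.4, 36.6, 36.8, 37.0, 37.1, 37.5, 38.2, 38.9]

print(distance_to_reference(heart_rate_like))
print(distance_to_reference(temperature_like))
```

The heart-rate-like sample scores much closer to 0 than the temperature-like one, which lands almost entirely in the lowest bin. Where to draw the line between "matches" and "does not match" is deliberately left out: that threshold is the hard part, and it probably depends on the care setting.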
Benefits
Open Question: Where to find reference distributions?
This approach requires reference distributions for common clinical concepts. Currently, there is no standardized source for this.
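Whatever the source turns out to be, a profile in the JSON shape shown in the example could be computed from raw values. The sketch below is one hedged way to do that using only the Python standard library. `build_profile` and `percentile` are hypothetical helpers, the 10-unit bin width and the reading of `missing_rate` as a percentage are assumptions, and `measurement_frequency` is omitted because it cannot be derived from values alone.

```python
import json
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in 0..100) of pre-sorted values."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def build_profile(values, bin_width=10):
    """Summarise numeric values (None = missing) as a distribution profile."""
    present = sorted(v for v in values if v is not None)
    if len(present) < 2:
        raise ValueError("need at least two non-missing values")

    # Histogram keyed by the lower edge of each bin (assumed convention).
    counts = {}
    for v in present:
        x = int(v // bin_width) * bin_width
        counts[x] = counts.get(x, 0) + 1

    return {
        "data_types": ["numeric"],
        "numeric_data": {
            "min": present[0],
            "max": present[-1],
            "mean": round(statistics.mean(present), 1),
            "median": statistics.median(present),
            "sd": round(statistics.stdev(present), 1),  # sample standard deviation
            "p5": round(percentile(present, 5), 1),
            "p25": round(percentile(present, 25), 1),
            "p75": round(percentile(present, 75), 1),
            "p95": round(percentile(present, 95), 1),
        },
        "histogram": [{"x": x, "count": c} for x, c in sorted(counts.items())],
        # Assumption: missing_rate is a percentage of all recorded entries.
        "missing_rate": round(100 * (len(values) - len(present)) / len(values), 1),
    }


# Made-up sample with one missing entry.
profile = build_profile([60, 70, None, 80, 90])
print(json.dumps(profile, indent=2))
```

A profile built this way inherits the case mix of the data it came from, so it would need to be labelled with its population and setting (as the "Adult profile" example is) before being used as a reference.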
Potential sources: