Statistical Distribution Comparison for Mapping Validation #5

BorisDelange (Maintainer) · 2026-01-16T15:35:54Z

Context

When mapping source concepts to OMOP standard concepts, current tools (Usagi, LLM-based approaches) rely primarily on textual/semantic similarity. However, validating that a mapping is clinically correct often requires domain expertise.

We propose an additional validation approach: comparing statistical distributions between source data and reference distributions for target concepts.

Approach

INDICATE stores reference distribution profiles as JSON for each General Concept. When a user maps a source concept, they can compare their source data distribution against the expected distribution.

Example: Heart Rate (Adult profile)

{
  "data_types": ["numeric"],
  "numeric_data": {
    "min": 25,
    "max": 220,
    "mean": 82.4,
    "median": 78,
    "sd": 18.6,
    "p5": 52,
    "p25": 68,
    "p75": 92,
    "p95": 118
  },
  "histogram": [
    {"x": 30, "count": 1245},
    {"x": 40, "count": 4982},
    {"x": 50, "count": 37284},
    {"x": 60, "count": 124568},
    {"x": 70, "count": 286542},
    {"x": 80, "count": 324567},
    {"x": 90, "count": 248956},
    {"x": 100, "count": 124568},
    {"x": 110, "count": 56234},
    {"x": 120, "count": 24856},
    {"x": 130, "count": 8456},
    {"x": 140, "count": 2845},
    {"x": 150, "count": 956}
  ],
  "measurement_frequency": {"typical_interval": "hourly"},
  "missing_rate": 2.1
}
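As a rough illustration of how such a profile could be produced from raw values, here is a minimal Python sketch. It is not INDICATE's actual implementation: the function name `build_profile`, the 10-unit bin width, the reading of each histogram "x" as a bin's lower edge, and the reading of "missing_rate" as a percentage are all assumptions. The "measurement_frequency" field is left out because it cannot be derived from values alone.

```python
import math
import statistics
from collections import Counter


def build_profile(values, bin_width=10):
    """Summarise raw numeric values into a profile shaped like the example above.

    Assumptions (not specified in the proposal): each histogram "x" is the
    lower edge of a fixed-width bin, and percentiles use the default
    "exclusive" method of statistics.quantiles.
    """
    present = [v for v in values if v is not None]
    if len(present) < 2:
        raise ValueError("need at least two non-missing values")

    # 19 cut points: index 0 is p5, 4 is p25, 14 is p75, 18 is p95.
    cuts = statistics.quantiles(present, n=20)

    # Count values per bin, keyed by the bin's lower edge.
    bins = Counter(math.floor(v / bin_width) * bin_width for v in present)

    return {
        "data_types": ["numeric"],
        "numeric_data": {
            "min": min(present),
            "max": max(present),
            "mean": round(statistics.fmean(present), 1),
            "median": statistics.median(present),
            "sd": round(statistics.stdev(present), 1),
            "p5": cuts[0],
            "p25": cuts[4],
            "p75": cuts[14],
            "p95": cuts[18],
        },
        "histogram": [{"x": x, "count": bins[x]} for x in sorted(bins)],
        # Share of missing entries, expressed as a percentage (assumed scale).
        "missing_rate": round(100 * (len(values) - len(present)) / len(values), 1),
    }
```

With small samples the exclusive percentile method can extrapolate slightly beyond the observed min/max, so a real profile would need a large sample and a documented percentile definition.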

Use Case

A data engineer maps a source variable "FC" to the General Concept "Heart Rate". By uploading or entering their source distribution, they can visually compare:

Does the mean/median fall within expected ranges?
Is the distribution shape similar?
Are there outliers suggesting unit conversion issues (e.g., bpm vs Hz)?

This helps non-experts validate mappings without deep clinical knowledge.
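The first and third checks above could be automated along these lines. This is only a sketch under assumptions: the function name, the status labels, the choice of the p5 to p95 band as the "expected range", and the list of candidate conversion factors are all hypothetical and would need tuning per concept and care setting.

```python
# Values copied from the Heart Rate (Adult profile) example.
REFERENCE = {"median": 78, "p5": 52, "p95": 118}

# Hypothetical conversion factors worth testing; 60 covers Hz vs. per-minute rates.
CANDIDATE_FACTORS = (60, 1 / 60, 1000, 1 / 1000)


def check_against_reference(source_median, reference=REFERENCE):
    """Return a coarse verdict for a source median against a reference profile."""
    if reference["p5"] <= source_median <= reference["p95"]:
        return {"status": "plausible", "factor": None}

    # Outside the central 90% band: see whether a simple rescaling would fit.
    for factor in CANDIDATE_FACTORS:
        if reference["p5"] <= source_median * factor <= reference["p95"]:
            return {"status": "possible_unit_mismatch", "factor": factor}

    return {"status": "implausible", "factor": None}
```

For instance, a source median of 1.3 falls outside the band but fits after multiplying by 60, which is what heart rate recorded in Hz rather than bpm would look like.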

Benefits

Visual validation: Side-by-side comparison of source vs reference distributions
Unit error detection: Distributions with unexpected ranges may indicate unit conversion issues
Confidence for non-experts: Data engineers can validate mappings without clinical expertise
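A visual comparison could be complemented by a single numeric score for shape similarity. The sketch below uses total variation distance between normalised histograms; this metric is chosen here purely for illustration and is not part of the proposal. It assumes both histograms share the same bin edges, and the function names are hypothetical. What threshold should count as "different" remains an open question.

```python
def normalise(histogram):
    """Convert [{"x": ..., "count": ...}] into {x: proportion}."""
    total = sum(b["count"] for b in histogram)
    if total == 0:
        raise ValueError("histogram has no observations")
    return {b["x"]: b["count"] / total for b in histogram}


def total_variation_distance(source_hist, reference_hist):
    """Half the summed absolute difference in bin proportions.

    0 means identical shapes, 1 means no overlapping bins.
    Assumes both histograms use the same bin edges.
    """
    p = normalise(source_hist)
    q = normalise(reference_hist)
    all_bins = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in all_bins)
```

Because counts are normalised first, a small source dataset can be compared against a much larger reference without rescaling.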

Open Question: Where to find reference distributions?

This approach requires reference distributions for common clinical concepts. Currently, there is no standardized source for this.

Potential sources:

MIMIC-IV / eICU-CRD (open ICU databases)
Published reference ranges (but these are typically min/max only, not full distributions)

MaximMoinat · 2026-01-23T19:34:50Z

Quick reflection: this has been tried in the DataQualityDashboard to define plausible ranges for different measurements. It was not successful: the ranges were difficult to fine-tune and often produced many false positives, e.g. because different ranges were plausible in different healthcare settings.

Before we pursue this further, we need to think about how we evaluate differences in distribution (when do we consider distributions different) and what we do when the distributions do not match (how do we determine which is the correct distribution).

Note that tools like Achilles already have the capability to create the value distribution per OMOP concept and visualise this in e.g. Atlas.
