Giant tile-spanning blotches survive clustering pipeline #10

@michaelaye

Summary

Tile APF0000pqj (obsid ESP_020568_0950) contains a blotch entry with radius_1=450.6, radius_2=356.6 — larger than the tile itself (840×648 px). This is not an isolated case: 4,418 blotches (1.03% of catalog) have at least one radius > 200 px, and 36 blotches have both radii > 300 px (truly tile-spanning).

Root cause

The radii sub-clustering works correctly — it groups similarly-sized markings together. The problem is that there is no physical bounds filter to reject clusters whose averaged dimensions exceed the tile.

For APF0000pqj, 4 users drew tile-spanning ellipses centered near tile center (420, 324):

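| User | x | y | radius_1 | radius_2 | Offset from center (dx, dy) |
|---|---|---|---|---|---|
| cosycoysh | 402 | 366 | 458 | 372 | (18, 42) |
| Chris.Parker | 403 | 328 | 465 | 349 | (17, 4) |
| 2715 | 407 | 320 | 377 | 320 | (13, 4) |
| andyt26767 | 414 | 348 | 429 | 349 | (6, 24) |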
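
As a quick sanity check on these rows, the sketch below recomputes each marking's offset from the tile center and compares its full extent along the first axis (twice radius_1) with the tile width. The `markings` list just transcribes the rows above; `center_offset` and the other names are illustrative and not part of the pipeline.

```python
# Tile geometry from the summary: 840 x 648 px, so the center is (420, 324).
TILE_W, TILE_H = 840, 648
CENTER = (TILE_W // 2, TILE_H // 2)

# (user, x, y, radius_1, radius_2), transcribed from the rows above.
markings = [
    ("cosycoysh", 402, 366, 458, 372),
    ("Chris.Parker", 403, 328, 465, 349),
    ("2715", 407, 320, 377, 320),
    ("andyt26767", 414, 348, 429, 349),
]


def center_offset(x, y, center=CENTER):
    """Absolute per-axis distance of a marking's center from the tile center."""
    return abs(x - center[0]), abs(y - center[1])


for user, x, y, r1, r2 in markings:
    dx, dy = center_offset(x, y)
    # Assumption: radius_1 is a semi-axis length, so the full extent is 2 * radius_1.
    wider_than_tile = 2 * r1 > TILE_W
    print(f"{user:>13}: offset=({dx}, {dy}), 2*radius_1={2 * r1}, wider than tile: {wider_than_tile}")
```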

These users essentially marked "the whole tile is one big feature" — which is not a meaningful scientific measurement.

Suggested mitigations

Option A: Physical bounds filter on cluster output

Reject any averaged blotch cluster where radius_1 or radius_2 exceeds a threshold (e.g., half the tile dimension — 420 or 324 px). Simple, conservative, catches the most egregious cases.
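
A minimal sketch of such an output-side filter, assuming each averaged cluster is available as a mapping with `radius_1` and `radius_2` keys. Pairing radius_1 with 420 px and radius_2 with 324 px is only one reading of "half the tile dimension"; the function name `within_bounds` and the second example cluster are hypothetical.

```python
TILE_W, TILE_H = 840, 648  # tile size in px, from the summary


def within_bounds(cluster, max_r1=TILE_W / 2, max_r2=TILE_H / 2):
    """Return True if an averaged blotch cluster passes the bounds filter.

    Assumption: `cluster` is a mapping with `radius_1` and `radius_2` in px.
    """
    return cluster["radius_1"] <= max_r1 and cluster["radius_2"] <= max_r2


clusters = [
    {"radius_1": 450.6, "radius_2": 356.6},  # the APF0000pqj entry from the summary
    {"radius_1": 60.0, "radius_2": 35.0},    # made-up example of an ordinary blotch
]

# Keep only clusters whose averaged radii fit within the thresholds.
kept = [c for c in clusters if within_bounds(c)]
print(kept)
```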

Option B: Center-proximity + large-radius heuristic

Filter markings where the center is near the tile center AND the radii are large. The rationale: a legitimate large blotch can be centered anywhere, but a "whole tile" marking will tend toward the center. Analysis of the 36 tile-spanning blotches shows only 4 are within 100 px of tile center — but most legitimate large features aren't centered either, so this may not add much over Option A.
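
For illustration, the heuristic could look roughly like the sketch below. The 100 px offset and the 300 px radius cut are borrowed from the numbers quoted in this issue and are not tuned values; `looks_like_whole_tile` is a hypothetical helper name.

```python
import math

TILE_CENTER = (420, 324)  # center of an 840 x 648 px tile


def looks_like_whole_tile(x, y, radius_1, radius_2, max_offset=100, min_radius=300):
    """Flag a marking that is both near the tile center and large in both radii.

    The default thresholds are illustrative assumptions, not pipeline settings.
    """
    dist = math.hypot(x - TILE_CENTER[0], y - TILE_CENTER[1])
    return dist <= max_offset and radius_1 > min_radius and radius_2 > min_radius


# A marking from APF0000pqj: close to the center with large radii.
print(looks_like_whole_tile(402, 366, 458, 372))
# A made-up large marking far from the center is not flagged.
print(looks_like_whole_tile(700, 100, 350, 320))
```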

Option C: Pre-filter individual markings before clustering

In filter_data() or at the start of cluster_image_id(), discard individual markings where radius_1 or radius_2 > max_allowed (e.g., 420 px). This keeps them out of the clustering pipeline entirely and is arguably the cleanest solution, since these markings represent user error, not scientific signal.
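
A rough sketch of the pre-filter, written as a standalone helper because the signatures of filter_data() and cluster_image_id() are not shown here. It checks both radii, following the "either radius" wording in the recommendation. `drop_oversized_markings` and the list-of-dicts layout are assumptions made for illustration.

```python
MAX_ALLOWED = 420  # px, half the tile width; the example threshold from the text


def drop_oversized_markings(markings, max_allowed=MAX_ALLOWED):
    """Keep only markings whose radii are both at or below `max_allowed`.

    Assumption: each marking is a mapping with `radius_1` and `radius_2` in px.
    """
    return [
        m for m in markings
        if m["radius_1"] <= max_allowed and m["radius_2"] <= max_allowed
    ]


# The four APF0000pqj markings listed under "Root cause".
raw = [
    {"user": "cosycoysh", "radius_1": 458, "radius_2": 372},
    {"user": "Chris.Parker", "radius_1": 465, "radius_2": 349},
    {"user": "2715", "radius_1": 377, "radius_2": 320},
    {"user": "andyt26767", "radius_1": 429, "radius_2": 349},
]

filtered = drop_oversized_markings(raw)
print([m["user"] for m in filtered])
```

With the 420 px example threshold, three of the four APF0000pqj markings are dropped and the one from user 2715 (377 × 320 px) remains, which is worth keeping in mind when choosing the threshold.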

Recommendation

Option C (pre-filter) is cleanest: discard individual blotch markings where either radius exceeds half the tile width (420 px) before clustering. This removes the noise at source without adding post-hoc filters. Option A as a safety net on the output side would catch any edge cases that slip through.

Scale of impact

  • 4,418 blotches with radius > 200 px (1.03% of catalog)
  • 36 blotches with both radii > 300 px
  • 80% of the blotches with a radius > 200 px are far from the tile center (>200 px offset), which suggests many are legitimate large features, so the choice of filter threshold matters

Catalog version

v3.1
