Markings killed by sub-clustering are lost to the large-object run #11

Description

@michaelaye

Summary

Markings that enter an XY cluster in the small-object run but whose cluster is subsequently destroyed by radii (or angle) sub-clustering are permanently removed from the pool. They never reach the large-object run, even though they might form valid clusters there.

Mechanism

In _cluster_pipeline (05e_production.dbscan.ipynb), _calculate_unclustered is called immediately after XY clustering, before radii and angle sub-clustering:

xyclusters = self.cluster_xy(data, eps)
xyclusters = list(xyclusters)
self._calculate_unclustered(data, xyclusters)    # removes all XY-clustered points NOW
# ... radii sub-clustering may destroy some clusters ...
# ... angle sub-clustering may destroy more ...
# ... but their members are already gone from self.remaining

So any marking that was part of an XY cluster — even one that doesn't survive — is excluded from self.remaining and never reaches the large-object run.

Example: APF0000jei (ESP_021829_0985)

A large, clearly visible elliptical blotch at the bottom-center of the tile is marked by 12 different users at positions spanning x: 246–362, y: 509–584 (116×75 px spread):

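| User | x | y | r1 | r2 |
|---|---|---|---|---|
| MyC0w | 246 | 509 | 227 | 123 |
| Ben Teji | 259 | 550 | 121 | 91 |
| Strufus78 | 285 | 583 | 193 | 152 |
| jmortimer | 292 | 533 | 141 | 106 |
| umajv | 294 | 548 | 195 | 146 |
| WKDCon | 304 | 552 | 138 | 104 |
| basty808 | 308 | 563 | 131 | 98 |
| Frankee64 | 314 | 555 | 240 | 113 |
| Rad awes | 327 | 570 | 161 | 121 |
| not-logged-in | 330 | 584 | 168 | 126 |
| jackielivesey | 350 | 554 | 114 | 86 |
| bruno.edwards | 362 | 583 | 174 | 130 |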

What happens:

  1. Small run (eps_xy=10): 3 markings cluster (WKDCon, basty808, Frankee64 at x: 304–314). Radii sub-clustering with eps=30 destroys this cluster because r1 alone spans 131–240, far beyond eps=30. All 3 become noise, but _calculate_unclustered has already removed them from self.remaining.

  2. Large run (eps_xy=25): Only 9 of the 12 markings remain. With eps=25, they span x: 246–362 (116 px) — too wide to fully connect. A sub-group clusters, but after radii sub-clustering it falls below min_samples and is filtered.

  3. Result: No blotch entry in the catalog, despite 12 independent users agreeing the feature exists.

If the 3 killed-by-radii markings were returned to the pool, the large run would have all 12 markings. With eps=25, a 7-member cluster forms (x: 292–330), and radii sub-clustering with eps=50 produces a surviving 6-member cluster (r1: 131–195).
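The membership counts above can be sanity-checked against the rounded values in the table. The sketch below is not the pipeline's DBSCAN: it only groups markings that are linked by chains of steps no longer than eps, with no min_samples test. It also assumes that radii sub-clustering compares (r1, r2) pairs, which the issue does not state. The helper names eps_groups and largest_group are made up for this illustration.

```python
from math import dist

# Rounded values copied from the table above: user -> (x, y, r1, r2)
MARKINGS = {
    "MyC0w": (246, 509, 227, 123),
    "Ben Teji": (259, 550, 121, 91),
    "Strufus78": (285, 583, 193, 152),
    "jmortimer": (292, 533, 141, 106),
    "umajv": (294, 548, 195, 146),
    "WKDCon": (304, 552, 138, 104),
    "basty808": (308, 563, 131, 98),
    "Frankee64": (314, 555, 240, 113),
    "Rad awes": (327, 570, 161, 121),
    "not-logged-in": (330, 584, 168, 126),
    "jackielivesey": (350, 554, 114, 86),
    "bruno.edwards": (362, 583, 174, 130),
}


def eps_groups(points, eps):
    """Group keys whose points are linked by chains of steps <= eps.

    Plain connectivity only (hypothetical helper): no min_samples test.
    """
    unvisited = set(points)
    groups = []
    while unvisited:
        seed = unvisited.pop()
        group, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            near = {
                other for other in unvisited
                if dist(points[current], points[other]) <= eps
            }
            unvisited -= near
            group |= near
            frontier.extend(near)
        groups.append(group)
    return groups


def largest_group(points, eps):
    """Return the biggest eps-connected group (hypothetical helper)."""
    return max(eps_groups(points, eps), key=len)


xy = {user: (x, y) for user, (x, y, _, _) in MARKINGS.items()}
killed = {"WKDCon", "basty808", "Frankee64"}  # removed by the small run

# XY stage of the large run, with and without the three lost markings
full = largest_group(xy, eps=25)
reduced = largest_group(
    {user: p for user, p in xy.items() if user not in killed}, eps=25
)

# Radii stage on the full group, ASSUMING (r1, r2) pairs are compared
radii = {user: MARKINGS[user][2:] for user in full}
survivors = largest_group(radii, eps=50)

print(sorted(full))
print(len(reduced))
print(sorted(survivors))
```

With the rounded table values, the full pool yields the seven markings at x: 292–330, and the radii step then drops only Frankee64 (r1 = 240). The reduced pool of 9 yields only much smaller groups. Exact results in the pipeline may differ, because it works on unrounded coordinates and applies min_samples.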

Suggested fix

Move _calculate_unclustered after all sub-clustering stages, using finalclusters instead of xyclusters:

def _cluster_pipeline(self, kind, data, eps, eps_rad):
    xyclusters = self.cluster_xy(data, eps)
    xyclusters = list(xyclusters)
    # _calculate_unclustered was here — too early
    if self.with_radii and eps_rad is not None:
        last = self.cluster_radii(xyclusters, eps_rad)
    else:
        last = xyclusters
    last = list(last)
    if self.with_angles and self.eps_values[kind]["angle"] is not None:
        finalclusters = self.cluster_angles(last, kind)
    else:
        finalclusters = last
    finalclusters = list(finalclusters)
    self.finalclusters = finalclusters
    self._calculate_unclustered(data, finalclusters)  # ← moved here
    averaged = get_average_objects(finalclusters, kind)
    ...

This has no downside for surviving clusters: their members are still excluded from self.remaining. Only markings that end up in no final cluster, because sub-clustering destroyed or trimmed their XY cluster, are returned to the pool.
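The effect of the ordering can be illustrated with a toy model of the bookkeeping. This is a minimal sketch, not the notebook's code: remaining_after stands in for _calculate_unclustered (whose body is not shown in this issue), and the marking names and clusters are invented.

```python
def remaining_after(pool, clusters):
    """Return the markings in pool that belong to none of the clusters.

    Hypothetical stand-in for _calculate_unclustered.
    """
    clustered = set().union(*clusters)
    return [marking for marking in pool if marking not in clustered]


# Invented small-run state: two XY clusters form, but the first one
# is destroyed by radii sub-clustering and only the second survives.
pool = ["a", "b", "c", "d", "e", "f", "g"]
xy_clusters = [{"a", "b", "c"}, {"d", "e", "f"}]
final_clusters = [{"d", "e", "f"}]

# Current order: bookkeeping runs on the XY clusters
early = remaining_after(pool, xy_clusters)

# Proposed order: bookkeeping runs on the final clusters
late = remaining_after(pool, final_clusters)

print(early)  # ['g']
print(late)   # ['a', 'b', 'c', 'g']
```

In both orders the members of the surviving cluster (d, e, f) stay out of the pool. Only a, b, and c, whose cluster did not survive, are handed on to the large run under the proposed order.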

Target version

v3.2
