fix: close concurrent-insert duplicate-URL race on poi_news#444
Conversation
saveNewsItems dedups with a SELECT-then-INSERT that has no DB-level guard: two concurrent collection branches can both read "not present" and both insert the same URL. Observed firing prolifically under multi-POI runs (e.g. shared regional articles, and repeatedly on the historicbridges POI), landing duplicate pending items in the moderation queue. - migration 069: partial unique index on normalized news source_url, plus an idempotent status-aware safety-net dedup (published > pending > rejected). Events excluded: distinct events legitimately share a generic source URL. - newsService: ON CONFLICT DO NOTHING + rowCount check so a lost race logs as a duplicate instead of erroring. Also fix 3 unrelated pre-existing test failures surfaced by the run: - ogShareImages: assert the real source-image -> POI-photo -> brand priority for news permalinks (PR #385 behavior), not an unconditional POI thumbnail. - ui.integration (2x More Info link): open a POI that has a more_info_link via its bare-path permalink instead of clicking an arbitrary first marker, which now often lands on an amenity POI with no link. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request addresses a concurrent-insert duplicate-URL race condition on the poi_news table by introducing a unique index on normalized source URLs and handling conflicts gracefully in newsService.js using ON CONFLICT DO NOTHING. Additionally, integration tests have been updated to ensure more deterministic behavior when testing UI elements and share images. Regarding the database migration, the current self-join approach for deleting pre-existing duplicates can be highly inefficient on large tables; it is recommended to optimize this query using a window function like ROW_NUMBER() to perform the cleanup in a single pass.
| DELETE FROM poi_news a | ||
| USING poi_news b | ||
| WHERE a.source_url IS NOT NULL | ||
| AND b.source_url IS NOT NULL | ||
| AND LOWER(REGEXP_REPLACE(a.source_url, '/+$', '')) = LOWER(REGEXP_REPLACE(b.source_url, '/+$', '')) | ||
| AND (CASE a.moderation_status WHEN 'published' THEN 0 WHEN 'pending' THEN 1 ELSE 2 END, a.id) | ||
| > (CASE b.moderation_status WHEN 'published' THEN 0 WHEN 'pending' THEN 1 ELSE 2 END, b.id); |
There was a problem hiding this comment.
Using a self-join with USING to delete duplicates can be highly inefficient on large tables because it requires an
A more performant and standard PostgreSQL approach is to use a subquery with ROW_NUMBER(). This scans the table once, partitions and sorts the rows, and deletes all but the first row in each partition.
| DELETE FROM poi_news a | |
| USING poi_news b | |
| WHERE a.source_url IS NOT NULL | |
| AND b.source_url IS NOT NULL | |
| AND LOWER(REGEXP_REPLACE(a.source_url, '/+$', '')) = LOWER(REGEXP_REPLACE(b.source_url, '/+$', '')) | |
| AND (CASE a.moderation_status WHEN 'published' THEN 0 WHEN 'pending' THEN 1 ELSE 2 END, a.id) | |
| > (CASE b.moderation_status WHEN 'published' THEN 0 WHEN 'pending' THEN 1 ELSE 2 END, b.id); | |
| DELETE FROM poi_news | |
| WHERE id IN ( | |
| SELECT id | |
| FROM ( | |
| SELECT id, | |
| ROW_NUMBER() OVER ( | |
| PARTITION BY LOWER(REGEXP_REPLACE(source_url, '/+$', '')) | |
| ORDER BY CASE moderation_status WHEN 'published' THEN 0 WHEN 'pending' THEN 1 ELSE 2 END, id | |
| ) as rn | |
| FROM poi_news | |
| WHERE source_url IS NOT NULL | |
| ) t | |
| WHERE rn > 1 | |
| ); |
Summary
saveNewsItemsdedups news with aSELECT-then-INSERTthat has no DB-level guard. Two collection branches processing the same URL concurrently (e.g. a regional article relevant to several nearby POIs, or the same page reached via two link paths) both read "not present" and both insert — producing duplicate pending items in the moderation queue. Investigation against prod found this firing prolifically during multi-POI runs (a single run created ~16 duplicate pending pairs; 69 dup groups / 81 redundant rows accumulated after the monthly collection).Fix
UNIQUEindex on normalizedsource_url(LOWER(REGEXP_REPLACE(source_url,'/+$','')),WHERE source_url IS NOT NULL), plus an idempotent, status-aware safety-netDELETEthat collapses any pre-existing duplicates (keeps published > pending > rejected, one row per URL → preserves dedup tombstones). Events are intentionally excluded — distinct events legitimately share a generic source URL (a venue homepage/calendar).ON CONFLICT DO NOTHING+rowCountcheck, so a lost race logs as a duplicate instead of erroring. Exactly one insert wins; never zero, never two.Also: 3 unrelated pre-existing test fixes (surfaced by the run)
ogShareImages— assert the real source-image → POI-photo → brand priority for news permalinks (PR feat: source article image for news/event share previews #385 behavior) instead of an unconditional POI thumbnail.ui.integration(2× More Info link) — open a POI that has amore_info_linkvia its bare-path permalink/<slug>instead of clicking an arbitrary first marker, which now often lands on an amenity POI with no link.Test plan
./run.sh test→ 30 files, 370 passed, 1 skipped, 0 failed.DELETEcollapses dups, index builds, andON CONFLICT DO NOTHINGno-ops a duplicate insert without error.🤖 Generated with Claude Code