Skip to content

fix: close concurrent-insert duplicate-URL race on poi_news#444

Merged
fatherlinux merged 1 commit into
masterfrom
fix/news-url-dedup-race
May 29, 2026
Merged

fix: close concurrent-insert duplicate-URL race on poi_news#444
fatherlinux merged 1 commit into
masterfrom
fix/news-url-dedup-race

Conversation

@fatherlinux
Copy link
Copy Markdown
Member

Summary

saveNewsItems dedups news with a SELECT-then-INSERT that has no DB-level guard. Two collection branches processing the same URL concurrently (e.g. a regional article relevant to several nearby POIs, or the same page reached via two link paths) both read "not present" and both insert — producing duplicate pending items in the moderation queue. Investigation against prod found this firing prolifically during multi-POI runs (a single run created ~16 duplicate pending pairs; 69 dup groups / 81 redundant rows accumulated after the monthly collection).

Fix

  • migration 069 — partial UNIQUE index on normalized source_url (LOWER(REGEXP_REPLACE(source_url,'/+$','')), WHERE source_url IS NOT NULL), plus an idempotent, status-aware safety-net DELETE that collapses any pre-existing duplicates (keeps published > pending > rejected, one row per URL → preserves dedup tombstones). Events are intentionally excluded — distinct events legitimately share a generic source URL (a venue homepage/calendar).
  • newsService.jsON CONFLICT DO NOTHING + rowCount check, so a lost race logs as a duplicate instead of erroring. Exactly one insert wins; never zero, never two.

Also: 3 unrelated pre-existing test fixes (surfaced by the run)

  • ogShareImages — assert the real source-image → POI-photo → brand priority for news permalinks (PR feat: source article image for news/event share previews #385 behavior) instead of an unconditional POI thumbnail.
  • ui.integration (2× More Info link) — open a POI that has a more_info_link via its bare-path permalink /<slug> instead of clicking an arbitrary first marker, which now often lands on an amenity POI with no link.

Test plan

  • ./run.sh test → 30 files, 370 passed, 1 skipped, 0 failed.
  • Migration 069 validated against prod in a rolled-back transaction: safety-net DELETE collapses dups, index builds, and ON CONFLICT DO NOTHING no-ops a duplicate insert without error.
  • Gourmand: 0 violations.

🤖 Generated with Claude Code

saveNewsItems dedups with a SELECT-then-INSERT that has no DB-level guard:
two concurrent collection branches can both read "not present" and both
insert the same URL. Observed firing prolifically under multi-POI runs
(e.g. shared regional articles, and repeatedly on the historicbridges POI),
landing duplicate pending items in the moderation queue.

- migration 069: partial unique index on normalized news source_url, plus an
  idempotent status-aware safety-net dedup (published > pending > rejected).
  Events excluded: distinct events legitimately share a generic source URL.
- newsService: ON CONFLICT DO NOTHING + rowCount check so a lost race logs as
  a duplicate instead of erroring.

Also fix 3 unrelated pre-existing test failures surfaced by the run:
- ogShareImages: assert the real source-image -> POI-photo -> brand priority
  for news permalinks (PR #385 behavior), not an unconditional POI thumbnail.
- ui.integration (2x More Info link): open a POI that has a more_info_link via
  its bare-path permalink instead of clicking an arbitrary first marker, which
  now often lands on an amenity POI with no link.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a concurrent-insert duplicate-URL race condition on the poi_news table by introducing a unique index on normalized source URLs and handling conflicts gracefully in newsService.js using ON CONFLICT DO NOTHING. Additionally, integration tests have been updated to ensure more deterministic behavior when testing UI elements and share images. Regarding the database migration, the current self-join approach for deleting pre-existing duplicates can be highly inefficient on large tables; it is recommended to optimize this query using a window function like ROW_NUMBER() to perform the cleanup in a single pass.

Comment on lines +17 to +23
DELETE FROM poi_news a
USING poi_news b
WHERE a.source_url IS NOT NULL
AND b.source_url IS NOT NULL
AND LOWER(REGEXP_REPLACE(a.source_url, '/+$', '')) = LOWER(REGEXP_REPLACE(b.source_url, '/+$', ''))
AND (CASE a.moderation_status WHEN 'published' THEN 0 WHEN 'pending' THEN 1 ELSE 2 END, a.id)
> (CASE b.moderation_status WHEN 'published' THEN 0 WHEN 'pending' THEN 1 ELSE 2 END, b.id);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a self-join with USING to delete duplicates can be highly inefficient on large tables because it requires an $O(N^2)$ comparison of all rows matching the join condition.

A more performant and standard PostgreSQL approach is to use a subquery with ROW_NUMBER(). This scans the table once, partitions and sorts the rows, and deletes all but the first row in each partition.

Suggested change
DELETE FROM poi_news a
USING poi_news b
WHERE a.source_url IS NOT NULL
AND b.source_url IS NOT NULL
AND LOWER(REGEXP_REPLACE(a.source_url, '/+$', '')) = LOWER(REGEXP_REPLACE(b.source_url, '/+$', ''))
AND (CASE a.moderation_status WHEN 'published' THEN 0 WHEN 'pending' THEN 1 ELSE 2 END, a.id)
> (CASE b.moderation_status WHEN 'published' THEN 0 WHEN 'pending' THEN 1 ELSE 2 END, b.id);
DELETE FROM poi_news
WHERE id IN (
SELECT id
FROM (
SELECT id,
ROW_NUMBER() OVER (
PARTITION BY LOWER(REGEXP_REPLACE(source_url, '/+$', ''))
ORDER BY CASE moderation_status WHEN 'published' THEN 0 WHEN 'pending' THEN 1 ELSE 2 END, id
) as rn
FROM poi_news
WHERE source_url IS NOT NULL
) t
WHERE rn > 1
);

@fatherlinux fatherlinux merged commit 87417ec into master May 29, 2026
3 checks passed
@fatherlinux fatherlinux deleted the fix/news-url-dedup-race branch May 29, 2026 09:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant