fix(watcher): debounce fsnotify events + atomic index/links updates #10

Merged
skridlevsky merged 2 commits into skridlevsky:main from jacky1967:pr/watcher-debounce-fix on Apr 30, 2026

@jacky1967
Contributor

Summary

Fixes two race conditions in the Obsidian backend's file watcher that surface under rapid concurrent writes (typical of multi-writer scenarios where multiple clients write to the vault through the MCP layer):

  1. Lock split between index and backlinks updates: handleEvent performed indexFile() and BuildBacklinks() as two separate lock acquisitions, leaving a brief window in which readers could see a page in c.pages whose backlinks did not yet reflect the new content.
  2. Atomic temp+rename surfaces a transient missing page: macOS/iCloud sync and atomic write-to-temp+rename (used internally and by external editors) generate a Remove event followed quickly by a Create event on fsnotify (~10 ms gap). During that window the page was temporarily absent from the index.

Background / motivation

This bug surfaced while wiring up the Hermès Agent project to use graphthulhu as a single-writer for an Obsidian vault. Three independent producers (LLM via stdio, plugin via HTTP, cron script via HTTP) share a single graphthulhu subprocess in Streamable HTTP mode. Burst writes (a Scout report followed by a cascade of dispatch writes, or Gmail attachments uploaded in sequence) reliably triggered the second race; the first one was visible as occasional missing backlinks in tests.

The fix itself is generic and does not depend on Hermès — it applies to any setup where multiple clients write to the vault through the MCP layer, or where external editors save with atomic temp+rename.

Solution

Two complementary changes:

1. Atomic helpers (single lock for index + links)

vault/vault.go adds two helpers that acquire c.mu once and run the index update + rebuildLinksLocked() together:

  • indexFileWithLinks(relPath, content, info) — replaces indexFile() + BuildBacklinks() in the watcher path.
  • removePageWithLinks(lowerName) — replaces removePageFromIndex() + BuildBacklinks().

Existing BuildBacklinks() and removePageFromIndex() are preserved for callers that need them separately (e.g. Reload()).

2. Debounce fsnotify events (per-path coalescer)

New vault/watcher_debounce.go (~120 lines):

  • scheduleEventResolution(absPath, relPath) — schedules a time.Timer per path. If a new event arrives for the same path, the existing timer is reset.
  • resolveFileState(absPath, relPath) — fires after debounceDelay (default 100 ms) and reads disk state: file present → indexFileWithLinks; file absent → removePageWithLinks. Race-free with respect to atomic rename: regardless of the Remove/Create sequence received, the final state always matches what is on disk at resolution time.

Client struct gains pendingMu, pendingEvents map[string]*pendingEvent, debounceDelay time.Duration (configurable for tests; defaults to defaultDebounceDelay = 100 * time.Millisecond).

handleEvent is simplified: it now delegates Create/Write/Remove/Rename events to scheduleEventResolution (single switch arm).

The 100 ms delay is short enough not to noticeably delay genuine deletes or writes, and long enough to absorb the typical macOS/iCloud atomic-rename window (~10 ms observed on iCloud-synced vaults).

Test plan

5 new tests in vault/watcher_debounce_test.go:

  • TestDebounce_AtomicRenameNoMissingPage — page remains accessible during the race window (prior code: race observable; this PR: race eliminated).
  • TestDebounce_BurstWritesCoalesce — 10 rapid writes → 1 final reindex (not a cascade of 10).
  • TestIndexFileWithLinks_Atomic — 50 reindex calls + concurrent reader → 0 inconsistencies (concurrent reader cannot observe a state where the page exists but backlinks don't reflect it).
  • TestRemovePageWithLinks_Atomic — analogous for removal.
  • TestScheduleEventResolution_Coalesce — 5 schedules → 1 pending timer → 1 resolve.

vault/vault_test.go TestWatchFile* sleeps adjusted from 100 ms → 250 ms (debounce 100 ms + resolve margin).

Full Go test suite passes (vault, tools, parser, graph, types).

End-to-end smoke against a running MCP HTTP service: 5 burst write_raw_page calls followed by immediate get_page on each → 5/5 accessible, 0 inaccessible. No regression observed in production usage over the past 24h.

Notes

  • No public API change; existing callers of BuildBacklinks() and removePageFromIndex() keep working.
  • flushPendingEventsForTest() is exposed as a non-public helper for unit tests that need synchronous flushing without time.Sleep.
  • The fix targets the Obsidian backend specifically (where fsnotify-driven re-indexing matters); Logseq backends that drive the index through DataScript transactions are unaffected.

jacky1967 and others added 2 commits April 30, 2026 20:52
Fixes two races identified during the Phase 4 delivery:

1. **Lock split in handleEvent**: indexFile() and BuildBacklinks() acquired
   the lock separately, twice. A concurrent reader could observe a state in
   which the page existed in c.pages but the backlinks did not yet reflect
   the new content (visible intermediate window).

2. **Atomic temp+rename**: macOS/iCloud sync generates closely spaced Remove +
   Create sequences (~10ms) on atomic renames (internal write_raw_page,
   external editors). During that window the page temporarily disappeared
   from the index; this is typical of vault-preflight burst writes
   (Rapport_Scout followed by cascading dispatch) or a gmail_sync rerun.

**Combined solution**:

- **Atomic helpers** indexFileWithLinks / removePageWithLinks: take the lock
  ONCE and do index+rebuildLinks (or remove+rebuildLinks) with no
  intermediate window visible to readers.

- **Coalescer/debounce of fsnotify events**: one timer per path (defaultDelay
  100ms). When an event arrives, the existing timer for the same path is
  reset. The final resolution (after 100ms of inactivity) reads the actual
  disk state and applies remove OR reindex accordingly. Race-free with
  respect to atomic renames: whatever Remove/Create sequence is received,
  the result always reflects what is on disk at resolve time.

vault/vault.go:
- Client struct extended: pendingMu, pendingEvents, debounceDelay.
- handleEvent simplified: delegates to scheduleEventResolution (single switch
  case for Create|Write|Remove|Rename, all coalesced).

vault/watcher_debounce.go (new, 120 lines): scheduleEventResolution,
resolveFileState, indexFileWithLinks, removePageWithLinks, plus a
flushPendingEventsForTest helper for tests that avoid the wall clock.

vault/watcher_debounce_test.go (new, 5 tests):
- TestDebounce_AtomicRenameNoMissingPage: page stays accessible during the
  race window.
- TestDebounce_BurstWritesCoalesce: 10 burst writes -> 1 final reindex
  (not a cascade).
- TestIndexFileWithLinks_Atomic: 50 reindexes + concurrent reader ->
  0 inconsistencies (atomicity demonstrated).
- TestRemovePageWithLinks_Atomic: analogous for removal.
- TestScheduleEventResolution_Coalesce: 5 events -> 1 pending timer ->
  1 resolve.

vault/vault_test.go: sleeps 100->250ms in TestWatchFile* (debounce 100ms
+ resolve margin).

Direct E2E smoke against the real launchd service: 5 burst write_raw_page
HTTP calls followed by an immediate get_page -> 5/5 accessible,
0 inaccessible. Full Go suite OK (vault, tools, parser, graph, types).
- Translate comments to English for codebase consistency
- Fix delete-race in scheduleEventResolution: closure could delete a
  newer pending event installed between timer-fire and closure-lock
  acquisition, breaking coalescing under load. Match by pointer
  identity before deleting.
- Stop pending debounce timers in Close() to prevent post-shutdown
  resolve calls and log spam.
- Bump debounceDelay explicitly in two fsnotify-timing-dependent
  tests (atomic-rename and burst-coalesce) for CI flake protection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skridlevsky skridlevsky force-pushed the pr/watcher-debounce-fix branch from f2d01f1 to 165a19b on April 30, 2026 19:05
@skridlevsky
Owner

Thanks for this, @jacky1967! Solid diagnosis and the architecture is clean.

Rebased onto current main and added a small polish commit on top: translated comments to English, fixed a subtle delete-race in scheduleEventResolution (closure could delete a newer pending event after timer fire), added timer cleanup to Close(), and bumped debounceDelay in two timing-dependent tests for CI flake protection.

Tests green, merging. Really appreciate the work.

@skridlevsky skridlevsky merged commit c7d9468 into skridlevsky:main Apr 30, 2026
1 check passed