fix(watcher): debounce fsnotify events + atomic index/links updates#10
Merged
Merged
Conversation
Fixe deux races identifiees lors de la livraison Phase 4 : 1. **Lock split dans handleEvent** : indexFile() et BuildBacklinks() faisaient 2 prises de lock distinctes. Un lecteur concurrent pouvait observer un etat ou la page existait dans c.pages mais ou les backlinks ne refletaient pas encore le nouveau contenu (fenetre intermediaire visible). 2. **Atomic temp+rename** : macOS/iCloud sync genere des sequences Remove + Create rapprochees (~10ms) sur les rename atomiques (write_raw_page interne, editeurs externes). Pendant la fenetre, la page disparaissait temporairement de l'index — typique vault-preflight burst writes (Rapport_Scout puis dispatch en cascade) ou gmail_sync rerun. **Solution combinee** : - **Helpers atomiques** indexFileWithLinks / removePageWithLinks : prennent le lock UNE fois et font index+rebuildLinks (ou remove+rebuildLinks) sans fenetre intermediaire visible aux lecteurs. - **Coalescer/debounce des events fsnotify** : timer par path (defaultDelay 100ms). Au reception d'un event, le timer existant pour le meme path est reset. La resolution finale (apres 100ms d'inactivite) lit l'etat disque reel et applique remove OU reindex selon. Race-free vis-a-vis des atomic rename : peu importe la sequence Remove/Create recue, le resultat reflete toujours ce qui est sur disque au moment du resolve. vault/vault.go : - Client struct enriche : pendingMu, pendingEvents, debounceDelay. - handleEvent simplifie : delegue au scheduleEventResolution (single switch case pour Create|Write|Remove|Rename — tous coalesces). vault/watcher_debounce.go (nouveau, 120 lignes) : scheduleEventResolution, resolveFileState, indexFileWithLinks, removePageWithLinks, plus un helper flushPendingEventsForTest pour les tests sans wall-clock. vault/watcher_debounce_test.go (nouveau, 5 tests) : - TestDebounce_AtomicRenameNoMissingPage : page reste accessible pendant la fenetre de race. - TestDebounce_BurstWritesCoalesce : 10 writes burst -> 1 reindex final (pas une cascade). - TestIndexFileWithLinks_Atomic : 50 reindex + lecteur concurrent -> 0 incoherence (atomicite prouvee). - TestRemovePageWithLinks_Atomic : analogue pour suppression. - TestScheduleEventResolution_Coalesce : 5 events -> 1 timer pending -> 1 resolve. vault/vault_test.go : sleeps 100->250ms dans TestWatchFile* (debounce 100ms + marge resolve). Smoke E2E direct contre service launchd reel : 5 write_raw_page burst HTTP puis get_page immediat -> 5/5 accessibles, 0 inaccessibles. Suite Go complete OK (vault, tools, parser, graph, types).
- Translate comments to English for codebase consistency - Fix delete-race in scheduleEventResolution: closure could delete a newer pending event installed between timer-fire and closure-lock acquisition, breaking coalescing under load. Match by pointer identity before deleting. - Stop pending debounce timers in Close() to prevent post-shutdown resolve calls and log spam. - Bump debounceDelay explicitly in two fsnotify-timing-dependent tests (atomic-rename and burst-coalesce) for CI flake protection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f2d01f1 to
165a19b
Compare
Owner
|
Thanks for this, @jacky1967! Solid diagnosis and the architecture is clean. Rebased onto current main and added a small polish commit on top: translated comments to English, fixed a subtle delete-race in Tests green, merging. Really appreciate the work. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes two race conditions in the Obsidian backend's file watcher that surface under rapid concurrent writes (typical of multi-writer scenarios where multiple clients write to the vault through the MCP layer):
handleEventperformedindexFile()andBuildBacklinks()as two separate lock acquisitions, leaving a brief window where readers could see a page inc.pageswhose backlinks did not yet reflect the new content.Removefollowed quickly byCreateevents on fsnotify (~10 ms gap). During that window the page was temporarily absent from the index.Background / motivation
This bug surfaced while wiring up the Hermès Agent project to use graphthulhu as a single-writer for an Obsidian vault. Three independent producers (LLM via stdio, plugin via HTTP, cron script via HTTP) share a single
graphthulhusubprocess in Streamable HTTP mode. Burst writes (a Scout report followed by a cascade of dispatch writes, or Gmail attachments uploaded in sequence) reliably triggered the second race; the first one was visible as occasional missing backlinks in tests.The fix itself is generic and does not depend on Hermès — it applies to any setup where multiple clients write to the vault through the MCP layer, or where external editors save with atomic temp+rename.
Solution
Two complementary changes:
1. Atomic helpers (single lock for index + links)
vault/vault.goadds two helpers that acquirec.muonce and run the index update +rebuildLinksLocked()together:indexFileWithLinks(relPath, content, info)— replacesindexFile()+BuildBacklinks()in the watcher path.removePageWithLinks(lowerName)— replacesremovePageFromIndex()+BuildBacklinks().Existing
BuildBacklinks()andremovePageFromIndex()are preserved for callers that need them separately (e.g.Reload()).2. Debounce fsnotify events (per-path coalescer)
New
vault/watcher_debounce.go(~120 lines):scheduleEventResolution(absPath, relPath)— schedules atime.Timerper path. If a new event arrives for the same path, the existing timer is reset.resolveFileState(absPath, relPath)— fires afterdebounceDelay(default 100 ms) and reads disk state: file present →indexFileWithLinks; file absent →removePageWithLinks. Race-free with respect to atomic rename: regardless of the Remove/Create sequence received, the final state always matches what is on disk at resolution time.Clientstruct gainspendingMu,pendingEvents map[string]*pendingEvent,debounceDelay time.Duration(configurable for tests; defaults todefaultDebounceDelay = 100 * time.Millisecond).handleEventis simplified: it now delegates Create/Write/Remove/Rename events toscheduleEventResolution(single switch arm).The 100 ms delay is short enough not to noticeably delay genuine deletes or writes, and long enough to absorb the typical macOS/iCloud atomic-rename window (~10 ms observed on iCloud-synced vaults).
Test plan
5 new tests in
vault/watcher_debounce_test.go:TestDebounce_AtomicRenameNoMissingPage— page remains accessible during the race window (prior code: race observable; this PR: race eliminated).TestDebounce_BurstWritesCoalesce— 10 rapid writes → 1 final reindex (not a cascade of 10).TestIndexFileWithLinks_Atomic— 50 reindex calls + concurrent reader → 0 inconsistencies (concurrent reader cannot observe a state where the page exists but backlinks don't reflect it).TestRemovePageWithLinks_Atomic— analogous for removal.TestScheduleEventResolution_Coalesce— 5 schedules → 1 pending timer → 1 resolve.vault/vault_test.goTestWatchFile*sleeps adjusted from 100 ms → 250 ms (debounce 100 ms + resolve margin).Full Go test suite passes (
vault,tools,parser,graph,types).End-to-end smoke against a running MCP HTTP service: 5 burst
write_raw_pagecalls followed by immediateget_pageon each → 5/5 accessible, 0 inaccessible. No regression observed in production usage over the past 24h.Notes
BuildBacklinks()andremovePageFromIndex()keep working.flushPendingEventsForTest()is exposed as a non-public helper for unit tests that need synchronous flushing withouttime.Sleep.