feat: embed specifications, fetch_versions toggle, graceful interrupt in the crawler #49
Merged
Port the two useful crawler optimizations from the superseded PR #45 onto the lutaml-integration fetcher.

embed:
- Fetch the specifications index with `embed: true` so each spec is realized from the page's inlined `_embedded` payload instead of a per-spec HTTP request (one request per page rather than one per specification; verified ~2 ms in-memory realize vs. a network round-trip).
- Queue each spec link together with its page and hand the page back to `realize` as the `parent_resource` so the embedded path is used in the parallel workers.
- Paginate by page number through the client's fetch path; only `fetch` repopulates `_embedded`, whereas realizing the `next` link drops it and would force a per-spec HTTP request on every page after the first.
- `SafeRealize#realize` now accepts a `parent_resource:` (default `nil` keeps the plain remote-fetch behavior for the version-history callsites).

fetch_versions toggle:
- `RELATON_W3C_FETCH_VERSIONS=false` skips each spec's version-history fan-out (version_history, predecessor/successor versions) for a faster, shallower crawl. Defaults to enabled for a complete dataset.
- Extracted the fan-out into a `#fetch_versions` method guarded by the new `.fetch_versions?` class method (mirrors the existing `.concurrency` env-var pattern).

Docs: document both env vars in README.adoc and CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
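To illustrate why the page is handed to `realize` as the parent, here is a minimal in-memory sketch. `Page`, `FakeIndex`, and `crawl` are hypothetical stand-ins written for this example, not the w3c_api client or the fetcher's real classes; only the `embed:` and `parent_resource:` keyword names come from the commit message. The request counter stands in for HTTP calls.

```ruby
# Hypothetical stand-ins for illustration; not the w3c_api client API.
Page = Struct.new(:number, :links, :embedded, :last)

class FakeIndex
  attr_reader :requests

  # pages: { page_number => [spec hrefs on that page] }
  def initialize(pages)
    @pages = pages
    @requests = 0
  end

  # One counted request per page. With embed: true the page carries
  # every spec inline, keyed by href.
  def fetch(page:, embed: false)
    @requests += 1
    hrefs = @pages.fetch(page)
    embedded = embed ? hrefs.to_h { |href| [href, { "href" => href }] } : {}
    Page.new(page, hrefs, embedded, page == @pages.keys.max)
  end

  # Use the parent's embedded payload when it has the spec; otherwise
  # fall back to a counted per-spec request.
  def realize(href, parent_resource: nil)
    hit = parent_resource && parent_resource.embedded[href]
    return hit if hit

    @requests += 1
    { "href" => href }
  end
end

# Paginate by page number so every page is fetched (and re-embedded),
# then realize each link against the page it came from.
def crawl(index, embed:)
  specs = []
  number = 1
  loop do
    page = index.fetch(page: number, embed: embed)
    page.links.each do |href|
      specs << index.realize(href, parent_resource: page)
    end
    break if page.last

    number += 1
  end
  specs
end
```

With two pages holding five specs in total, this sketch counts two requests with `embed: true` and seven without, which is the "one request per page rather than one per specification" effect described above.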
Port the third optimization from the superseded PR #45, adapted to the parallel fetcher: Ctrl-C now winds the crawl down cleanly instead of killing the process mid-write and losing the run.

- A scoped SIGINT trap sets an `@interrupted` flag (trap body kept minimal: no I/O or locking, since trap context is restricted). The producer stops queuing new specs and the workers stop processing after their in-flight spec, draining the queue so the pool reaches its poison pills without deadlocking on the SizedQueue.
- The index of everything fetched so far is then saved, and a notice is printed from the main thread (not the trap).
- The previous INT handler is restored on the way out, so the trap doesn't leak into the host process.

Extracted the pagination loop into `#enqueue_specs` and added `#with_interrupt_handler`. Verified with a real SIGINT against an otherwise-infinite fake crawl: it stops immediately, saves, and restores the handler. Specs cover the interrupt-stops-but-saves path and handler restoration. Docs updated (README + CLAUDE).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
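The wind-down described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the fetcher's actual code: the class name `InterruptibleCrawl` and the `save_index` helper are hypothetical, and "fetching" a spec is reduced to recording it. Only `#enqueue_specs`, `#with_interrupt_handler`, the `@interrupted` flag, and the SizedQueue/poison-pill design are taken from the commit message.

```ruby
# Simplified sketch; class name and save_index are hypothetical.
class InterruptibleCrawl
  POISON = :poison

  def initialize(specs, workers: 2)
    @specs = specs
    @workers = workers
    @queue = SizedQueue.new(workers)
    @fetched = Queue.new # thread-safe collection of finished specs
    @interrupted = false
  end

  def interrupted?
    @interrupted
  end

  def run
    with_interrupt_handler do
      pool = Array.new(@workers) { Thread.new { work } }
      enqueue_specs
      # Always send one pill per worker so every thread terminates.
      @workers.times { @queue << POISON }
      pool.each(&:join)
    end
    save_index
  end

  private

  # Trap body only flips a flag: no I/O, no locking.
  def with_interrupt_handler
    previous = Signal.trap("INT") { @interrupted = true }
    begin
      yield
    ensure
      Signal.trap("INT", previous) # restore whatever was there before
    end
  end

  # Producer: stop queuing as soon as the flag is set.
  def enqueue_specs
    @specs.each do |spec|
      break if @interrupted

      @queue << spec
    end
  end

  # Worker: after an interrupt, keep popping (so the producer never
  # blocks on a full queue) but skip the work until the pill arrives.
  def work
    while (spec = @queue.pop) != POISON
      next if @interrupted

      @fetched << spec # stand-in for fetching and parsing the spec
    end
  end

  # Stand-in for saving the index: return what was fetched so far.
  def save_index
    Array.new(@fetched.size) { @fetched.pop }
  end
end
```

The workers keep draining after the flag is set because the producer may be blocked on a full `SizedQueue`; if they simply exited, the poison pills could never be enqueued and `run` would hang.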
Ports the three genuinely useful optimizations from the superseded PR #45 onto the `lutaml-integration` fetcher, adapted to the new parallel/`SafeRealize` design. The core of #45 (W3C-API migration + delegating rate-limiting to w3c_api) is already on `lutaml-integration`; these were the only parts not yet carried over.

1. embed
- The specifications index is fetched with `embed: true`; each spec is realized from the page's inlined `_embedded` payload instead of a per-spec HTTP request. Verified against the live API: in-memory realize ~2 ms vs. a network round-trip, one request per page instead of one per spec.
- Each spec link is queued together with its page, and the page is handed to `realize` as `parent_resource` so the embedded path is used across threads.
- Pagination goes by page number through the client's fetch path. `_embedded` is only repopulated on `fetch`, so realizing the `next` link (the obvious approach) drops embedded data and would force a per-spec HTTP request on every page after the first. Confirmed empirically.
- `SafeRealize#realize` gains an optional `parent_resource:` (default `nil` keeps the existing remote-fetch behavior for version-history callsites).

2. fetch_versions toggle
- `RELATON_W3C_FETCH_VERSIONS=false` skips each spec's version-history fan-out (version_history + predecessor/successor versions, the bulk of the requests) for a fast, shallow crawl. Defaults to enabled (complete dataset).
- The fan-out is extracted into `#fetch_versions`, guarded by a new `.fetch_versions?` class method that mirrors the existing `.concurrency` env-var pattern.

3. Graceful SIGINT (Ctrl-C)
- A scoped `SIGINT` trap sets an `@interrupted` flag; the producer stops queuing and workers stop after their in-flight spec (draining the `SizedQueue` so the pool reaches its poison pills without deadlocking), then the index of everything fetched so far is saved.
- Verified with a real `SIGINT` against an otherwise-infinite fake crawl: stops immediately (~0.02 s after the signal), saves, restores the handler.

Docs & tests
- `README.adoc` + `CLAUDE.md` document `RELATON_W3C_FETCH_CONCURRENCY`, `RELATON_W3C_FETCH_VERSIONS`, embed, and graceful interrupt.
- Specs updated for the `fetch_spec(spec, page)` / `realize(obj, parent_resource:)` signatures; added coverage for embed pagination, the page hand-off, `.fetch_versions?` parsing, interrupt-stops-but-saves, and SIGINT-handler restoration. Full suite: 78 examples, 0 failures.

Verified end-to-end against the live API: embed realize → `DataParser.parse` → `ItemData` (correct docnumber/title), the `fetch_versions` toggle, and graceful interrupt.

🤖 Generated with Claude Code
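For reference, the `.fetch_versions?` parsing mentioned under the toggle might look something like the sketch below. The module name `FetcherConfig` is hypothetical, and the exact parsing rule is an assumption: the PR only states that `RELATON_W3C_FETCH_VERSIONS=false` disables the fan-out and that the default is enabled, so here any value other than a case-insensitive `false` is treated as enabled.

```ruby
# Hypothetical module; only the env var name and the method name
# come from the PR description.
module FetcherConfig
  ENV_KEY = "RELATON_W3C_FETCH_VERSIONS"

  # Assumption: unset or any value other than "false" means enabled.
  # The env argument defaults to ENV and accepts a Hash for testing.
  def self.fetch_versions?(env = ENV)
    env.fetch(ENV_KEY, "true").strip.downcase != "false"
  end
end

# Usage: guard the expensive fan-out behind the toggle.
puts "fetching version history" if FetcherConfig.fetch_versions?
```

Passing the environment as an argument keeps the predicate easy to test without mutating the process-wide `ENV`.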