feat: embed specifications, fetch_versions toggle, graceful interrupt in the crawler #49

Merged
andrew2net merged 2 commits into lutaml-integration from rt-embed-fetch-versions
Jun 3, 2026

Conversation

Contributor

@andrew2net andrew2net commented Jun 3, 2026

Ports the three genuinely useful optimizations from the superseded PR #45 onto the lutaml-integration fetcher, adapted to the new parallel/SafeRealize design. The core of #45 (W3C-API migration + delegating rate-limiting to w3c_api) is already on lutaml-integration; these were the only parts not yet carried over.

1. embed

  • Fetch the specifications index with embed: true; each spec is realized from the page's inlined _embedded payload instead of a per-spec HTTP request. Verified against the live API: in-memory realize ~2 ms vs. a network round-trip, one request per page instead of one per spec.
  • Each spec link is queued with its page; the worker hands the page back to realize as parent_resource so the embedded path is used across threads.
  • Paginate by page number through the client's fetch path. This matters: the register only repopulates _embedded on fetch, so realizing the next link (the obvious approach) drops the embedded data and would force a per-spec HTTP request on every page after the first. Confirmed empirically.
  • SafeRealize#realize gains an optional parent_resource: (default nil keeps the existing remote-fetch behavior for version-history callsites).
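
The embed flow above can be sketched roughly as follows. This is a minimal illustration, not the fetcher's actual code: `FakeClient`, `Page`, `Link`, `crawl`, and the top-level `realize` helper are hypothetical stand-ins and do not reflect the w3c_api interface. It only shows the shape of the idea: page-number pagination with embedding on, each link queued with its page, and `parent_resource:` deciding between the in-memory path and a remote fetch.

```ruby
# Hypothetical stand-ins; none of these names come from w3c_api.
Page = Struct.new(:number, :links, :embedded, :total_pages)
Link = Struct.new(:href)

class FakeClient
  attr_reader :requests

  def initialize(pages: 3, per_page: 2)
    @pages = pages
    @per_page = per_page
    @requests = 0
  end

  # One request per page; with embed: true the page carries its specs inline.
  def specifications(page:, embed: false)
    @requests += 1
    hrefs = (1..@per_page).map { |i| "spec-#{page}-#{i}" }
    embedded = embed ? hrefs.to_h { |h| [h, { "title" => h.upcase }] } : {}
    Page.new(page, hrefs.map { |h| Link.new(h) }, embedded, @pages)
  end

  # Per-spec remote fetch, needed only when no embedded payload is available.
  def fetch_spec(href)
    @requests += 1
    { "title" => href.upcase }
  end
end

# Realize from the parent page's embedded payload when one is given;
# with parent_resource: nil, fall back to a remote fetch.
def realize(client, link, parent_resource: nil)
  payload = parent_resource ? parent_resource.embedded[link.href] : nil
  payload || client.fetch_spec(link.href)
end

def crawl(client)
  queue = Queue.new
  page_number = 1
  loop do
    # Fetch every page by number so the embedded payload is always present.
    page = client.specifications(page: page_number, embed: true)
    page.links.each { |link| queue << [link, page] } # link travels with its page
    break if page_number >= page.total_pages
    page_number += 1
  end

  results = []
  until queue.empty?
    link, page = queue.pop
    results << realize(client, link, parent_resource: page)
  end
  results
end

client = FakeClient.new
specs = crawl(client)
```

In this sketch the request counter ends at one per page, since every spec is realized from its page. Calling `realize` without `parent_resource:` takes the remote path instead, which is the behavior the version-history callsites keep.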

2. fetch_versions toggle

  • RELATON_W3C_FETCH_VERSIONS=false skips each spec's version-history fan-out (version_history + predecessor/successor versions — the bulk of the requests) for a fast, shallow crawl. Defaults to enabled (complete dataset).
  • Extracted the fan-out into #fetch_versions, guarded by a new .fetch_versions? class method that mirrors the existing .concurrency env-var pattern.
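
A toggle of this kind might look like the sketch below. Only the variable name `RELATON_W3C_FETCH_VERSIONS` and the method names `.fetch_versions?` and `#fetch_versions` come from this PR; the `Fetcher` class, the exact set of values treated as "off", and the placeholder fan-out are assumptions made for illustration.

```ruby
# Hypothetical fetcher skeleton; only the env-var and method names are from the PR.
class Fetcher
  # Assumption: these spellings count as "disabled"; the real parser may differ.
  FALSY = %w[false 0 no off].freeze

  # Enabled by default; disabled only when the variable holds an "off" value.
  def self.fetch_versions?
    value = ENV.fetch("RELATON_W3C_FETCH_VERSIONS", "true")
    !FALSY.include?(value.strip.downcase)
  end

  def fetch_spec(spec)
    docs = [spec]
    docs.concat(fetch_versions(spec)) if self.class.fetch_versions?
    docs
  end

  private

  # Placeholder for the version-history fan-out (the bulk of the requests).
  def fetch_versions(spec)
    ["#{spec}/versions/1", "#{spec}/versions/2"]
  end
end
```

Reading the variable on each call, rather than caching it at load time, keeps the toggle easy to flip in specs by setting and unsetting `ENV`.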

3. Graceful SIGINT (Ctrl-C)

  • A full crawl is long-running, so Ctrl-C now winds it down cleanly instead of killing the process mid-write. A scoped SIGINT trap sets an @interrupted flag; the producer stops queuing and workers stop after their in-flight spec (draining the SizedQueue so the pool reaches its poison pills without deadlocking), then the index of everything fetched so far is saved.
  • Trap body is minimal (no I/O/locking — trap context is restricted); the notice is printed from the main thread. The previous INT handler is restored afterwards so the trap doesn't leak into the host process.
  • Verified with a real SIGINT against an otherwise-infinite fake crawl: stops immediately (~0.02 s after the signal), saves, restores the handler.
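
The wind-down described above can be approximated with the following self-contained sketch. The `Crawler` class, `POISON` marker, and `interrupting_source` helper are hypothetical; only the overall pattern (scoped trap that sets a flag, producer that stops queuing, workers that drain the SizedQueue, handler restored afterwards) follows the description in this PR. The fake source is infinite and sends a real SIGINT to its own process after a few items.

```ruby
class Crawler
  POISON = :stop
  attr_reader :saved

  def initialize(items, workers: 2)
    @items = items
    @workers = workers
    @interrupted = false
    @index = Queue.new # thread-safe store for fetched items
  end

  def interrupted?
    @interrupted
  end

  def run
    with_interrupt_handler do
      queue = SizedQueue.new(@workers)
      pool = Array.new(@workers) { Thread.new { work(queue) } }
      @items.each do |item|
        break if @interrupted # producer stops queuing new work
        queue << item
      end
      @workers.times { queue << POISON }
      pool.each(&:join)
    end
    # Printed from the main thread, never from the trap body.
    warn "Interrupted; saving what was fetched so far" if @interrupted
    @saved = Array.new(@index.size) { @index.pop }
  end

  private

  def work(queue)
    while (item = queue.pop) != POISON
      next if @interrupted # keep popping so the producer never blocks
      @index << item
    end
  end

  # Scoped trap: the body only sets a flag (no I/O, no locking).
  def with_interrupt_handler
    previous = Signal.trap("INT") { @interrupted = true }
    yield
  ensure
    Signal.trap("INT", previous) if previous
  end
end

# Endless fake crawl that signals its own process after `after` items.
def interrupting_source(after:)
  Enumerator.new do |y|
    n = 0
    loop do
      y << (n += 1)
      Process.kill("INT", Process.pid) if n == after
    end
  end
end

crawler = Crawler.new(interrupting_source(after: 5))
crawler.run
```

Workers keep popping after the flag is set, so the poison pills can always be pushed onto the bounded queue and the pool joins without deadlocking.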

Docs & tests

  • README.adoc + CLAUDE.md document RELATON_W3C_FETCH_CONCURRENCY, RELATON_W3C_FETCH_VERSIONS, embed, and graceful interrupt.
  • Updated the fetcher specs for the new fetch_spec(spec, page) / realize(obj, parent_resource:) signatures; added coverage for embed pagination, the page hand-off, .fetch_versions? parsing, interrupt-stops-but-saves, and SIGINT-handler restoration. Full suite: 78 examples, 0 failures.

Verified end-to-end against the live API: embed realize → DataParser.parseItemData (correct docnumber/title), the fetch_versions toggle, and graceful interrupt.

🤖 Generated with Claude Code

andrew2net and others added 2 commits June 3, 2026 18:20
Port the two useful crawler optimizations from the superseded PR #45 onto
the lutaml-integration fetcher.

embed:
- Fetch the specifications index with `embed: true` so each spec is realized
  from the page's inlined `_embedded` payload instead of a per-spec HTTP
  request (one request per page rather than one per specification — verified
  ~2ms in-memory realize vs. a network round-trip).
- Queue each spec link together with its page and hand the page back to
  `realize` as the `parent_resource` so the embedded path is used in the
  parallel workers.
- Paginate by page number through the client's fetch path; only `fetch`
  repopulates `_embedded`, whereas realizing the `next` link drops it and
  would force a per-spec HTTP request on every page after the first.
- SafeRealize#realize now accepts a `parent_resource:` (default nil keeps the
  plain remote-fetch behavior for the version-history callsites).

fetch_versions toggle:
- RELATON_W3C_FETCH_VERSIONS=false skips each spec's version-history fan-out
  (version_history, predecessor/successor versions) for a faster, shallower
  crawl. Defaults to enabled for a complete dataset. Extracted the fan-out
  into a #fetch_versions method guarded by the new .fetch_versions? class
  method (mirrors the existing .concurrency env-var pattern).

Docs: document both env vars in README.adoc and CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Port the third optimization from the superseded PR #45, adapted to the
parallel fetcher: Ctrl-C now winds the crawl down cleanly instead of killing
the process mid-write and losing the run.

- A scoped SIGINT trap sets an @interrupted flag (trap body kept minimal — no
  I/O or locking, since trap context is restricted). The producer stops
  queuing new specs and the workers stop processing after their in-flight
  spec, draining the queue so the pool reaches its poison pills without
  deadlocking on the SizedQueue.
- The index of everything fetched so far is then saved, and a notice is
  printed from the main thread (not the trap).
- The previous INT handler is restored on the way out, so the trap doesn't
  leak into the host process.

Extracted the pagination loop into #enqueue_specs and added #with_interrupt_handler.
Verified with a real SIGINT against an otherwise-infinite fake crawl: it stops
immediately, saves, and restores the handler. Specs cover the interrupt-stops-
but-saves path and handler restoration. Docs updated (README + CLAUDE).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@andrew2net andrew2net changed the title from "feat: embed specifications + fetch_versions toggle in the crawler" to "feat: embed specifications, fetch_versions toggle, graceful interrupt in the crawler" Jun 3, 2026
@andrew2net andrew2net merged commit 7b0b0b6 into lutaml-integration Jun 3, 2026
11 checks passed