Skip to content

feat(monitor): stateless change-tracking diff engine + LLM judge + self-host monitor#65

Merged
us merged 4 commits into
mainfrom
feat/monitor
May 30, 2026
Merged

feat(monitor): stateless change-tracking diff engine + LLM judge + self-host monitor#65
us merged 4 commits into
mainfrom
feat/monitor

Conversation

@us
Copy link
Copy Markdown
Owner

@us us commented May 30, 2026

Implements the opencore side of the /monitor feature (M1, M2, M6) per plans/MONITOR_PLAN.md.

  • M1 — new crw-diff crate (markdown git-diff + parse-diff AST from one similar op-stream, JSON per-field diff, mixed mode, binary hashing, mode-aware content hash); core types (OutputFormat::ChangeTracking, ChangeTracking* + ChangeJudgment); scrape-path wiring + POST /v1/change-tracking/diff (single+batch); /v1/capabilities; 4 metrics; OpenAPI 3.1/3.0.
  • M2crw-extract/judge.rs LLM meaningful-change judge (fixed schema, UNTRUSTED-delimiter injection defense) injected on changed pages.
  • M6 — feature-gated crw-monitor self-host crate (SQLite, UTC scheduler, set-level new/removed, capped judge, HMAC webhook), OFF by default.

Verification: whole workspace compiles, clippy -D warnings clean, 73 tests pass (crw-diff 18 + crw-core 19 + judge 3 + change-tracking integration 10 + crw-monitor 23). Open-core boundary gate verified: default cargo tree -p crw-server pulls no rusqlite/hmac/crw-monitor. Wire shapes (confidence string enum, meaningfulChanges objects) match real Firecrawl.

Deferred: self-host SMTP + crw monitor CLI/MCP surface.

us added 3 commits May 30, 2026 14:41
Introduces the opencore primitives for the /monitor feature (M1+M2):

- New crw-diff crate: pure, stateless diff over a caller-supplied previous
  snapshot. Markdown git-diff (unified text + parse-diff AST from one similar
  op-stream), JSON per-field path diff, mixed mode, binary/non-text hashing,
  mode-aware content hash, and a max-diff-changes truncation cap.
- crw-core types: OutputFormat::ChangeTracking (string variant, change-tracking
  alias), ChangeTrackingOptions/Snapshot/Result, ChangeDiff, DiffAst, and the
  ChangeJudgment wire shape (confidence low|medium|high, meaningfulChanges[]).
  ScrapeData gains content_type + change_tracking; ScrapeRequest gains
  change_tracking, goal, judge_enabled.
- Scrape-path wiring in single.rs (json-mode extraction) and content_type on
  crawl pages.
- POST /v1/change-tracking/diff (single + batch, presence-of-batch
  discriminator) and changeTracking advertised in /v1/capabilities.
- LLM meaningful-change judge (crw-extract/judge.rs) reusing the structured
  provider machinery with a fixed schema and UNTRUSTED-delimiter injection
  defense; injected on changed+diff pages when a goal is set and judging is
  enabled. Judge failure degrades gracefully without failing the scrape.
- Four change-tracking/judge Prometheus metrics; OpenAPI 3.1 + 3.0 specs.

Confidence is a string enum and meaningfulChanges are objects to match the
real Firecrawl wire shape (overrides the plan's f64 simplification).
5/5-reviewed plan, decision log, and sign-off for the Firecrawl-parity
/monitor feature spanning crw-opencore and crw-saas.
New crates/crw-monitor crate (OSS, AGPL-3.0) giving self-hosters monitoring
without forcing a DB on the default engine. OFF by default.

- SQLite (WAL) store: monitors, monitor_targets, snapshots, checks,
  check_pages (cascade); latest-snapshot-per-(monitor,url) + prior URL set.
- UTC scheduler (fixed-interval + 5-field cron, dependency-free) tick loop.
- runner: per-page diff via crw_diff, first-observation→new, set-level
  new/removed over the discovered URL set, >80% site-down gate, capped LLM
  judge (judge_max_pages_per_check) reusing crw_extract::judge.
- HMAC-SHA256 signed local webhook delivery (X-CRW-Signature).
- Gated behind crw-server's `monitor` feature (optional crw-monitor dep);
  rusqlite/hmac stay optional deps of crw-monitor only.

Open-core boundary gate VERIFIED: `cargo tree -p crw-server` (default) pulls
no rusqlite/hmac/crw-monitor; both default and --features monitor builds
compile; 23 crw-monitor tests pass; clippy clean.

Deferred (documented): SMTP email (HMAC webhook is the wired path) and the
`crw monitor` CLI / MCP tool surface.
Copilot AI review requested due to automatic review settings May 30, 2026 12:41
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Preflight requires every workspace member be tiered or unpublished. crw-diff
publishes in tier 2 (crw-core only dep); crw-monitor in a new tier 4 (after
crw-crawl, before crw-server which optionally depends on it).
@us us merged commit dc432ce into main May 30, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants