feat(monitor): stateless change-tracking diff engine + LLM judge + self-host monitor #65
Merged
Introduces the opencore primitives for the /monitor feature (M1+M2):

- New crw-diff crate: pure, stateless diff over a caller-supplied previous snapshot. Markdown git-diff (unified text + parse-diff AST from one similar op-stream), JSON per-field path diff, mixed mode, binary/non-text hashing, mode-aware content hash, and a max-diff-changes truncation cap.
- crw-core types: OutputFormat::ChangeTracking (string variant, change-tracking alias), ChangeTrackingOptions/Snapshot/Result, ChangeDiff, DiffAst, and the ChangeJudgment wire shape (confidence low|medium|high, meaningfulChanges[]). ScrapeData gains content_type + change_tracking; ScrapeRequest gains change_tracking, goal, judge_enabled.
- Scrape-path wiring in single.rs (json-mode extraction) and content_type on crawl pages.
- POST /v1/change-tracking/diff (single + batch, presence-of-batch discriminator) and changeTracking advertised in /v1/capabilities.
- LLM meaningful-change judge (crw-extract/judge.rs) reusing the structured provider machinery with a fixed schema and UNTRUSTED-delimiter injection defense; injected on changed+diff pages when a goal is set and judging is enabled. Judge failure degrades gracefully without failing the scrape.
- Four change-tracking/judge Prometheus metrics; OpenAPI 3.1 + 3.0 specs.

Confidence is a string enum and meaningfulChanges are objects to match the real Firecrawl wire shape (overrides the plan's f64 simplification).
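The judge gating described above (run only on changed pages that carry a diff, with a goal set and judging enabled, and never fail the scrape when the judge errors) could look roughly like the sketch below. `ChangeStatus`, `should_judge`, and `judge_or_skip` are hypothetical names chosen for illustration, not the identifiers used in crw-extract, and treating a blank goal as unset is an assumption.

```rust
/// Hypothetical per-page status; the real crw-core variants are not shown in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ChangeStatus {
    New,
    Same,
    Changed,
    Removed,
}

/// Decide whether the LLM judge should run for a page.
/// All four conditions from the commit message must hold.
pub fn should_judge(
    status: ChangeStatus,
    has_diff: bool,
    goal: Option<&str>,
    judge_enabled: bool,
) -> bool {
    judge_enabled
        && status == ChangeStatus::Changed
        && has_diff
        // Assumption: a blank goal counts as "no goal set".
        && goal.map_or(false, |g| !g.trim().is_empty())
}

/// Run the judge if requested; on any judge error return `None`
/// so the surrounding scrape still succeeds without a judgment.
pub fn judge_or_skip<J, E>(should: bool, judge: impl FnOnce() -> Result<J, E>) -> Option<J> {
    if !should {
        return None;
    }
    judge().ok()
}
```

Returning `Option` rather than `Result` keeps the failure path invisible to the scrape caller, which is one simple way to get the "degrades gracefully" behavior.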
5/5-reviewed plan, decision log, and sign-off for the Firecrawl-parity /monitor feature spanning crw-opencore and crw-saas.
New crates/crw-monitor crate (OSS, AGPL-3.0) giving self-hosters monitoring without forcing a DB on the default engine. OFF by default.

- SQLite (WAL) store: monitors, monitor_targets, snapshots, checks, check_pages (cascade); latest-snapshot-per-(monitor,url) + prior URL set.
- UTC scheduler (fixed-interval + 5-field cron, dependency-free) tick loop.
- runner: per-page diff via crw_diff, first-observation→new, set-level new/removed over the discovered URL set, >80% site-down gate, capped LLM judge (judge_max_pages_per_check) reusing crw_extract::judge.
- HMAC-SHA256 signed local webhook delivery (X-CRW-Signature).
- Gated behind crw-server's `monitor` feature (optional crw-monitor dep); rusqlite/hmac stay optional deps of crw-monitor only.

Open-core boundary gate VERIFIED: `cargo tree -p crw-server` (default) pulls no rusqlite/hmac/crw-monitor; both default and --features monitor builds compile; 23 crw-monitor tests pass; clippy clean.

Deferred (documented): SMTP email (HMAC webhook is the wired path) and the `crw monitor` CLI / MCP tool surface.
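As a rough illustration of the runner's set-level new/removed step and the >80% site-down gate, here is a minimal sketch using only the standard library. `SetDiff`, `set_level_diff`, and `site_looks_down` are hypothetical names, and the assumption that the gate compares failed page fetches against attempted ones is mine; the commit message only states the threshold.

```rust
use std::collections::BTreeSet;

/// URLs that appeared in, or disappeared from, the discovered set.
#[derive(Debug, PartialEq, Eq)]
pub struct SetDiff {
    pub new: Vec<String>,
    pub removed: Vec<String>,
}

/// Compare the prior URL set with the set discovered in the current check.
pub fn set_level_diff(prior: &BTreeSet<String>, current: &BTreeSet<String>) -> SetDiff {
    SetDiff {
        // In current but not in prior: first observation, reported as new.
        new: current.difference(prior).cloned().collect(),
        // In prior but not in current: reported as removed.
        removed: prior.difference(current).cloned().collect(),
    }
}

/// Assumed gate: true when strictly more than 80% of attempted pages failed.
/// Integer arithmetic avoids floating-point rounding at the boundary.
pub fn site_looks_down(failed: usize, attempted: usize) -> bool {
    attempted > 0 && failed * 5 > attempted * 4
}
```

A caller would presumably check `site_looks_down` first and skip reporting `removed` pages when it returns true, so that an outage is not mistaken for a mass deletion.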
Preflight requires every workspace member be tiered or unpublished. crw-diff publishes in tier 2 (crw-core only dep); crw-monitor in a new tier 4 (after crw-crawl, before crw-server which optionally depends on it).
Implements the opencore side of the `/monitor` feature (M1, M2, M6) per `plans/MONITOR_PLAN.md`.

- `crw-diff` crate (markdown git-diff + parse-diff AST from one `similar` op-stream, JSON per-field diff, mixed mode, binary hashing, mode-aware content hash); core types (OutputFormat::ChangeTracking, ChangeTracking* + ChangeJudgment); scrape-path wiring + `POST /v1/change-tracking/diff` (single+batch); `/v1/capabilities`; 4 metrics; OpenAPI 3.1/3.0.
- `crw-extract/judge.rs` LLM meaningful-change judge (fixed schema, UNTRUSTED-delimiter injection defense) injected on changed pages.
- `crw-monitor` self-host crate (SQLite, UTC scheduler, set-level new/removed, capped judge, HMAC webhook), OFF by default.

Verification: whole workspace compiles, `clippy -D warnings` clean, 73 tests pass (crw-diff 18 + crw-core 19 + judge 3 + change-tracking integration 10 + crw-monitor 23). Open-core boundary gate verified: default `cargo tree -p crw-server` pulls no rusqlite/hmac/crw-monitor. Wire shapes (confidence string enum, meaningfulChanges objects) match real Firecrawl.

Deferred: self-host SMTP + `crw monitor` CLI/MCP surface.
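For readers curious how a dependency-free 5-field cron check for the UTC scheduler might be structured, the following is a minimal sketch. `field_matches` and `cron_matches` are hypothetical names, not the crw-monitor API. The sketch supports `*`, single values, ranges, lists, and `/step`, and it simply requires all five fields to match, which ignores the traditional cron rule that day-of-month and day-of-week are OR-ed when both are restricted.

```rust
/// Returns true when `value` satisfies one cron field such as
/// "*", "*/15", "1-5", "0,30" or "10-50/20". Malformed parts never match.
pub fn field_matches(field: &str, value: u32, min: u32, max: u32) -> bool {
    field.split(',').any(|part| {
        // Split off an optional "/step" suffix; a zero or invalid step never matches.
        let (range, step) = match part.split_once('/') {
            Some((r, s)) => match s.parse::<u32>() {
                Ok(n) if n > 0 => (r, Some(n)),
                _ => return false,
            },
            None => (part, None),
        };
        // Resolve the range to inclusive bounds.
        let (lo, hi) = if range == "*" {
            (min, max)
        } else if let Some((a, b)) = range.split_once('-') {
            match (a.parse::<u32>(), b.parse::<u32>()) {
                (Ok(a), Ok(b)) => (a, b),
                _ => return false,
            }
        } else {
            match range.parse::<u32>() {
                // "5/10" is read as "from 5 up to max, every 10".
                Ok(n) if step.is_some() => (n, max),
                Ok(n) => (n, n),
                Err(_) => return false,
            }
        };
        value >= lo && value <= hi && (value - lo) % step.unwrap_or(1) == 0
    })
}

/// Check a 5-field expression (minute hour day-of-month month day-of-week)
/// against a UTC timestamp that has already been broken into its parts.
pub fn cron_matches(
    expr: &str,
    minute: u32,
    hour: u32,
    day: u32,
    month: u32,
    weekday: u32, // 0 = Sunday .. 6 = Saturday
) -> bool {
    let fields: Vec<&str> = expr.split_whitespace().collect();
    if fields.len() != 5 {
        return false;
    }
    field_matches(fields[0], minute, 0, 59)
        && field_matches(fields[1], hour, 0, 23)
        && field_matches(fields[2], day, 1, 31)
        && field_matches(fields[3], month, 1, 12)
        && field_matches(fields[4], weekday, 0, 6)
}
```

A tick loop would call `cron_matches` once per minute with the current UTC time; converting a Unix timestamp to calendar fields is left out of the sketch.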