feat(search): site= scoping filter + /admin/site-submit priority enqueue #20
Open
TeoSlayer wants to merge 9 commits into
Adds four lanes (submitted/refresh/discovered/bulk) so high-value URLs from RSS, sitemaps, and publisher submissions jump the bulk-crawl backlog instead of waiting behind 2.8M cloud.google.com pages. Default weights 50/30/15/5; empty lanes donate their share to the next priority.

Wire format: 'f' + sub + lane + host + 0x00 + url for the lane-aware secondary index. Lane byte (0..3) is below printable-ASCII so legacy and lane-aware keys coexist; ClaimFrontier scans the new format first, then falls back to the legacy 'f' + sub + host index so the existing 4.3M queued URLs drain naturally without a synchronous migration.

frontierEntry gains a trailing Lane byte; missing bytes decode as LaneDiscovered (2) so pre-lane rows keep working through transitions.

Per-lane round-robin cursor (laneCursors) and a monotonic lane tick (laneTick) drive deterministic weighted RR: fair without per-call randomness. Host-fairness preserved within each lane.

GetLaneStats walks both secondary indexes key-only and tallies per lane; surfaced in /queue as a lanes[] block plus legacy_queued / legacy_in_flight totals so operators can see whether the RR is actually draining RSS ahead of bulk.

SeedRSS and SeedSitemap push to LaneRefresh and bypass allowedDomain: the operator explicitly requested the feed/sitemap, so its URLs are trusted regardless of the curated include_domains list. Crawler outbound-link discovery still goes through Seed (which defaults to LaneDiscovered and respects allowedDomain), so include_domains continues to gate organic exploration as designed.

Backwards-compat notes:
- PushFrontier is a thin wrapper over PushFrontierLane(LaneDiscovered).
- transitionFrontier blind-deletes BOTH legacy and lane-aware keys, so completion/failure works for entries created in either era.
- SQLite Store gains a PushFrontierLane stub that ignores the lane (legacy schema has no lane column); production runs on Pebble.
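The key layout and the weighted lane pick described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the lane numbering for submitted (0) and refresh (1) is inferred from the order the lanes are listed in, `laneKey` and `pickLane` are hypothetical names, and "donate to the next priority" is interpreted here as "fall through to the next lane in order that has work".

```go
package main

import "fmt"

// Lane is the one-byte priority class. Discovered = 2 and bulk = 3 come from
// the commit message; submitted = 0 and refresh = 1 are assumed from ordering.
type Lane byte

const (
	LaneSubmitted Lane = iota
	LaneRefresh
	LaneDiscovered
	LaneBulk
	numLanes = 4
)

// laneKey builds 'f' + sub + lane + host + 0x00 + url, where sub is the
// state byte ('q' for queued, 'i' for in-flight).
func laneKey(sub byte, lane Lane, host, url string) []byte {
	k := make([]byte, 0, 4+len(host)+len(url))
	k = append(k, 'f', sub, byte(lane))
	k = append(k, host...)
	k = append(k, 0x00)
	k = append(k, url...)
	return k
}

// pickLane maps a monotonic tick onto a lane: the tick selects a slot in
// [0, sum(weights)), the slot selects a lane by cumulative weight, and an
// empty lane hands its slot to the next lane in order that has work.
func pickLane(tick uint64, weights [numLanes]int, hasWork [numLanes]bool) (Lane, bool) {
	total := 0
	for _, w := range weights {
		total += w
	}
	if total <= 0 {
		return 0, false
	}
	slot := int(tick % uint64(total))
	start := 0
	for i, w := range weights {
		if slot < w {
			start = i
			break
		}
		slot -= w
	}
	for i := 0; i < numLanes; i++ {
		l := (start + i) % numLanes
		if hasWork[l] {
			return Lane(l), true
		}
	}
	return 0, false // every lane is empty
}

func main() {
	fmt.Printf("%q\n", laneKey('q', LaneRefresh, "example.com", "https://example.com/feed"))

	weights := [numLanes]int{50, 30, 15, 5}
	hasWork := [numLanes]bool{true, false, true, true} // refresh lane is empty
	var counts [numLanes]int
	for tick := uint64(0); tick < 100; tick++ {
		if l, ok := pickLane(tick, weights, hasWork); ok {
			counts[l]++
		}
	}
	fmt.Println(counts) // prints [50 0 45 5]: refresh's 30 slots go to discovered
}
```

A real scheduler would likely interleave the slots rather than hand out 50 consecutive ticks to one lane; the sketch only shows that a tick-driven pick is deterministic and needs no random source.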
PushFrontierBatch lets a caller insert N URLs in a single Pebble batch
and a single p.mu acquire. SeedRSS and SeedSitemap now buffer URLs and
flush via this path: a 25-URL reddit feed that previously took 8-17
minutes (one mu hop per URL, contending with 256 crawler workers) now
lands in milliseconds. Sitemap streaming flushes every 1024 URLs so a
100K-entry kubernetes.io sitemap doesn't hold the lock for the whole
parse.
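The buffer-and-flush pattern described above might look roughly like this. It is a sketch under assumptions: `seedBuffered`, the `next` iterator, and the `push` callback are hypothetical stand-ins (the real PushFrontierBatch signature is not shown in this PR text), and only the 1024-URL flush size comes from the message.

```go
package main

import "fmt"

// seedBuffered drains next() into a reusable buffer and hands it to push in
// chunks of at most flushEvery URLs, so each lock acquire covers one chunk
// instead of one URL. push must not retain the slice: it is reused.
func seedBuffered(next func() (string, bool), push func([]string) error, flushEvery int) (int, error) {
	if flushEvery <= 0 {
		flushEvery = 1024 // flush size quoted in the commit message
	}
	buf := make([]string, 0, flushEvery)
	total := 0
	flush := func() error {
		if len(buf) == 0 {
			return nil
		}
		if err := push(buf); err != nil {
			return err
		}
		total += len(buf)
		buf = buf[:0]
		return nil
	}
	for u, ok := next(); ok; u, ok = next() {
		buf = append(buf, u)
		if len(buf) >= flushEvery {
			if err := flush(); err != nil {
				return total, err
			}
		}
	}
	err := flush() // push the final partial chunk
	return total, err
}

func main() {
	i := 0
	next := func() (string, bool) {
		if i >= 2500 {
			return "", false
		}
		i++
		return fmt.Sprintf("https://example.com/page/%d", i), true
	}
	var sizes []int
	push := func(urls []string) error {
		sizes = append(sizes, len(urls)) // stand-in for one batch commit
		return nil
	}
	n, err := seedBuffered(next, push, 1024)
	fmt.Println(n, sizes, err) // prints 2500 [1024 1024 452] <nil>
}
```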
DemoteHostToLane walks every queued URL for a host (across legacy AND
lane-aware indexes) and re-keys to a target lane. The escape hatch for
the cloud.google.com situation: 2.8M queued URLs on one host blocked
65% of the host-fair claim slots from fresher lanes. Re-keys atomically
in 1024-URL batches; skips URLs that flipped to in_flight under us.
New endpoint POST /admin/frontier-demote-host {host, lane} surfaces it.
Tested on the GH200: cloud.google.com → lane 3 moved 2,804,001 URLs
in 31 seconds (~90K rekeys/sec); steady crawl rate went from 79 to 134
docs/min on the next sample (+70%).
Optional embed worker pool drains a buffered channel separate from the
crawl-worker loop. Enabled when COSIFT_EMBED_DECOUPLE_WORKERS > 0:
  Crawler worker: fetch → parse → UpsertDocument → IndexDocument →
                  push embedJob → claim next URL (returns immediately)
  Embed worker:   embedJob → embedder.Embed → UpsertPassageBatch
                  (or per-chunk fallback when batch unavailable)
Pre-decouple, each crawler worker held onto a URL for fetch + parse +
BM25 + (Embed network call + HNSW writes for N chunks). With 512
workers contending on p.mu and the HNSW write lock, the synchronous
embed leg dominated per-cycle latency.
Bounded send (8K-default buffer): if the embed pool falls behind, the
hot path increments embedDropped and continues. The dropped docs land
in embed-backfill later, which the operator runs anyway. Counters
(embedQueued/Done/Failed/Dropped) logged on shutdown so we can verify
the pool kept up.
Closes the embed channel only after crawl workers exit so no producer
races a closed channel. Zombie-reclaim and per-host overrides preserved.
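A minimal sketch of the bounded, non-blocking hand-off and the shutdown ordering described above. The type and method names (`embedPool`, `TrySubmit`, `Close`) are hypothetical; only the behaviour (drop and count when the buffer is full, close the channel after producers stop) is taken from the message.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

var errEmbed = errors.New("embed failed")

type embedJob struct {
	URL    string
	Chunks []string
}

// embedPool drains a buffered channel with its own workers so the crawl
// hot path never waits on the embedder.
type embedPool struct {
	jobs                          chan embedJob
	wg                            sync.WaitGroup
	queued, done, failed, dropped atomic.Int64
}

func newEmbedPool(workers, buffer int, embed func(embedJob) error) *embedPool {
	p := &embedPool{jobs: make(chan embedJob, buffer)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for j := range p.jobs { // exits once jobs is closed and drained
				if err := embed(j); err != nil {
					p.failed.Add(1)
				} else {
					p.done.Add(1)
				}
			}
		}()
	}
	return p
}

// TrySubmit is the bounded send: if the buffer is full it counts a drop and
// returns at once instead of blocking the crawl worker.
func (p *embedPool) TrySubmit(j embedJob) bool {
	select {
	case p.jobs <- j:
		p.queued.Add(1)
		return true
	default:
		p.dropped.Add(1)
		return false
	}
}

// Close must run only after every producer has stopped; otherwise a late
// TrySubmit would panic on the closed channel.
func (p *embedPool) Close() {
	close(p.jobs)
	p.wg.Wait()
}

func main() {
	p := newEmbedPool(4, 8192, func(j embedJob) error {
		if len(j.Chunks) == 0 {
			return errEmbed
		}
		return nil
	})
	for i := 0; i < 100; i++ {
		p.TrySubmit(embedJob{URL: fmt.Sprintf("https://example.com/%d", i), Chunks: []string{"chunk"}})
	}
	p.TrySubmit(embedJob{URL: "https://example.com/empty"})
	p.Close()
	fmt.Printf("queued=%d done=%d failed=%d dropped=%d\n",
		p.queued.Load(), p.done.Load(), p.failed.Load(), p.dropped.Load())
}
```

The `select` with a `default` branch is what makes the send non-blocking; the trade-off is that a slow embedder loses work to the backfill path rather than slowing the crawl.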
…InFlight

RecoverInFlight predates lanes: at every restart it deleted only the LEGACY 'i' key and re-queued under the LEGACY 'q' key. Two consequences that took a session to spot:

1. Stale lane-aware 'i' keys leaked one set per restart, eventually pushing GetLaneStats's in_flight count above max_concurrent (saw lane 1 if=891 with cap=512).
2. URLs that lived in lane 1/2/3 silently reverted to the legacy queue on every recovery, so the lane infrastructure's gains melted away across restarts.

Recovery now blind-deletes both legacy and lane-aware 'i' keys (mirrors transitionFrontier), then re-queues at the entry's own Lane so recovered work stays in its priority class.

PurgeStaleInFlight + POST /admin/frontier-purge-stale-inflight is the one-shot sweep for pre-fix leftovers: it walks all 'f'+'i'+... keys and drops any without a matching primary in InFlight. Ran on GH200 after deploy: purged 783 keys, lane 1 in_flight dropped from 891 → 239.

Also adds COSIFT_EMBED_DECOUPLE_WORKERS / _BUFFER plumbing (Crawler embed pool + buffered channel), committed in a prior change but the recovery bug was making it look like a regression. Live testing on the clean indexes is the right way to actually measure its impact.
The hot path was taking p.mu THREE times per finished doc — Upsert,
Index, Complete — each one queueing 512 workers in a single global lock
that took 5-15ms per round-trip. At sustained crawl load that's a
synchronous bottleneck no amount of worker concurrency could break.
PebbleStore.WriteCrawlResult folds all three operations into ONE mu
acquire + ONE batch commit:
- Tokenize runs OUTSIDE the lock (CPU-parallel, no shared state)
- Inside the lock: ID resolution, BM25 postings prep, frontier
in_flight→Done transition
- Single batch.Commit at the end
CrawlResultWriter interface is optional: stores that don't implement
it (SQLite, mocks) fall back to the three-call legacy path
automatically. PebbleStore satisfies it; in-serve crawler picks it up
via type assertion in processClaimed.
To avoid a redundant CompleteFrontier in the worker loop after
WriteCrawlResult already did it, processClaimed marks the URL in a
small completedInlineSet; the worker loop consumes-and-deletes the
marker before deciding whether to call its own Complete. Single
sync.Map operation per cycle — far cheaper than the mu round-trip
this replaces.
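The optional-interface fallback and the inline-completion marker can be sketched like this. The method signatures, `finishCycle`, and the two toy stores are assumptions made for illustration; the interface name, the type assertion in processClaimed, and the consume-and-delete marker come from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is the legacy three-call surface (signatures are hypothetical).
type Store interface {
	UpsertDocument(url, body string) error
	IndexDocument(url, body string) error
	CompleteFrontier(url string) error
}

// CrawlResultWriter is the optional combined path: upsert, index, and
// frontier completion in one call.
type CrawlResultWriter interface {
	WriteCrawlResult(url, body string) error
}

// completedInline records URLs whose completion already happened inside
// WriteCrawlResult.
var completedInline sync.Map

func processClaimed(s Store, url, body string) error {
	if w, ok := s.(CrawlResultWriter); ok {
		if err := w.WriteCrawlResult(url, body); err != nil {
			return err
		}
		completedInline.Store(url, struct{}{})
		return nil
	}
	// Fallback for stores without the combined path.
	if err := s.UpsertDocument(url, body); err != nil {
		return err
	}
	return s.IndexDocument(url, body)
}

// finishCycle is the worker-loop step: consume the marker if present,
// otherwise complete the URL the legacy way.
func finishCycle(s Store, url string) error {
	if _, inline := completedInline.LoadAndDelete(url); inline {
		return nil
	}
	return s.CompleteFrontier(url)
}

// legacyStore only implements the three-call surface.
type legacyStore struct{ calls []string }

func (l *legacyStore) UpsertDocument(url, body string) error {
	l.calls = append(l.calls, "upsert")
	return nil
}
func (l *legacyStore) IndexDocument(url, body string) error {
	l.calls = append(l.calls, "index")
	return nil
}
func (l *legacyStore) CompleteFrontier(url string) error {
	l.calls = append(l.calls, "complete")
	return nil
}

// batchedStore additionally satisfies CrawlResultWriter.
type batchedStore struct{ legacyStore }

func (b *batchedStore) WriteCrawlResult(url, body string) error {
	b.calls = append(b.calls, "write")
	return nil
}

func main() {
	legacy, batched := &legacyStore{}, &batchedStore{}
	for _, s := range []Store{legacy, batched} {
		_ = processClaimed(s, "https://example.com/", "body")
		_ = finishCycle(s, "https://example.com/")
	}
	fmt.Println(legacy.calls)  // prints [upsert index complete]
	fmt.Println(batched.calls) // prints [write]
}
```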
Expected effect: per-worker cycle time should drop by ~50% (mu hops
were ~60% of the per-cycle non-network time per pprof), letting the
existing 512-worker cap translate into proportionally higher doc/min
throughput.
PurgeFrontierByHost was lane-blind — it only walked the legacy
'f'+'q'+host+0x00+url index, silently missing the lane-aware
'f'+'q'+lane+host+0x00+url range. On the GH200 this meant the
admin/frontier-purge-host endpoint returned "purged: 291" for
cloud.google.com when 2.8M URLs were actually queued. Fixed: the
purger now walks the legacy range AND every lane's range, so demoted
hosts can actually be purged. Verified live: re-purge of
cloud.google.com after the fix dropped 3,092,546 URLs.
hostSweeperLoop is the new self-cleaning background goroutine — wakes
every 10 min (configurable via COSIFT_HOSTSWEEP_INTERVAL_SEC),
walks the existing hostStats sync.Map, and acts on hosts with
COSIFT_HOSTSWEEP_MIN_ATTEMPTS (default 100) recorded attempts:
  success_rate < COSIFT_HOSTSWEEP_DEAD_RATE (default 0.20)
    → PurgeFrontierByHost + add to autoBlocked sync.Map so future
      link discovery skips the host entirely
  COSIFT_HOSTSWEEP_DEAD_RATE ≤ rate < COSIFT_HOSTSWEEP_WEAK_RATE
  (default 0.50)
    → DemoteHostToLane(LaneBulk) so the host's URLs keep draining
      but at the 5%-weight bulk lane instead of crowding lanes 1/2
Live confirmation: within 10 min of going live, the sweeper detected
448,028 newly-discovered cloud.google.com URLs (success rate 0.21)
and demoted them to lane 3. Eliminates the manual
/admin/frontier-purge-host operator workflow.
Optional surfaces (HostFrontierPurger, HostFrontierDemoter) on the
store interface keep the SQLite legacy backend a no-op for these.
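The sweeper's per-host decision reduces to a small threshold function, sketched here. `classifyHost`, `sweepConfig`, and the action names are hypothetical; the thresholds and defaults are the ones listed above, and "hosts with MIN_ATTEMPTS recorded attempts" is read as "at least that many".

```go
package main

import "fmt"

type sweepAction int

const (
	sweepKeep   sweepAction = iota // leave the host alone
	sweepDemote                    // DemoteHostToLane(LaneBulk)
	sweepPurge                     // PurgeFrontierByHost + auto-block
)

// sweepConfig mirrors the COSIFT_HOSTSWEEP_* settings from the message.
type sweepConfig struct {
	MinAttempts int     // COSIFT_HOSTSWEEP_MIN_ATTEMPTS
	DeadRate    float64 // COSIFT_HOSTSWEEP_DEAD_RATE
	WeakRate    float64 // COSIFT_HOSTSWEEP_WEAK_RATE
}

func defaultSweepConfig() sweepConfig {
	return sweepConfig{MinAttempts: 100, DeadRate: 0.20, WeakRate: 0.50}
}

// classifyHost applies the two thresholds to a host's attempt counters.
// Hosts with too few attempts are never acted on.
func classifyHost(attempts, successes int, cfg sweepConfig) sweepAction {
	if attempts <= 0 || attempts < cfg.MinAttempts {
		return sweepKeep
	}
	rate := float64(successes) / float64(attempts)
	switch {
	case rate < cfg.DeadRate:
		return sweepPurge
	case rate < cfg.WeakRate:
		return sweepDemote
	default:
		return sweepKeep
	}
}

func main() {
	cfg := defaultSweepConfig()
	// A host at success rate 0.21 sits just above the dead threshold,
	// so it is demoted rather than purged.
	fmt.Println(classifyHost(1000, 210, cfg) == sweepDemote) // prints true
	fmt.Println(classifyHost(1000, 50, cfg) == sweepPurge)   // prints true
	fmt.Println(classifyHost(40, 0, cfg) == sweepKeep)       // prints true
}
```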
Adds an adult/spam classifier (host+TLD match plus >=2 distinct body-term threshold) gated behind crawler.filter_adult, wired into the crawl pipeline, plus a purge-adult command to sweep already-indexed adult/spam docs with a safety gate on the match fraction.
Search:
- Add a 'site' parameter to /search, /answer and /research (GET query + POST body) that scopes results by host suffix AND optional URL path prefix, e.g. site=pilotprotocol.network/docs. Segment-boundary path match; ANDs with include_domains/exclude_domains.

Crawl:
- Add Crawler.SeedSitemapLane so sitemap URLs can be enqueued into a chosen frontier lane; SeedSitemap now delegates (refresh lane, unchanged).
- Add POST /admin/site-submit: discover a site's URLs (robots.txt Sitemap: directives, then canonical/CMS fallbacks) and enqueue them all into the high-priority submitted lane by default (lane configurable).
- Factor shared discoverSitemaps/normalizeBareHost helpers out of site-pack.

Tests: scope parsing, host+path matching, lane mapping, sitemap discovery, an end-to-end /search?site= test, site-submit auth/validation/lane wiring, and SeedSitemapLane lane placement.
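The site= matching rule (host suffix plus segment-boundary path prefix) can be illustrated with a short sketch. `siteScope`, `parseSiteScope`, and `Matches` are hypothetical names, and matching the host suffix on a dot boundary is an assumption borrowed from the purge commands in this PR rather than something the search change states.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// siteScope is a parsed site= value: a host suffix plus an optional path
// prefix that must match on a segment boundary.
type siteScope struct {
	Host       string
	PathPrefix string // "" means any path
}

// parseSiteScope splits "host/path" (an optional scheme is tolerated).
func parseSiteScope(raw string) (siteScope, bool) {
	s := strings.TrimSpace(raw)
	if i := strings.Index(s, "://"); i >= 0 {
		s = s[i+3:]
	}
	host, path, _ := strings.Cut(s, "/")
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	if host == "" {
		return siteScope{}, false
	}
	sc := siteScope{Host: host}
	if path = strings.Trim(path, "/"); path != "" {
		sc.PathPrefix = "/" + path
	}
	return sc, true
}

// Matches reports whether rawURL falls inside the scope.
func (sc siteScope) Matches(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	h := strings.ToLower(u.Hostname())
	// Host suffix on a dot boundary: sub.example.com matches example.com,
	// notexample.com does not.
	if h != sc.Host && !strings.HasSuffix(h, "."+sc.Host) {
		return false
	}
	if sc.PathPrefix == "" {
		return true
	}
	// Segment boundary: /docs matches /docs and /docs/x, not /docsearch.
	return u.Path == sc.PathPrefix || strings.HasPrefix(u.Path, sc.PathPrefix+"/")
}

func main() {
	sc, _ := parseSiteScope("pilotprotocol.network/docs")
	for _, u := range []string{
		"https://pilotprotocol.network/docs",
		"https://pilotprotocol.network/docs/x",
		"https://pilotprotocol.network/docsearch",
		"https://example.com/docs",
	} {
		fmt.Println(u, sc.Matches(u))
	}
}
```

In a search handler this check would be ANDed with the existing include_domains/exclude_domains filters, as the commit message describes.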
Pure host-suffix sweep (dot-boundary) over the corpus: -suffix cfd,sbs soft-deletes every *.cfd and *.sbs doc regardless of content. Companion to the crawler exclude_domains blacklist (which stops new ones) for clearing an already-indexed backlog. Dry-run by default; -apply to delete; -readonly to report alongside a live serve. Mirrors purge-adult's soft-delete + histogram report; reuses matchesAnyDomain for dot-boundary matching.
Summary
Two operator-requested capabilities for cosift:

1. site= search scoping (host + path)
New site parameter on /search, /answer, and /research (GET query + POST body). Scopes results by host suffix AND an optional URL path prefix, e.g. ?site=pilotprotocol.network/docs. Segment-boundary path match (/docs matches /docs and /docs/x, not /docsearch); ANDs with the existing include_domains/exclude_domains.

2. POST /admin/site-submit: submit a whole site to the priority queue
Discovers a site's URLs (robots.txt Sitemap: directives → canonical/CMS fallbacks) and enqueues them all into the high-priority submitted lane by default (lane=priority|refresh|discovered|bulk). Backed by a new Crawler.SeedSitemapLane; SeedSitemap delegates to it (refresh lane, unchanged). Shared discoverSitemaps/normalizeBareHost helpers factored out of site-pack.

Tests
Scope parsing, host+path matching, lane mapping, sitemap discovery (httptest), an end-to-end /search?site= test against the populated fixture, site-submit auth/validation/lane wiring, and SeedSitemapLane lane placement. go build / go vet / gofmt clean.

Verified live (GH200, 13.2M-doc index)
/search and /answer (e.g. /php → all /php, /dotnet → all /dotnet); negative host → 0 hits.
site-submit pilotprotocol.network → found robots sitemap, queued 207 URLs into the submitted lane; crawler drained them ahead of the 1.34M bulk backlog.

Note
This branch also carries a prior commit (adult-content filter + purge-adult command) that was already running on the GH200 but uncommitted; included here so the deployed binary is reproducible from history.