feat(search): site= scoping filter + /admin/site-submit priority enqueue by TeoSlayer · Pull Request #20 · pilot-protocol/cosift

TeoSlayer · 2026-06-15T15:43:52Z

Summary

Two operator-requested capabilities for cosift:

1. `site=` search scoping (host + path)

New site parameter on /search, /answer, and /research (GET query + POST body). Scopes results by host suffix AND an optional URL path prefix, e.g. ?site=pilotprotocol.network/docs. Segment-boundary path match (/docs matches /docs and /docs/x, not /docsearch); ANDs with the existing include_domains/exclude_domains.
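
A minimal sketch of the host + path matching described above. It is not the code from this branch: `siteScope`, `parseSiteScope`, and `matches` are hypothetical names, and the dot-boundary host check is an assumption about what "host suffix" means here.

```go
package main

import "strings"

// siteScope is a hypothetical parsed form of the site= parameter.
type siteScope struct {
	host string // lowercased host suffix, e.g. "pilotprotocol.network"
	path string // optional path prefix without a trailing slash, e.g. "/docs"
}

// parseSiteScope splits "host/path" into its parts. A leading scheme is dropped.
func parseSiteScope(raw string) siteScope {
	s := strings.TrimSpace(raw)
	if i := strings.Index(s, "://"); i >= 0 {
		s = s[i+3:]
	}
	host, path := s, ""
	if i := strings.Index(s, "/"); i >= 0 {
		host, path = s[:i], s[i:]
	}
	return siteScope{
		host: strings.ToLower(host),
		path: strings.TrimRight(path, "/"),
	}
}

// matches reports whether a document's host and URL path fall inside the scope.
func (sc siteScope) matches(docHost, docPath string) bool {
	h := strings.ToLower(docHost)
	// Host suffix on a dot boundary (assumed): "www.example.com" matches
	// "example.com", "badexample.com" does not.
	if h != sc.host && !strings.HasSuffix(h, "."+sc.host) {
		return false
	}
	if sc.path == "" {
		return true
	}
	// Segment boundary: "/docs" matches "/docs" and "/docs/x", not "/docsearch".
	return docPath == sc.path || strings.HasPrefix(docPath, sc.path+"/")
}
```

Appending "/" before the prefix comparison is what keeps /docsearch out of a /docs scope; a plain `strings.HasPrefix(docPath, sc.path)` would let it through.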

2. `POST /admin/site-submit` — submit a whole site to the priority queue

Discovers a site's URLs (robots.txt Sitemap: directives → canonical/CMS fallbacks) and enqueues them all into the high-priority submitted lane by default (lane = priority|refresh|discovered|bulk). Backed by a new Crawler.SeedSitemapLane; SeedSitemap delegates to it (refresh lane, unchanged). Shared discoverSitemaps/normalizeBareHost helpers factored out of site-pack.
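
The discovery order above (robots.txt `Sitemap:` directives first, then fallbacks) could look roughly like the sketch below. `sitemapsFromRobots` and `fallbackSitemaps` are hypothetical stand-ins for the shared `discoverSitemaps` helper, whose real signature is not shown in this PR, and the two fallback paths are assumed examples of "canonical/CMS" locations.

```go
package main

import "strings"

// sitemapsFromRobots returns the URLs named by "Sitemap:" lines in a
// robots.txt body. The directive name is matched case-insensitively.
func sitemapsFromRobots(body string) []string {
	var out []string
	for _, line := range strings.Split(body, "\n") {
		key, val, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok || !strings.EqualFold(strings.TrimSpace(key), "sitemap") {
			continue
		}
		if u := strings.TrimSpace(val); u != "" {
			out = append(out, u)
		}
	}
	return out
}

// fallbackSitemaps lists conventional locations to probe when robots.txt
// names no sitemap. These two paths are assumptions for illustration.
func fallbackSitemaps(host string) []string {
	base := "https://" + host
	return []string{base + "/sitemap.xml", base + "/sitemap_index.xml"}
}
```

`strings.Cut` splits on the first colon only, so the `https://` inside the directive value stays intact.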

Tests

Scope parsing, host+path matching, lane mapping, sitemap discovery (httptest), an end-to-end /search?site= test against the populated fixture, site-submit auth/validation/lane wiring, and SeedSitemapLane lane placement. go build/go vet/gofmt clean.

Verified live (GH200, 13.2M-doc index)

Host scope and path scope on /search and /answer (e.g. /php → all /php, /dotnet → all /dotnet); negative host → 0 hits.
site-submit pilotprotocol.network → found robots sitemap, queued 207 URLs into the submitted lane; crawler drained them ahead of the 1.34M bulk backlog.

Note

This branch also carries a prior commit (adult-content filter + purge-adult command) that was already running on the GH200 but uncommitted; included here so the deployed binary is reproducible from history.

Adds four lanes (submitted/refresh/discovered/bulk) so high-value URLs from RSS, sitemaps, and publisher submissions jump the bulk-crawl backlog instead of waiting behind 2.8M cloud.google.com pages. Default weights 50/30/15/5; empty lanes donate their share to the next priority.

Wire format: 'f' + sub + lane + host + 0x00 + url for the lane-aware secondary index. Lane byte (0..3) is below printable-ASCII so legacy and lane-aware keys coexist; ClaimFrontier scans new format first then falls back to the legacy 'f' + sub + host index so the existing 4.3M queued URLs drain naturally without a synchronous migration.

frontierEntry gains a trailing Lane byte; missing bytes decode as LaneDiscovered (2) so pre-lane rows keep working through transitions.

Per-lane round-robin cursor (laneCursors) and a monotonic lane tick (laneTick) drive deterministic weighted RR: fair without per-call randomness. Host-fairness preserved within each lane.

GetLaneStats walks both secondary indexes key-only and tallies per lane; surfaced in /queue as a lanes[] block plus legacy_queued / legacy_in_flight totals so operators can see whether the RR is actually draining RSS ahead of bulk.

SeedRSS and SeedSitemap push to LaneRefresh and bypass allowedDomain: the operator explicitly requested the feed/sitemap so its URLs are trusted regardless of the curated include_domains list. Crawler outbound-link discovery still goes through Seed (which defaults to LaneDiscovered and respects allowedDomain), so include_domains continues to gate organic exploration as designed.

Backwards-compat notes:
- PushFrontier is a thin wrapper over PushFrontierLane(LaneDiscovered)
- transitionFrontier blind-deletes BOTH legacy and lane-aware keys, so completion/failure works for entries created in either era.
- SQLite Store gains a PushFrontierLane stub that ignores the lane (legacy schema has no lane column); production runs on Pebble.
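
To make the key layout and the weighted lane choice concrete, here is a rough sketch. It is not the code from this commit. The values 2 (discovered) and 3 (bulk) come from the text; mapping submitted to 0 and refresh to 1 is inferred from the listed order. `laneKey`, `keyLane`, and `pickLane` are hypothetical names, and the slot-based selection is only one way to get a deterministic weighted round-robin from a monotonic tick.

```go
package main

// Lane bytes. 2 and 3 are stated in the commit text; 0 and 1 are inferred.
const (
	LaneSubmitted  byte = 0
	LaneRefresh    byte = 1
	LaneDiscovered byte = 2
	LaneBulk       byte = 3
)

// laneKey builds 'f' + sub + lane + host + 0x00 + url.
// sub is the state byte, e.g. 'q' (queued) or 'i' (in flight).
func laneKey(sub, lane byte, host, url string) []byte {
	k := make([]byte, 0, 4+len(host)+len(url))
	k = append(k, 'f', sub, lane)
	k = append(k, host...)
	k = append(k, 0x00)
	k = append(k, url...)
	return k
}

// keyLane reports the lane of a lane-aware key. Legacy keys carry the first
// byte of a (non-empty, printable) host in position 2, so they return false.
func keyLane(k []byte) (byte, bool) {
	if len(k) < 3 || k[0] != 'f' || k[2] > LaneBulk {
		return 0, false
	}
	return k[2], true
}

// Default weights from the commit text; they sum to 100.
var laneWeights = [4]int{50, 30, 15, 5}

// pickLane maps a monotonic tick onto a lane in proportion to laneWeights.
// If that lane is empty, its slot goes to the next non-empty lane in
// priority order (assumed reading of "donate their share").
func pickLane(tick uint64, nonEmpty [4]bool) (byte, bool) {
	slot := int(tick % 100)
	start := 0
	for l, w := range laneWeights {
		if slot < w {
			start = l
			break
		}
		slot -= w
	}
	for i := 0; i < 4; i++ {
		l := (start + i) % 4
		if nonEmpty[l] {
			return byte(l), true
		}
	}
	return 0, false
}
```

Because the choice depends only on the tick and on which lanes are non-empty, two runs over the same queue state claim in the same order, which is the "fair without per-call randomness" property. A production version would likely interleave slots rather than serve each lane in a contiguous run.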

PushFrontierBatch lets a caller insert N URLs in a single Pebble batch and a single p.mu acquire. SeedRSS and SeedSitemap now buffer URLs and flush via this path: a 25-URL reddit feed that previously took 8-17 minutes (one mu hop per URL, contending with 256 crawler workers) now lands in milliseconds. Sitemap streaming flushes every 1024 URLs so a 100K-entry kubernetes.io sitemap doesn't hold the lock for the whole parse.

DemoteHostToLane walks every queued URL for a host (across legacy AND lane-aware indexes) and re-keys to a target lane. The escape hatch for the cloud.google.com situation: 2.8M queued URLs on one host blocked 65% of the host-fair claim slots from fresher lanes. Re-keys atomically in 1024-URL batches; skips URLs that flipped to in_flight under us. New endpoint POST /admin/frontier-demote-host {host, lane} surfaces it.

Tested on the GH200: cloud.google.com → lane 3 moved 2,804,001 URLs in 31 seconds (~90K rekeys/sec); steady crawl rate went from 79 to 134 docs/min on the next sample (+70%).
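
The buffer-and-flush pattern used by the seeders can be sketched as follows. `urlBuffer` is a hypothetical helper, and the `push` callback only approximates `PushFrontierBatch`, whose real signature is not given in this commit message.

```go
package main

// urlBuffer accumulates URLs and hands them to push in chunks of at most
// limit, so one lock acquire covers many URLs but never a whole sitemap.
type urlBuffer struct {
	push  func(urls []string, lane byte) error // stand-in for PushFrontierBatch (assumed shape)
	lane  byte
	limit int // e.g. 1024, as in the commit text
	buf   []string
}

// Add appends one URL and flushes when the buffer reaches the limit.
func (b *urlBuffer) Add(u string) error {
	b.buf = append(b.buf, u)
	if len(b.buf) >= b.limit {
		return b.Flush()
	}
	return nil
}

// Flush pushes whatever is buffered. The backing array is reused afterwards,
// so push must not retain the slice it receives.
func (b *urlBuffer) Flush() error {
	if len(b.buf) == 0 {
		return nil
	}
	err := b.push(b.buf, b.lane)
	b.buf = b.buf[:0]
	return err
}
```

A caller streams URLs through `Add` and calls `Flush` once at the end to push the final partial chunk.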

Optional embed worker pool drains a buffered channel separate from the crawl-worker loop. Enabled when COSIFT_EMBED_DECOUPLE_WORKERS > 0:

- Crawler worker: fetch → parse → UpsertDocument → IndexDocument → push embedJob → claim next URL (returns immediately)
- Embed worker: embedJob → embedder.Embed → UpsertPassageBatch (or per-chunk fallback when batch unavailable)

Pre-decouple, each crawler worker held onto a URL for fetch + parse + BM25 + (Embed network call + HNSW writes for N chunks). With 512 workers contending on p.mu and the HNSW write lock, the synchronous embed leg dominated per-cycle latency.

Bounded send (8K-default buffer): if the embed pool falls behind, the hot path increments embedDropped and continues. The dropped docs land in embed-backfill later, which the operator runs anyway.

Counters (embedQueued/Done/Failed/Dropped) logged on shutdown so we can verify the pool kept up. Closes the embed channel only after crawl workers exit so no producer races a closed channel. Zombie-reclaim and per-host overrides preserved.
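
The bounded, non-blocking hand-off described above is sketched below. The type and method names (`embedPool`, `offer`, `shutdown`) are hypothetical, and the embed step is reduced to a callback; only the `select`/`default` drop behaviour and the shutdown ordering are meant to mirror the commit text.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// embedJob is a placeholder payload; the real job fields are not shown in the PR.
type embedJob struct {
	docID uint64
	text  string
}

type embedPool struct {
	jobs                  chan embedJob
	wg                    sync.WaitGroup
	queued, done, dropped atomic.Int64
}

// newEmbedPool starts `workers` goroutines draining a channel of size `buffer`.
func newEmbedPool(workers, buffer int, embed func(embedJob)) *embedPool {
	p := &embedPool{jobs: make(chan embedJob, buffer)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for j := range p.jobs {
				embed(j)
				p.done.Add(1)
			}
		}()
	}
	return p
}

// offer never blocks the crawl worker: if the buffer is full the job is
// counted as dropped and left for a later backfill pass.
func (p *embedPool) offer(j embedJob) bool {
	select {
	case p.jobs <- j:
		p.queued.Add(1)
		return true
	default:
		p.dropped.Add(1)
		return false
	}
}

// shutdown must run only after every producer has stopped, so no sender
// can race the close. It waits for queued jobs to drain.
func (p *embedPool) shutdown() {
	close(p.jobs)
	p.wg.Wait()
}
```

Dropping instead of blocking trades completeness for crawl throughput, which is acceptable here only because a backfill pass picks up the missed documents.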

…InFlight

RecoverInFlight predates lanes: at every restart it deleted only the LEGACY 'i' key and re-queued under the LEGACY 'q' key. Two consequences that took a session to spot:

1. Stale lane-aware 'i' keys leaked one set per restart, eventually pushing GetLaneStats's in_flight count above max_concurrent (saw lane 1 if=891 with cap=512).
2. URLs that lived in lane 1/2/3 silently reverted to the legacy queue on every recovery, so the lane infrastructure's gains melted away across restarts.

Recovery now: blind-deletes both legacy and lane-aware 'i' keys (mirrors transitionFrontier), then re-queues at the entry's own Lane so recovered work stays in its priority class.

PurgeStaleInFlight + POST /admin/frontier-purge-stale-inflight is the one-shot sweep for pre-fix leftovers: walks all 'f'+'i'+... keys and drops any without a matching primary in InFlight. Ran on GH200 after deploy: purged 783 keys, lane 1 in_flight dropped from 891 → 239.

Also adds COSIFT_EMBED_DECOUPLE_WORKERS / _BUFFER plumbing (Crawler embed pool + buffered channel), committed in a prior change, but the recovery bug was making it look like a regression. Live testing on the clean indexes is the right way to actually measure its impact.

The hot path was taking p.mu THREE times per finished doc (Upsert, Index, Complete), each one queueing 512 workers in a single global lock that took 5-15ms per round-trip. At sustained crawl load that's a synchronous bottleneck no amount of worker concurrency could break.

PebbleStore.WriteCrawlResult folds all three operations into ONE mu acquire + ONE batch commit:

- Tokenize runs OUTSIDE the lock (CPU-parallel, no shared state)
- Inside the lock: ID resolution, BM25 postings prep, frontier in_flight→Done transition
- Single batch.Commit at the end

CrawlResultWriter interface is optional: stores that don't implement it (SQLite, mocks) fall back to the three-call legacy path automatically. PebbleStore satisfies it; in-serve crawler picks it up via type assertion in processClaimed.

To avoid a redundant CompleteFrontier in the worker loop after WriteCrawlResult already did it, processClaimed marks the URL in a small completedInlineSet; the worker loop consumes-and-deletes the marker before deciding whether to call its own Complete. Single sync.Map operation per cycle, far cheaper than the mu round-trip this replaces.

Expected effect: per-worker cycle time should drop by ~50% (mu hops were ~60% of the per-cycle non-network time per pprof), letting the existing 512-worker cap translate into proportionally higher doc/min throughput.
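
The optional-interface fallback can be illustrated with a small sketch. The interface names follow the commit text, but the method signatures, the `Doc` type, `persist`, and both fake stores are assumptions made up for this example.

```go
package main

// Doc is a placeholder for a crawled document.
type Doc struct{ URL, Body string }

// Store is the baseline three-call surface (signatures assumed).
type Store interface {
	UpsertDocument(d Doc) error
	IndexDocument(d Doc) error
	CompleteFrontier(url string) error
}

// CrawlResultWriter is the optional fast path: one call, one lock, one commit.
type CrawlResultWriter interface {
	WriteCrawlResult(d Doc) error
}

// persist uses the combined write when the store offers it and otherwise
// falls back to the three legacy calls. The bool reports which path ran,
// so the caller can skip its own CompleteFrontier.
func persist(s Store, d Doc) (bool, error) {
	if w, ok := s.(CrawlResultWriter); ok {
		return true, w.WriteCrawlResult(d)
	}
	if err := s.UpsertDocument(d); err != nil {
		return false, err
	}
	if err := s.IndexDocument(d); err != nil {
		return false, err
	}
	return false, s.CompleteFrontier(d.URL)
}

// legacyStore is a fake that only counts the three legacy calls.
type legacyStore struct{ calls int }

func (s *legacyStore) UpsertDocument(Doc) error      { s.calls++; return nil }
func (s *legacyStore) IndexDocument(Doc) error       { s.calls++; return nil }
func (s *legacyStore) CompleteFrontier(string) error { s.calls++; return nil }

// fastStore additionally implements the optional interface.
type fastStore struct {
	legacyStore
	combined int
}

func (s *fastStore) WriteCrawlResult(Doc) error { s.combined++; return nil }
```

Returning which path ran plays the role that completedInlineSet plays in the commit: the worker loop needs to know whether completion has already happened.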

PurgeFrontierByHost was lane-blind: it only walked the legacy 'f'+'q'+host+0x00+url index, silently missing the lane-aware 'f'+'q'+lane+host+0x00+url range. On the GH200 this meant the admin/frontier-purge-host endpoint returned "purged: 291" for cloud.google.com when 2.8M URLs were actually queued. Fixed: the purger now walks the legacy range AND every lane's range, so demoted hosts can actually be purged. Verified live: re-purge of cloud.google.com after the fix dropped 3,092,546 URLs.

hostSweeperLoop is the new self-cleaning background goroutine. It wakes every 10 min (configurable via COSIFT_HOSTSWEEP_INTERVAL_SEC), walks the existing hostStats sync.Map, and acts on hosts with COSIFT_HOSTSWEEP_MIN_ATTEMPTS (default 100) recorded attempts:

- success_rate < COSIFT_HOSTSWEEP_DEAD_RATE (default 0.20) → PurgeFrontierByHost + add to autoBlocked sync.Map so future link discovery skips the host entirely
- COSIFT_HOSTSWEEP_DEAD_RATE ≤ rate < COSIFT_HOSTSWEEP_WEAK_RATE (default 0.50) → DemoteHostToLane(LaneBulk) so the host's URLs keep draining but at the 5%-weight bulk lane instead of crowding lanes 1/2

Live confirmation: within 10 min of going live, the sweeper detected 448,028 newly-discovered cloud.google.com URLs (success rate 0.21) and demoted them to lane 3. Eliminates the manual /admin/frontier-purge-host operator workflow.

Optional surfaces (HostFrontierPurger, HostFrontierDemoter) on the store interface keep the SQLite legacy backend a no-op for these.
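
The sweeper's per-host decision reduces to two thresholds, sketched below. `classifyHost`, `sweepConfig`, and the `hostAction` constants are hypothetical names; the defaults come from the commit text, and reading the minimum as "at least that many attempts" is an assumption.

```go
package main

type hostAction int

const (
	actionKeep   hostAction = iota // leave the host alone
	actionDemote                   // DemoteHostToLane(LaneBulk)
	actionPurge                    // PurgeFrontierByHost + auto-block
)

type sweepConfig struct {
	minAttempts int     // COSIFT_HOSTSWEEP_MIN_ATTEMPTS
	deadRate    float64 // COSIFT_HOSTSWEEP_DEAD_RATE
	weakRate    float64 // COSIFT_HOSTSWEEP_WEAK_RATE
}

// defaultSweep holds the defaults quoted in the commit text.
var defaultSweep = sweepConfig{minAttempts: 100, deadRate: 0.20, weakRate: 0.50}

// classifyHost decides what the sweeper does with one host's stats.
// Hosts with too few attempts are never judged.
func classifyHost(c sweepConfig, attempts, successes int) hostAction {
	if attempts <= 0 || attempts < c.minAttempts {
		return actionKeep
	}
	rate := float64(successes) / float64(attempts)
	switch {
	case rate < c.deadRate:
		return actionPurge
	case rate < c.weakRate:
		return actionDemote
	default:
		return actionKeep
	}
}
```

With the defaults, a host at a 0.21 success rate is demoted, not purged, which matches the cloud.google.com observation above.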

Adds an adult/spam classifier (host+TLD match plus >=2 distinct body-term threshold) gated behind crawler.filter_adult, wired into the crawl pipeline, plus a purge-adult command to sweep already-indexed adult/spam docs with a safety gate on the match fraction.

Search:
- Add a 'site' parameter to /search, /answer and /research (GET query + POST body) that scopes results by host suffix AND optional URL path prefix, e.g. site=pilotprotocol.network/docs. Segment-boundary path match; ANDs with include_domains/exclude_domains.

Crawl:
- Add Crawler.SeedSitemapLane so sitemap URLs can be enqueued into a chosen frontier lane; SeedSitemap now delegates (refresh lane, unchanged).
- Add POST /admin/site-submit: discover a site's URLs (robots.txt Sitemap: directives, then canonical/CMS fallbacks) and enqueue them all into the high-priority submitted lane by default (lane configurable).
- Factor shared discoverSitemaps/normalizeBareHost helpers out of site-pack.

Tests: scope parsing, host+path matching, lane mapping, sitemap discovery, an end-to-end /search?site= test, site-submit auth/validation/lane wiring, and SeedSitemapLane lane placement.

Pure host-suffix sweep (dot-boundary) over the corpus: -suffix cfd,sbs soft-deletes every *.cfd and *.sbs doc regardless of content. Companion to the crawler exclude_domains blacklist (which stops new ones) for clearing an already-indexed backlog. Dry-run by default; -apply to delete; -readonly to report alongside a live serve. Mirrors purge-adult's soft-delete + histogram report; reuses matchesAnyDomain for dot-boundary matching.

teovl added 9 commits June 14, 2026 09:40

TeoSlayer wants to merge 9 commits into main from feat/site-search-and-submit
