
feat(search): site= scoping filter + /admin/site-submit priority enqueue #20

Open
TeoSlayer wants to merge 9 commits into main from feat/site-search-and-submit


@TeoSlayer

Contributor

Summary

Two operator-requested capabilities for cosift:

1. site= search scoping (host + path)

New site parameter on /search, /answer, and /research (GET query + POST body). Scopes results by host suffix AND an optional URL path prefix, e.g. ?site=pilotprotocol.network/docs. Segment-boundary path match (/docs matches /docs and /docs/x, not /docsearch); ANDs with the existing include_domains/exclude_domains.
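
A minimal sketch of the matching rule described above, not the code in this branch. The names `siteScope`, `parseSiteScope` and `matches` are hypothetical, and it assumes the host suffix match is dot-boundary (so `notpilotprotocol.network` does not match `pilotprotocol.network`):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// siteScope is a hypothetical parsed form of the site= parameter.
type siteScope struct {
	Host string // e.g. "pilotprotocol.network"
	Path string // e.g. "/docs"; empty means any path
}

// parseSiteScope splits "host/path" at the first slash.
func parseSiteScope(s string) siteScope {
	s = strings.TrimSpace(s)
	host, path, _ := strings.Cut(s, "/")
	sc := siteScope{Host: strings.ToLower(host)}
	if p := strings.Trim(path, "/"); p != "" {
		sc.Path = "/" + p
	}
	return sc
}

// matches reports whether rawURL falls inside the scope: the host must equal
// the scope host or end in "."+host, and the path must equal the scope path
// or continue it at a segment boundary.
func (sc siteScope) matches(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	if host != sc.Host && !strings.HasSuffix(host, "."+sc.Host) {
		return false
	}
	if sc.Path == "" {
		return true
	}
	return u.Path == sc.Path || strings.HasPrefix(u.Path, sc.Path+"/")
}

func main() {
	sc := parseSiteScope("pilotprotocol.network/docs")
	for _, u := range []string{
		"https://pilotprotocol.network/docs",
		"https://pilotprotocol.network/docs/x",
		"https://pilotprotocol.network/docsearch",
	} {
		fmt.Println(u, sc.matches(u))
	}
}
```

Appending "/" before the prefix check is what keeps /docsearch out while still admitting /docs/x.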

2. POST /admin/site-submit — submit a whole site to the priority queue

Discovers a site's URLs (robots.txt Sitemap: directives → canonical/CMS fallbacks) and enqueues them all into the high-priority submitted lane by default (lane = priority|refresh|discovered|bulk). Backed by a new Crawler.SeedSitemapLane; SeedSitemap delegates to it (refresh lane, unchanged). Shared discoverSitemaps/normalizeBareHost helpers factored out of site-pack.
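
The discovery step can be sketched roughly as follows. This is an illustration under stated assumptions, not the `discoverSitemaps` helper itself: `sitemapsFromRobots` and `fallbackSitemapURLs` are hypothetical names, and the fallback paths are common conventions chosen for the example rather than the list this PR uses.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sitemapsFromRobots returns the URLs named by "Sitemap:" lines in a
// robots.txt body. The directive name is matched case-insensitively.
func sitemapsFromRobots(body string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Cut at the first colon only, so the "https:" in the value survives.
		key, val, ok := strings.Cut(line, ":")
		if !ok || !strings.EqualFold(strings.TrimSpace(key), "sitemap") {
			continue
		}
		if v := strings.TrimSpace(val); v != "" {
			out = append(out, v)
		}
	}
	return out
}

// fallbackSitemapURLs lists conventional locations to probe when robots.txt
// names no sitemap. The exact list is an assumption for this sketch.
func fallbackSitemapURLs(host string) []string {
	base := "https://" + host
	return []string{
		base + "/sitemap.xml",
		base + "/sitemap_index.xml",
		base + "/wp-sitemap.xml",
	}
}

func main() {
	robots := "User-agent: *\nDisallow: /private\nSitemap: https://example.com/sitemap.xml\n"
	found := sitemapsFromRobots(robots)
	if len(found) == 0 {
		found = fallbackSitemapURLs("example.com")
	}
	fmt.Println(found)
}
```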

Tests

Scope parsing, host+path matching, lane mapping, sitemap discovery (httptest), an end-to-end /search?site= test against the populated fixture, site-submit auth/validation/lane wiring, and SeedSitemapLane lane placement. go build/go vet/gofmt clean.

Verified live (GH200, 13.2M-doc index)

  • Host scope and path scope both work on /search and /answer (e.g. a /php scope returned only /php URLs, a /dotnet scope only /dotnet URLs); a host with no matching documents returned 0 hits.
  • site-submit pilotprotocol.network → found robots sitemap, queued 207 URLs into the submitted lane; crawler drained them ahead of the 1.34M bulk backlog.

Note

This branch also carries a prior commit (adult-content filter + purge-adult command) that was already running on the GH200 but uncommitted; included here so the deployed binary is reproducible from history.

teovl added 9 commits June 14, 2026 09:40
Adds four lanes (submitted/refresh/discovered/bulk) so high-value URLs
from RSS, sitemaps, and publisher submissions jump the bulk-crawl
backlog instead of waiting behind 2.8M cloud.google.com pages. Default
weights 50/30/15/5; empty lanes donate their share to the next priority.

Wire format: 'f' + sub + lane + host + 0x00 + url for the lane-aware
secondary index. Lane byte (0..3) is below printable-ASCII so legacy
and lane-aware keys coexist; ClaimFrontier scans new format first then
falls back to the legacy 'f' + sub + host index so the existing 4.3M
queued URLs drain naturally without a synchronous migration.
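
To make the key layout concrete, here is a small sketch of both encodings. The helper names (`laneKey`, `legacyKey`, `isLaneKey`, `sortsBefore`) are hypothetical; only the byte layout comes from the description above, and the sketch assumes hosts are non-empty and start with a printable character.

```go
package main

import (
	"bytes"
	"fmt"
)

const (
	subQueued   byte = 'q'
	subInFlight byte = 'i'
)

// laneKey builds 'f' + sub + lane + host + 0x00 + url.
func laneKey(sub, lane byte, host, url string) []byte {
	k := make([]byte, 0, 3+len(host)+1+len(url))
	k = append(k, 'f', sub, lane)
	k = append(k, host...)
	k = append(k, 0x00)
	k = append(k, url...)
	return k
}

// legacyKey builds the pre-lane layout 'f' + sub + host + 0x00 + url.
func legacyKey(sub byte, host, url string) []byte {
	k := make([]byte, 0, 2+len(host)+1+len(url))
	k = append(k, 'f', sub)
	k = append(k, host...)
	k = append(k, 0x00)
	k = append(k, url...)
	return k
}

// isLaneKey tells the two formats apart by the third byte: lane values
// (0..3) sit below printable ASCII, while a legacy key has the first
// character of the host in that position.
func isLaneKey(k []byte) bool {
	return len(k) > 2 && k[2] < 0x20
}

// sortsBefore reports whether a orders before b in bytewise key order.
func sortsBefore(a, b []byte) bool {
	return bytes.Compare(a, b) < 0
}

func main() {
	lk := laneKey(subQueued, 3, "cloud.google.com", "https://cloud.google.com/docs")
	old := legacyKey(subQueued, "cloud.google.com", "https://cloud.google.com/docs")
	fmt.Printf("%q lane-aware=%v\n", lk, isLaneKey(lk))
	fmt.Printf("%q lane-aware=%v\n", old, isLaneKey(old))
	fmt.Println("lane key sorts first:", sortsBefore(lk, old))
}
```

Because the lane byte is smaller than any printable host character, lane-aware keys order ahead of legacy keys under the same 'f' + sub prefix in a bytewise-sorted store.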

frontierEntry gains a trailing Lane byte; missing bytes decode as
LaneDiscovered (2) so pre-lane rows keep working through transitions.

Per-lane round-robin cursor (laneCursors) and a monotonic lane tick
(laneTick) drive deterministic weighted RR — fair without per-call
randomness. Host-fairness preserved within each lane.
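
One way to picture a deterministic weighted round-robin with share donation is the sketch below. It is a simplified illustration, not the laneTick/laneCursors implementation: the type `laneScheduler` is hypothetical, weights are assumed to sum to 100, and an empty lane's slot is assumed to pass to the next lane in order, wrapping around.

```go
package main

import "fmt"

const numLanes = 4

// Default weights from the description: submitted/refresh/discovered/bulk.
var laneWeights = [numLanes]int{50, 30, 15, 5}

// laneScheduler holds a monotonic tick; no randomness is involved.
type laneScheduler struct {
	tick uint64
}

// pick returns the lane to claim from next, or -1 if every lane is empty.
// queued[i] is the number of URLs currently waiting in lane i.
func (s *laneScheduler) pick(queued [numLanes]int) int {
	slot := int(s.tick % 100)
	s.tick++

	// Map the slot onto a lane through cumulative weights.
	lane, acc := 0, 0
	for i, w := range laneWeights {
		acc += w
		if slot < acc {
			lane = i
			break
		}
	}

	// An empty lane donates its slot to the next lane that has work.
	for i := 0; i < numLanes; i++ {
		cand := (lane + i) % numLanes
		if queued[cand] > 0 {
			return cand
		}
	}
	return -1
}

func main() {
	s := &laneScheduler{}
	var counts [numLanes]int
	for i := 0; i < 100; i++ {
		counts[s.pick([numLanes]int{10, 10, 10, 10})]++
	}
	fmt.Println(counts)
}
```

This version hands out each lane's slots in contiguous runs; a production scheduler would more likely interleave them, but the long-run proportions are the same.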

GetLaneStats walks both secondary indexes key-only and tallies per
lane; surfaced in /queue as a lanes[] block plus legacy_queued /
legacy_in_flight totals so operators can see whether the RR is
actually draining RSS ahead of bulk.

SeedRSS and SeedSitemap push to LaneRefresh and bypass allowedDomain:
the operator explicitly requested the feed/sitemap so its URLs are
trusted regardless of the curated include_domains list. Crawler
outbound-link discovery still goes through Seed (which defaults to
LaneDiscovered and respects allowedDomain) — so include_domains
continues to gate organic exploration as designed.

Backwards-compat notes:
- PushFrontier is a thin wrapper over PushFrontierLane(LaneDiscovered)
- transitionFrontier blind-deletes BOTH legacy and lane-aware keys,
  so completion/failure works for entries created in either era.
- SQLite Store gains a PushFrontierLane stub that ignores the lane
  (legacy schema has no lane column); production runs on Pebble.

PushFrontierBatch lets a caller insert N URLs in a single Pebble batch
and a single p.mu acquire. SeedRSS and SeedSitemap now buffer URLs and
flush via this path: a 25-URL reddit feed that previously took 8-17
minutes (one mu hop per URL, contending with 256 crawler workers) now
lands in milliseconds. Sitemap streaming flushes every 1024 URLs so a
100K-entry kubernetes.io sitemap doesn't hold the lock for the whole
parse.
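
The buffer-and-flush pattern described above looks roughly like this. The `batchSink` interface and `seedStream` function are hypothetical stand-ins (the real PushFrontierBatch signature is not shown in this PR description); only the 1024-URL flush threshold comes from the text.

```go
package main

import "fmt"

const flushEvery = 1024

// batchSink stands in for a store that inserts many URLs per lock acquire.
// Implementations must not retain the slice: the caller reuses it.
type batchSink interface {
	PushBatch(urls []string) error
}

// countingSink records how many batches and URLs it received.
type countingSink struct {
	batches int
	urls    int
}

func (c *countingSink) PushBatch(urls []string) error {
	c.batches++
	c.urls += len(urls)
	return nil
}

// seedStream drains next() and flushes to the sink every flushEvery URLs,
// so a very large sitemap never holds the store lock for the whole parse.
func seedStream(next func() (string, bool), sink batchSink) error {
	buf := make([]string, 0, flushEvery)
	flush := func() error {
		if len(buf) == 0 {
			return nil
		}
		err := sink.PushBatch(buf)
		buf = buf[:0]
		return err
	}
	for {
		u, ok := next()
		if !ok {
			break
		}
		buf = append(buf, u)
		if len(buf) == flushEvery {
			if err := flush(); err != nil {
				return err
			}
		}
	}
	return flush() // final partial batch
}

func main() {
	i := 0
	next := func() (string, bool) {
		if i >= 2500 {
			return "", false
		}
		i++
		return fmt.Sprintf("https://example.com/page/%d", i), true
	}
	sink := &countingSink{}
	if err := seedStream(next, sink); err != nil {
		fmt.Println("seed failed:", err)
		return
	}
	fmt.Println("batches:", sink.batches, "urls:", sink.urls)
}
```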

DemoteHostToLane walks every queued URL for a host (across legacy AND
lane-aware indexes) and re-keys to a target lane. The escape hatch for
the cloud.google.com situation: 2.8M queued URLs on one host blocked
65% of the host-fair claim slots from fresher lanes. Re-keys atomically
in 1024-URL batches; skips URLs that flipped to in_flight under us.

New endpoint POST /admin/frontier-demote-host {host, lane} surfaces it.
Tested on the GH200: cloud.google.com → lane 3 moved 2,804,001 URLs
in 31 seconds (~90K rekeys/sec); steady crawl rate went from 79 to 134
docs/min on the next sample (+70%).

Optional embed worker pool drains a buffered channel separate from the
crawl-worker loop. Enabled when COSIFT_EMBED_DECOUPLE_WORKERS > 0:

  Crawler worker:  fetch → parse → UpsertDocument → IndexDocument →
                   push embedJob → claim next URL  (returns immediately)

  Embed worker:    embedJob → embedder.Embed → UpsertPassageBatch
                   (or per-chunk fallback when batch unavailable)

Pre-decouple, each crawler worker held onto a URL for fetch + parse +
BM25 + (Embed network call + HNSW writes for N chunks). With 512
workers contending on p.mu and the HNSW write lock, the synchronous
embed leg dominated per-cycle latency.

Bounded send (8K-default buffer): if the embed pool falls behind, the
hot path increments embedDropped and continues. The dropped docs land
in embed-backfill later, which the operator runs anyway. Counters
(embedQueued/Done/Failed/Dropped) logged on shutdown so we can verify
the pool kept up.

Closes the embed channel only after crawl workers exit so no producer
races a closed channel. Zombie-reclaim and per-host overrides preserved.
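
A minimal sketch of the bounded-send pattern, assuming a pool type along these lines. `embedPool`, `offer` and `shutdown` are hypothetical names, and the embed function is a placeholder for the real embedder call; the key point is the `select` with a `default` branch, which lets the crawl worker count a drop and move on instead of blocking.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// embedJob is a placeholder payload for this sketch.
type embedJob struct{ DocID uint64 }

type embedPool struct {
	jobs    chan embedJob
	queued  atomic.Int64
	done    atomic.Int64
	dropped atomic.Int64
	wg      sync.WaitGroup
}

// newEmbedPool starts `workers` goroutines draining a channel of size `buffer`.
func newEmbedPool(workers, buffer int, embed func(embedJob)) *embedPool {
	p := &embedPool{jobs: make(chan embedJob, buffer)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for j := range p.jobs {
				embed(j)
				p.done.Add(1)
			}
		}()
	}
	return p
}

// offer never blocks: if the buffer is full the job is counted as dropped
// and left for a later backfill pass.
func (p *embedPool) offer(j embedJob) bool {
	select {
	case p.jobs <- j:
		p.queued.Add(1)
		return true
	default:
		p.dropped.Add(1)
		return false
	}
}

// shutdown must only be called after every producer has stopped, so that
// no send can race the close.
func (p *embedPool) shutdown() {
	close(p.jobs)
	p.wg.Wait()
}

func main() {
	p := newEmbedPool(2, 8, func(embedJob) {})
	for i := 0; i < 5; i++ {
		p.offer(embedJob{DocID: uint64(i)})
	}
	p.shutdown()
	fmt.Printf("queued=%d done=%d dropped=%d\n",
		p.queued.Load(), p.done.Load(), p.dropped.Load())
}
```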
…InFlight

RecoverInFlight predates lanes — at every restart it deleted only the
LEGACY 'i' key and re-queued under the LEGACY 'q' key. Two consequences
that took a session to spot:

1. Stale lane-aware 'i' keys leaked one set per restart, eventually
   pushing GetLaneStats's in_flight count above max_concurrent (saw
   lane 1 if=891 with cap=512).
2. URLs that lived in lane 1/2/3 silently reverted to the legacy
   queue on every recovery, so the lane infrastructure's gains
   melted away across restarts.

Recovery now: blind-deletes both legacy and lane-aware 'i' keys (mirrors
transitionFrontier), then re-queues at the entry's own Lane so recovered
work stays in its priority class.

PurgeStaleInFlight + POST /admin/frontier-purge-stale-inflight is the
one-shot sweep for pre-fix leftovers: walks all 'f'+'i'+... keys and
drops any without a matching primary in InFlight. Ran on GH200 after
deploy: purged 783 keys; lane 1 in_flight dropped from 891 to 239.

Also adds COSIFT_EMBED_DECOUPLE_WORKERS / _BUFFER plumbing (Crawler
embed pool + buffered channel) — committed in a prior change but the
recovery bug was making it look like a regression. Live testing on the
clean indexes is the right way to actually measure its impact.

The hot path was taking p.mu THREE times per finished doc (Upsert,
Index, Complete), each one queueing 512 workers on a single global lock
that took 5-15ms per round-trip. At sustained crawl load that's a
synchronous bottleneck no amount of worker concurrency could break.

PebbleStore.WriteCrawlResult folds all three operations into ONE mu
acquire + ONE batch commit:
  - Tokenize runs OUTSIDE the lock (CPU-parallel, no shared state)
  - Inside the lock: ID resolution, BM25 postings prep, frontier
    in_flight→Done transition
  - Single batch.Commit at the end

CrawlResultWriter interface is optional: stores that don't implement
it (SQLite, mocks) fall back to the three-call legacy path
automatically. PebbleStore satisfies it; in-serve crawler picks it up
via type assertion in processClaimed.
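
The optional-interface fallback can be sketched as below. The interface and method names echo the commit message, but every signature here is an assumption made for the example, as are the `legacyStore`, `fusedStore` and `persist` names.

```go
package main

import "fmt"

// Store is a hypothetical cut-down version of the three-call legacy path.
type Store interface {
	UpsertDocument(url string) error
	IndexDocument(url string) error
	CompleteFrontier(url string) error
}

// CrawlResultWriter is the optional fused path; the signature is assumed.
type CrawlResultWriter interface {
	WriteCrawlResult(url string) error
}

// legacyStore only supports the three separate calls.
type legacyStore struct{ calls int }

func (s *legacyStore) UpsertDocument(string) error   { s.calls++; return nil }
func (s *legacyStore) IndexDocument(string) error    { s.calls++; return nil }
func (s *legacyStore) CompleteFrontier(string) error { s.calls++; return nil }

// fusedStore additionally offers the single-call path.
type fusedStore struct {
	legacyStore
	writes int
}

func (s *fusedStore) WriteCrawlResult(string) error { s.writes++; return nil }

// persist prefers the fused path when the store provides it and reports
// which path ran.
func persist(s Store, url string) (bool, error) {
	if w, ok := s.(CrawlResultWriter); ok {
		return true, w.WriteCrawlResult(url)
	}
	steps := []func(string) error{s.UpsertDocument, s.IndexDocument, s.CompleteFrontier}
	for _, step := range steps {
		if err := step(url); err != nil {
			return false, err
		}
	}
	return false, nil
}

func main() {
	for _, s := range []Store{&legacyStore{}, &fusedStore{}} {
		fused, err := persist(s, "https://example.com/")
		fmt.Printf("%T fused=%v err=%v\n", s, fused, err)
	}
}
```

The type assertion keeps the store interface itself unchanged, so backends that never implement the fused path need no stub.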

To avoid a redundant CompleteFrontier in the worker loop after
WriteCrawlResult already did it, processClaimed marks the URL in a
small completedInlineSet; the worker loop consumes-and-deletes the
marker before deciding whether to call its own Complete. Single
sync.Map operation per cycle — far cheaper than the mu round-trip
this replaces.
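
The consume-and-delete marker maps naturally onto `sync.Map.LoadAndDelete`. The sketch below assumes the set is keyed by URL and uses hypothetical helper names; whether the real completedInlineSet is a sync.Map with this key is not stated beyond the "single sync.Map operation" remark.

```go
package main

import (
	"fmt"
	"sync"
)

// completedInline marks URLs whose frontier completion already happened
// inside the fused write.
var completedInline sync.Map

// markCompletedInline records that url needs no separate Complete call.
func markCompletedInline(url string) {
	completedInline.Store(url, struct{}{})
}

// consumeCompletedInline reports whether url was marked and clears the
// marker in the same atomic operation.
func consumeCompletedInline(url string) bool {
	_, ok := completedInline.LoadAndDelete(url)
	return ok
}

func main() {
	markCompletedInline("https://example.com/a")
	for _, u := range []string{"https://example.com/a", "https://example.com/b"} {
		if consumeCompletedInline(u) {
			fmt.Println(u, "already completed inline; skip Complete")
		} else {
			fmt.Println(u, "needs a separate Complete call")
		}
	}
}
```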

Expected effect: per-worker cycle time should drop by ~50% (mu hops
were ~60% of the per-cycle non-network time per pprof), letting the
existing 512-worker cap translate into proportionally higher doc/min
throughput.

PurgeFrontierByHost was lane-blind — it only walked the legacy
'f'+'q'+host+0x00+url index, silently missing the lane-aware
'f'+'q'+lane+host+0x00+url range. On the GH200 this meant the
admin/frontier-purge-host endpoint returned "purged: 291" for
cloud.google.com when 2.8M URLs were actually queued. Fixed: the
purger now walks the legacy range AND every lane's range, so demoted
hosts can actually be purged. Verified live: re-purge of
cloud.google.com after the fix dropped 3,092,546 URLs.

hostSweeperLoop is the new self-cleaning background goroutine — wakes
every 10 min (configurable via COSIFT_HOSTSWEEP_INTERVAL_SEC),
walks the existing hostStats sync.Map, and acts on hosts with at least
COSIFT_HOSTSWEEP_MIN_ATTEMPTS (default 100) recorded attempts:

  success_rate < COSIFT_HOSTSWEEP_DEAD_RATE (default 0.20)
    → PurgeFrontierByHost + add to autoBlocked sync.Map so future
      link discovery skips the host entirely

  COSIFT_HOSTSWEEP_DEAD_RATE ≤ rate < COSIFT_HOSTSWEEP_WEAK_RATE
  (default 0.50)
    → DemoteHostToLane(LaneBulk) so the host's URLs keep draining
      but at the 5%-weight bulk lane instead of crowding lanes 1/2

Live confirmation: within 10 min of going live, the sweeper detected
448,028 newly-discovered cloud.google.com URLs (success rate 0.21)
and demoted them to lane 3. Eliminates the manual
/admin/frontier-purge-host operator workflow.
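
The sweeper's decision rule reduces to a small classifier. This sketch uses the default thresholds quoted above; the type and function names (`sweepConfig`, `classifyHost`, and the action constants) are hypothetical, and reading the COSIFT_HOSTSWEEP_* variables is left out.

```go
package main

import "fmt"

type sweepAction int

const (
	actionNone   sweepAction = iota // too few attempts, or host is healthy
	actionDemote                    // move the host's URLs to the bulk lane
	actionPurge                     // purge the host's URLs and auto-block it
)

// sweepConfig mirrors the three tunables with their documented defaults.
type sweepConfig struct {
	MinAttempts int
	DeadRate    float64
	WeakRate    float64
}

var defaultSweep = sweepConfig{MinAttempts: 100, DeadRate: 0.20, WeakRate: 0.50}

// classifyHost applies the rule: below DeadRate purge, from DeadRate up to
// but excluding WeakRate demote, otherwise leave the host alone.
func classifyHost(cfg sweepConfig, attempts, successes int) sweepAction {
	if attempts <= 0 || attempts < cfg.MinAttempts {
		return actionNone
	}
	rate := float64(successes) / float64(attempts)
	switch {
	case rate < cfg.DeadRate:
		return actionPurge
	case rate < cfg.WeakRate:
		return actionDemote
	default:
		return actionNone
	}
}

func main() {
	// A success rate of 0.21 sits just above the dead threshold.
	fmt.Println(classifyHost(defaultSweep, 1000, 210) == actionDemote)
}
```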

Optional surfaces (HostFrontierPurger, HostFrontierDemoter) on the
store interface keep the SQLite legacy backend a no-op for these.

Adds an adult/spam classifier (host+TLD match plus >=2 distinct body-term
threshold) gated behind crawler.filter_adult, wired into the crawl pipeline,
plus a purge-adult command to sweep already-indexed adult/spam docs with a
safety gate on the match fraction.

Search:
- Add a 'site' parameter to /search, /answer and /research (GET query +
  POST body) that scopes results by host suffix AND optional URL path
  prefix, e.g. site=pilotprotocol.network/docs. Segment-boundary path
  match; ANDs with include_domains/exclude_domains.

Crawl:
- Add Crawler.SeedSitemapLane so sitemap URLs can be enqueued into a
  chosen frontier lane; SeedSitemap now delegates (refresh lane, unchanged).
- Add POST /admin/site-submit: discover a site's URLs (robots.txt
  Sitemap: directives, then canonical/CMS fallbacks) and enqueue them all
  into the high-priority submitted lane by default (lane configurable).
- Factor shared discoverSitemaps/normalizeBareHost helpers out of site-pack.

Tests: scope parsing, host+path matching, lane mapping, sitemap discovery,
an end-to-end /search?site= test, site-submit auth/validation/lane wiring,
and SeedSitemapLane lane placement.

Pure host-suffix sweep (dot-boundary) over the corpus: -suffix cfd,sbs
soft-deletes every *.cfd and *.sbs doc regardless of content. Companion to
the crawler exclude_domains blacklist (which stops new ones) for clearing an
already-indexed backlog. Dry-run by default; -apply to delete; -readonly to
report alongside a live serve. Mirrors purge-adult's soft-delete + histogram
report; reuses matchesAnyDomain for dot-boundary matching.