- site-level comments not yet recorded elsewhere:
  - `firsttimedogmom.com`: appears AI-like, has 2023 comments, modified 2025
  - `tips.simplygoodstuff.com`: published 2007, modified 2024
  - `www.greenemath.com`: possible false positive; generic math-teaching pages with YouTube-style descriptions
- after removing no-H1 pages, little H1-level separation remained between Bing LLM and human sites
- feature ideas not yet tracked in feature files:
  - heading-count distribution by level and markup ratio
  - word-frequency features for explainability
- ops/dev incidents not previously logged:
  - missing logging for failures before HTTP request starts
  - year-long bug where forked subprocesses did not share memory correctly
  - CC local-fetch bottleneck from EC2 workers stealing work
  - EC2 clients crashing on disconnect
- TODO: revisit older spam-account literature for framing:
  - UCSD Click Trajectory email-spam work
  - SybilGuard-style spam-account work
- corrected assumption that 100k crawl had completed; rerun initiated
- data-analysis incidents not previously documented:
  - OOM in Pandas path -> DuckDB switch -> DuckDB segfault -> fix
  - cached DuckDB tables on disk for faster reloads
- page-feature release references:
  - https://github.com/SichangHe/DeGenTWeb_docs/releases/tag/data-20260323-cc-page-feat
  - https://github.com/SichangHe/DeGenTWeb_docs/releases/tag/data-20260323-search-page-feat
- implementation workflow details:
  - LLM categorization seed categories/descriptions created
  - workflow: sample page -> LLM picks/proposes category -> pause for human review when category is new
- pending analysis TODOs:
  - inspect most-different plots/pages
  - compare Webis first vs last tar slices
- Exxact outage impact details:
  - no DB access prevented analysis work
  - about 3TB estimated for another DB instance from backup
- follow-up TODOs:
  - CDFs for all metric pairs
  - write 6-page version and literature review update
  - create dump of all surely-LLM sites for Calvin
- sampling-scale details:
  - about 70k total sampled subdomains at this point
  - some search terms appeared to trigger more LLM-site results
- review-process note:
  - manual review of selected boundary sites was pending
- infra/task detail not yet in execution notes:
  - TODO: run Binoculars with NVFP4 on AWS 6000 Blackwell (half done)
- experimentation/cost notes:
  - OpenCode/OpenClaw required heavy token budgets
  - AWS Bedrock experiment incurred immediate cost (~$40)
  - OpenClaw failed to launch headless browser; Browser Use CLI worked
- trend detail:
  - AI rate in 10k Bing-search-result sites rose from 11.9% to 15.5%
- reliability work details:
  - process pool kills timed-out workers to avoid repeated stdlib pooling bugs
  - AWS CC S3 service used trained zstd dictionary
- infrastructure details:
  - Postgres killed by EarlyOOM
  - smaller Postgres tables moved to SSD because HDD writes were too slow
  - Exxact OOM and CPU saturation tied to malfunctioning colmap
- dev details:
  - automatic GPU-selection logic added
  - CC over-crawling fixed
  - review UI improved
- TODOs:
  - investigate free-trial-generated sites and social-media promotion patterns
  - add CC sites to baseline anyway
  - crawl more pages from non-unimodal sites for bimodal checks
- filter-calibration detail:
  - dropping 50% no-punctuation filter kept baseline acceptable
  - approximately 9 pre-ChatGPT false-positive subdomains supported keeping the change
- architecture detail:
  - moved HTTP client/browser controller/CC downloader to actor-model style
- crawl-data consistency details:
  - subdomain/URL mismatch bug tracked in some crawls
  - sitemap links crossing subdomains were dropped as a temporary rule
- baseline-size calibration detail:
  - around 14 pages/site converged, with 12 pages very close
- poster-feedback details:
  - gpt-neo-2.7B may outperform Falcon-7B in one reproduction slide deck
  - sampling decoding plus repetition penalty can break detector behavior
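
The temporary sitemap rule noted above (dropping links that cross subdomains) could be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: the function names `same_subdomain` and `filter_sitemap_links` are hypothetical, and exact, case-insensitive hostname matching is an assumption.

```python
from urllib.parse import urlsplit


def same_subdomain(url: str, site_host: str) -> bool:
    """Return True if `url` has a hostname equal to `site_host` (case-insensitive)."""
    host = urlsplit(url).hostname  # None for relative or malformed URLs
    return host is not None and host == site_host.lower()


def filter_sitemap_links(links: list[str], site_host: str) -> list[str]:
    """Keep only sitemap links that stay on the crawled subdomain (assumed rule)."""
    return [url for url in links if same_subdomain(url, site_host)]


# Hypothetical sitemap entries for a crawl of blog.example.com.
links = [
    "https://blog.example.com/post-1",
    "https://shop.example.com/item-2",  # crosses subdomains: dropped
    "/relative/path",  # no hostname: dropped
]
print(filter_sitemap_links(links, "blog.example.com"))
```

Under this sketch, `www.example.com` and `example.com` count as different subdomains; whether that matches the real rule is not stated in these notes.
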
- date-cue examples (`galaxy.ai`, `cs2.kinguin.net`) are in `baseline_sites.md`
- image/video/lighthouse TODOs are in `execution.md` and `llm_site_features.md`
- many bimodal site examples are in `bimodal.md`
- Bing 500 WikiHow examples are in `preliminary_binoculars_eval.md`
- Google layout parsing, WikiHow dataset, and Google Trends notes are in `web_search.md`, `wikihow.md`, and `google_trends.md`
- elsewhere
- caveat: some AI-heavy sites use fake old publication years; prefer `last-modified` when available (e.g., `galaxy.ai`, `cs2.kinguin.net`)
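
A minimal sketch of the caveat above, assuming both dates have already been parsed for each page. The helper names and the sample dates are hypothetical; the cutoff is ChatGPT's public release date (2022-11-30).

```python
from datetime import date
from typing import Optional

# ChatGPT's public release date, used here as the "pre-ChatGPT" cutoff.
CHATGPT_RELEASE = date(2022, 11, 30)


def effective_date(published: Optional[date], last_modified: Optional[date]) -> Optional[date]:
    """Prefer last-modified when available; otherwise fall back to the published date."""
    return last_modified if last_modified is not None else published


def is_pre_chatgpt(published: Optional[date], last_modified: Optional[date]) -> bool:
    """True only if the effective date exists and precedes the cutoff."""
    d = effective_date(published, last_modified)
    return d is not None and d < CHATGPT_RELEASE


# Hypothetical page claiming a 2015 publication year but modified in 2025.
print(is_pre_chatgpt(date(2015, 1, 1), date(2025, 3, 1)))  # False
# Same page with no last-modified value: only the claimed year is available.
print(is_pre_chatgpt(date(2015, 1, 1), None))  # True
```
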