Pareto-optimal models for cleaning the web — fast, encoder-based main-content extraction from HTML.
nlp encoder html-to-markdown transformers web-scraping content-extraction html-extraction boilerplate-removal rag main-content-extraction
Updated Jul 1, 2026 - HTML