From b1d55ad0b78798feef312a1ea794725c722560c2 Mon Sep 17 00:00:00 2001
From: "Bhavnick @ Chonkie"
Date: Mon, 29 Jun 2026 21:54:24 -0700
Subject: [PATCH 1/3] README footer: built by Feyn

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 67d9d80..4f3b8f8 100644
--- a/README.md
+++ b/README.md
@@ -153,5 +153,5 @@ If you use Pulpie in your research, please cite:
 ---
-Built by Chonkie, the open-source work behind Feyn.
+Built by Feyn.

From 7817f99ee36ee7cff8005ba339061cce385f26c5 Mon Sep 17 00:00:00 2001
From: "Bhavnick @ Chonkie"
Date: Mon, 29 Jun 2026 21:56:08 -0700
Subject: [PATCH 2/3] README: static 'python 3.9+' badge instead of per-version list

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4f3b8f8..73b1957 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@
 [![PyPI version](https://img.shields.io/pypi/v/pulpie.svg)](https://pypi.org/project/pulpie/)
-[![Python versions](https://img.shields.io/pypi/pyversions/pulpie.svg)](https://pypi.org/project/pulpie/)
+[![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://pypi.org/project/pulpie/)
 [![License](https://img.shields.io/github/license/chonkie-inc/pulpie.svg)](https://github.com/chonkie-inc/pulpie/blob/main/LICENSE)
 [![Downloads](https://static.pepy.tech/badge/pulpie)](https://pepy.tech/project/pulpie)
 [![Blog](https://img.shields.io/badge/blog-read%20the%20writeup-E34C26.svg)](https://usefeyn.com/blog/pulpie-pareto-optimal-models-for-cleaning-the-web/)

From 6093e7468fe48618ec0f41c7d3fee43df2d7d9bb Mon Sep 17 00:00:00 2001
From: "Bhavnick @ Chonkie"
Date: Mon, 29 Jun 2026 21:59:06 -0700
Subject: [PATCH 3/3] README: formal bullets, qualify 20x as L4-specific, drop em dashes

- Convert emoji feature list to plain markdown bullets
- Qualify the 20x faster/cheaper claim as 'on an L4 GPU' (it's 20.1x on L4, 7.1x on A100 per the blog) so it doesn't read as universal
- Remove all em/en dashes from prose per style preference
---
 README.md | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/README.md b/README.md
index 73b1957..f53c94d 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@
 [![Blog](https://img.shields.io/badge/blog-read%20the%20writeup-E34C26.svg)](https://usefeyn.com/blog/pulpie-pareto-optimal-models-for-cleaning-the-web/)
 [![GitHub stars](https://img.shields.io/github/stars/chonkie-inc/pulpie.svg)](https://github.com/chonkie-inc/pulpie/stargazers)
-_Pareto-optimal models for cleaning the web — extract main content from HTML at one twentieth the cost._
+_Pareto-optimal models for cleaning the web. Extract main content from HTML at one twentieth the cost._
 [Install](#installation) •
 [Usage](#usage) •
@@ -23,14 +23,14 @@ _Pareto-optimal models for cleaning the web — extract main content from HTML a
-Pulpie extracts the main content from raw HTML — stripping navigation, ads, sidebars, and footers — using small encoder models that label every block in a single forward pass. It approaches state-of-the-art extraction quality while running up to **20x faster** and **20x cheaper** than autoregressive extractors.
+Pulpie extracts the main content from raw HTML, stripping navigation, ads, sidebars, and footers. It uses small encoder models that label every block in a single forward pass, approaching state-of-the-art extraction quality while running up to 20x faster and 20x cheaper than autoregressive extractors on an L4 GPU.
-**⚡ Fast** — an encoder labels every block in one forward pass (13.7 pages/sec on an L4)
-**🎯 Accurate** — matches SOTA quality: 0.862–0.873 ROUGE-5 F1 on WebMainBench
-**🪶 Small** — the recommended model is 210M params, fits on any GPU
-**💸 Cheap** — clean 1 billion pages for ~$7,900 vs ~$159,000 for the leading decoder
-**📦 Simple** — `pip install pulpie`, then `Extractor().extract(html)`
-**🔌 Batched** — overlapped CPU+GPU pipeline scales across multiple GPUs
+- **Fast.** An encoder labels every block in one forward pass (13.7 pages/sec on an L4).
+- **Accurate.** Matches state-of-the-art quality: 0.862 to 0.873 ROUGE-5 F1 on WebMainBench.
+- **Small.** The recommended model is 210M parameters and fits on any GPU.
+- **Cheap.** Clean 1 billion pages for ~$7,900, versus ~$159,000 for the leading decoder.
+- **Simple.** Run `pip install pulpie`, then `Extractor().extract(html)`.
+- **Batched.** An overlapped CPU and GPU pipeline scales across multiple GPUs.
 ## Installation
@@ -94,7 +94,7 @@ All three models are built on [EuroBERT](https://arxiv.org/abs/2503.05500), shar
 | Model | Hugging Face | Params | ROUGE-5 F1 | Notes |
 |-------|--------------|--------|------------|-------|
-| **Orange Small** | [`chonkie-ai/pulpie-orange-small`](https://huggingface.co/chonkie-ai/pulpie-orange-small) | 210M | 0.862 | **Recommended** — best size-to-quality ratio |
+| **Orange Small** | [`chonkie-ai/pulpie-orange-small`](https://huggingface.co/chonkie-ai/pulpie-orange-small) | 210M | 0.862 | **Recommended**, best size-to-quality ratio |
 | Orange Base | [`chonkie-ai/pulpie-orange-base`](https://huggingface.co/chonkie-ai/pulpie-orange-base) | 610M | 0.863 | Distilled from Large |
 | Orange Large | [`chonkie-ai/pulpie-orange-large`](https://huggingface.co/chonkie-ai/pulpie-orange-large) | 2.1B | 0.873 | Teacher (highest quality) |
@@ -104,12 +104,12 @@ All three models are built on [EuroBERT](https://arxiv.org/abs/2503.05500), shar
 Pulpie keeps the "read the page" approach of model-based extractors but moves the bottleneck from memory bandwidth to compute by using an encoder instead of a decoder. The pipeline runs in four stages:
-1. **Simplify** — remove scripts, styles, and formatting noise; tag each content block with a unique ID.
-2. **Chunk** — split, tokenize, and pack blocks into chunks of up to 8,192 tokens (≈80% of pages fit in one chunk).
-3. **Classify** — a single encoder forward pass labels every block as content or boilerplate.
-4. **Reconstruct** — return the kept blocks as HTML, or convert them to Markdown.
+1. **Simplify.** Remove scripts, styles, and formatting noise; tag each content block with a unique ID.
+2. **Chunk.** Split, tokenize, and pack blocks into chunks of up to 8,192 tokens (≈80% of pages fit in one chunk).
+3. **Classify.** A single encoder forward pass labels every block as content or boilerplate.
+4. **Reconstruct.** Return the kept blocks as HTML, or convert them to Markdown.
-A decoder emits labels one token at a time, re-reading the full model from GPU memory each step. An encoder runs one dense forward pass over the whole input — so the gap widens on bandwidth-limited GPUs (7x faster than Dripper on A100, 20x on L4).
+A decoder emits labels one token at a time, re-reading the full model from GPU memory each step. An encoder runs one dense forward pass over the whole input, so the gap widens on bandwidth-limited GPUs (7x faster than Dripper on A100, 20x on L4).
 ## Benchmarks
@@ -121,8 +121,8 @@ Quality on the English subset of [WebMainBench](https://github.com/opendatalab/W
 | Dripper | 0.6B | 0.864 | 135 |
 | **Pulpie Orange Base** | 610M | 0.863 | 36 |
 | **Pulpie Orange Small** | 210M | 0.862 | 45 |
-| magic-html | — | 0.700 | 384 |
-| Trafilatura | — | 0.619 | 16 |
+| magic-html | - | 0.700 | 384 |
+| Trafilatura | - | 0.619 | 16 |
 Speed and cost (Pulpie Orange Small vs Dripper, 1 billion pages):