Pareto-optimal models for cleaning the web. Extract main content from HTML at one twentieth the cost.
Pulpie extracts the main content from raw HTML, stripping navigation, ads, sidebars, and footers. It uses small encoder models that label every block in a single forward pass, approaching state-of-the-art extraction quality while running up to 20x faster and 20x cheaper than autoregressive extractors on an L4 GPU.
- Fast. An encoder labels every block in one forward pass (13.7 pages/sec on an L4).
- Accurate. Matches state-of-the-art quality: 0.862 to 0.873 ROUGE-5 F1 on WebMainBench.
- Small. The recommended model is 210M parameters and fits on any GPU.
- Cheap. Clean 1 billion pages for ~$7,900, versus ~$159,000 for the leading decoder.
- Simple. Run `pip install pulpie`, then `Extractor().extract(html)`.
- Batched. An overlapped CPU and GPU pipeline scales across multiple GPUs.

Install from PyPI:

    pip install pulpie

For Markdown output, install the markdown extra:

    pip install "pulpie[markdown]"

Or with uv:

    uv pip install "pulpie[markdown]"

Basic usage:

    from pulpie import Extractor

    extractor = Extractor()  # defaults to pulpie-orange-small (210M)
    result = extractor.extract(html)  # html: the raw page HTML as a str
    print(result.markdown)  # clean Markdown
    print(result.html)  # clean HTML
    print(result.n_main, result.n_other)  # blocks kept vs dropped

The model downloads from Hugging Face on first use.

    extractor = Extractor(model="orange-large")  # "orange-small" (default), "orange-base", "orange-large"
    extractor = Extractor(model="path/to/model")  # or a custom checkpoint
    extractor = Extractor(device="cpu")  # force CPU

For bulk extraction, `Pipeline` overlaps CPU preprocessing with GPU inference and self-balances across one or more GPUs:

    from pulpie import Pipeline, PageInput

    pipeline = Pipeline(model="orange-small")
    # pages: a list of raw HTML strings
    results = pipeline.extract_batch(
        [PageInput(html=h, page_id=i) for i, h in enumerate(pages)]
    )

All three models are built on EuroBERT, share a tokenizer, and use the same `<|sep|>` block-marker architecture. Large is the teacher; Base and Small are distilled from it.
| Model | Hugging Face | Params | ROUGE-5 F1 | Notes |
|---|---|---|---|---|
| Orange Small | feyninc/pulpie-orange-small | 210M | 0.862 | Recommended, best size-to-quality ratio |
| Orange Base | feyninc/pulpie-orange-base | 610M | 0.863 | Distilled from Large |
| Orange Large | feyninc/pulpie-orange-large | 2.1B | 0.873 | Teacher (highest quality) |
orange-small is the default. At roughly a third the size of Dripper, the leading extractor (210M vs 0.6B parameters), it comes within 0.002 ROUGE-5 F1 of Dripper's quality (0.862 vs 0.864) while running about 20x faster on an L4.
Pulpie keeps the "read the page" approach of model-based extractors but moves the bottleneck from memory bandwidth to compute by using an encoder instead of a decoder. The pipeline runs in four stages:
- Simplify. Remove scripts, styles, and formatting noise; tag each content block with a unique ID.
- Chunk. Split, tokenize, and pack blocks into chunks of up to 8,192 tokens (≈80% of pages fit in one chunk).
- Classify. A single encoder forward pass labels every block as content or boilerplate.
- Reconstruct. Return the kept blocks as HTML, or convert them to Markdown.
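The four stages above can be sketched end to end in plain Python. This is an illustrative toy, not Pulpie's implementation: the regex-based `simplify`, the whitespace `count_tokens`, and the word-count rule inside `classify` are all stand-ins invented here (the real pipeline uses the `simplify_html` preprocessing, a real tokenizer, and an encoder model). What it is meant to show is the data flow: blocks get IDs, are packed greedily under a token budget, are labeled one chunk at a time, and the kept blocks are joined back together.

```python
import re
from dataclasses import dataclass

MAX_TOKENS = 8192  # chunk budget quoted in the Chunk step


@dataclass
class Block:
    block_id: int
    text: str


def simplify(html: str) -> list[Block]:
    """Toy simplifier: drop <script>/<style>, then treat each text run as a block."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    texts = [t.strip() for t in re.split(r"<[^>]+>", html)]
    return [Block(i, t) for i, t in enumerate(t for t in texts if t)]


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): whitespace words plus one separator marker."""
    return len(text.split()) + 1


def chunk(blocks: list[Block], max_tokens: int = MAX_TOKENS) -> list[list[Block]]:
    """Greedily pack consecutive blocks into chunks of at most max_tokens.

    A single block larger than the budget still gets its own chunk here;
    this sketch does not split oversized blocks.
    """
    chunks, current, used = [], [], 0
    for block in blocks:
        n = count_tokens(block.text)
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(block)
        used += n
    if current:
        chunks.append(current)
    return chunks


def classify(chunk_blocks: list[Block]) -> list[bool]:
    """Stand-in for the encoder: one call labels every block in the chunk.

    The rule (keep blocks with 8 or more words) is a placeholder, not a model.
    """
    return [len(b.text.split()) >= 8 for b in chunk_blocks]


def extract_main(html: str) -> str:
    """Simplify, chunk, classify, then reconstruct the kept blocks as text."""
    kept = []
    for part in chunk(simplify(html)):
        labels = classify(part)
        kept.extend(b for b, keep in zip(part, labels) if keep)
    return "\n\n".join(b.text for b in kept)
```

The important property is that `classify` is called once per chunk and returns a label for every block at once, which is what lets the real encoder handle a page that fits in one chunk with a single forward pass.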
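A decoder emits labels one token at a time and re-reads the full model weights from GPU memory at every step, so its throughput is bound by memory bandwidth. An encoder runs one dense forward pass over the whole input and is bound by compute instead. The gap therefore widens on bandwidth-limited GPUs: Pulpie is 7x faster than Dripper on an A100 and 20x faster on an L4.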
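A back-of-envelope model makes the bandwidth argument concrete. The sketch below assumes a decoder step costs at least one full read of the weights, and that a dense forward pass costs about 2 FLOPs per parameter per token. The two GPU profiles and the 200-token output length are hypothetical illustration values, not vendor specifications or measurements, and the model ignores prefill, KV-cache traffic, and batching, so it is not meant to reproduce the benchmark figures.

```python
def decode_seconds(n_params, bytes_per_param, bandwidth_bytes_per_s, n_output_tokens):
    """Lower bound for autoregressive decoding: every step streams all weights once."""
    return n_output_tokens * n_params * bytes_per_param / bandwidth_bytes_per_s


def encode_seconds(n_params, n_input_tokens, flops_per_s):
    """Rough cost of one dense forward pass: ~2 FLOPs per parameter per token."""
    return 2 * n_params * n_input_tokens / flops_per_s


def decoder_to_encoder_ratio(bandwidth_bytes_per_s, flops_per_s,
                             decoder_params=0.6e9, encoder_params=210e6,
                             bytes_per_param=2, n_input_tokens=8192,
                             n_output_tokens=200):
    """How many times longer the decoder takes than the encoder on one GPU."""
    dec = decode_seconds(decoder_params, bytes_per_param,
                         bandwidth_bytes_per_s, n_output_tokens)
    enc = encode_seconds(encoder_params, n_input_tokens, flops_per_s)
    return dec / enc


# Hypothetical GPU profiles (illustrative numbers only).
low_bandwidth_gpu = {"bandwidth_bytes_per_s": 300e9, "flops_per_s": 100e12}
high_bandwidth_gpu = {"bandwidth_bytes_per_s": 1500e9, "flops_per_s": 300e12}

ratio_low = decoder_to_encoder_ratio(**low_bandwidth_gpu)
ratio_high = decoder_to_encoder_ratio(**high_bandwidth_gpu)
print(f"decoder/encoder time ratio: low-bandwidth {ratio_low:.1f}x, "
      f"high-bandwidth {ratio_high:.1f}x")
```

Under these assumptions the ratio is larger on the GPU with less memory bandwidth per unit of compute, which is the direction of the effect described above.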
Quality on the English subset of WebMainBench (6,647 pages), ROUGE-5 F1:
| Method | Params | ROUGE-5 F1 | Empty pages |
|---|---|---|---|
| Pulpie Orange Large | 2.1B | 0.873 | 21 |
| Dripper | 0.6B | 0.864 | 135 |
| Pulpie Orange Base | 610M | 0.863 | 36 |
| Pulpie Orange Small | 210M | 0.862 | 45 |
| magic-html | - | 0.700 | 384 |
| Trafilatura | - | 0.619 | 16 |
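For readers unfamiliar with the metric, ROUGE-N F1 is the harmonic mean of n-gram precision and recall between the extracted text and the reference text, here with n = 5. The sketch below is a minimal version that assumes whitespace tokenization; the benchmark's own tokenizer and normalization may differ, so scores from this function are not directly comparable to the table.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int = 5) -> float:
    """ROUGE-N F1 with clipped n-gram counts (whitespace tokens are an assumption)."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # per-n-gram minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Because 5-grams are long, an extraction that leaves in boilerplate lowers precision, and one that drops content lowers recall.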
Speed and cost (Pulpie Orange Small vs Dripper, 1 billion pages):
| Metric | Pulpie Orange Small | Dripper |
|---|---|---|
| Throughput (L4) | 13.7 pages/sec | 0.68 pages/sec |
| Cost / 1B pages (L4) | ~$7,900 | ~$159,000 |
Pulpie Orange Small matches Dripper's quality at 20x the throughput and 20x lower cost on an L4. See BENCHMARKS.md for the full comparison, per-difficulty breakdown, and reproduction command.
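The throughput and cost rows can be cross-checked with simple arithmetic. The sketch below converts pages per second into GPU-hours for 1 billion pages and backs out the hourly price each cost figure implies. It assumes a single GPU running at the quoted sustained throughput; the cost figures are approximate in the source, so the derived rates (about $0.39 per GPU-hour for both) are only approximate too.

```python
PAGES = 1_000_000_000


def gpu_hours(pages: float, pages_per_sec: float) -> float:
    """GPU-hours needed to process `pages` at a sustained throughput."""
    return pages / pages_per_sec / 3600


def implied_hourly_rate(total_cost: float, pages: float, pages_per_sec: float) -> float:
    """Hourly GPU price implied by a total cost and a throughput."""
    return total_cost / gpu_hours(pages, pages_per_sec)


# Figures from the table above (L4).
pulpie_pps, pulpie_cost = 13.7, 7_900
dripper_pps, dripper_cost = 0.68, 159_000

speedup = pulpie_pps / dripper_pps
pulpie_rate = implied_hourly_rate(pulpie_cost, PAGES, pulpie_pps)
dripper_rate = implied_hourly_rate(dripper_cost, PAGES, dripper_pps)

print(f"speedup: {speedup:.1f}x")
print(f"Pulpie: {gpu_hours(PAGES, pulpie_pps):,.0f} GPU-hours at ${pulpie_rate:.2f}/h")
print(f"Dripper: {gpu_hours(PAGES, dripper_pps):,.0f} GPU-hours at ${dripper_rate:.2f}/h")
```

Both cost figures imply the same hourly price, so the 20x cost gap follows directly from the 20x throughput gap.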
Pulpie builds directly on the work of the MinerU-HTML and Dripper team (Ma et al., 2025). Their simplify_html preprocessing, block-level annotation scheme, and the WebMainBench benchmark are foundational to this work. We also use their Dripper 0.6B model to cross-validate our training labels. We're grateful they released their tools and data.
If you use Pulpie in your research, please cite:

    @note{pulpie2026,
      title = {Pulpie: Pareto-Optimal Models for Cleaning the Web},
      author = {Minhas, Bhavnick and Nigam, Shreyash and Feyn Research},
      year = {2026},
      venue = {Feyn Field Notes}
    }