GitHub - feyninc/pulpie: Pareto-optimal models for cleaning the web — fast, encoder-based main-content extraction from HTML.

Pareto-optimal models for cleaning the web. Extract main content from HTML at one twentieth the cost.


Pulpie extracts the main content from raw HTML, stripping navigation, ads, sidebars, and footers. It uses small encoder models that label every block in a single forward pass, reaching state-of-the-art-level extraction quality while running up to 20x faster, at about one twentieth the cost, than autoregressive extractors on an L4 GPU.

Fast. An encoder labels every block in one forward pass (13.7 pages/sec on an L4).
Accurate. Matches state-of-the-art quality: 0.862 to 0.873 ROUGE-5 F1 on WebMainBench.
Small. The recommended model is 210M parameters and fits on any GPU.
Cheap. Clean 1 billion pages for ~$7,900, versus ~$159,000 for the leading decoder.
Simple. Run pip install pulpie, then Extractor().extract(html).
Batched. An overlapped CPU and GPU pipeline scales across multiple GPUs.

Installation

pip install pulpie

For Markdown output, install the markdown extra:

pip install "pulpie[markdown]"

Or with uv:

uv pip install "pulpie[markdown]"

Usage

Basic

from pulpie import Extractor

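# Any raw HTML string works here; this tiny page is only a placeholder.
html = "<html><body><nav>Menu</nav><article><p>Hello.</p></article></body></html>"

extractor = Extractor()                # defaults to pulpie-orange-small (210M)
result = extractor.extract(html)

print(result.markdown)                 # clean Markdown (needs the markdown extra)
print(result.html)                     # clean HTML
print(result.n_main, result.n_other)   # blocks kept vs dropped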

The model downloads from Hugging Face on first use.

Choosing a model

extractor = Extractor(model="orange-large")   # "orange-small" (default), "orange-base", "orange-large"
extractor = Extractor(model="path/to/model")  # or a custom checkpoint
extractor = Extractor(device="cpu")           # force CPU

Batch processing

For bulk extraction, Pipeline overlaps CPU preprocessing with GPU inference and self-balances across one or more GPUs:

from pulpie import Pipeline, PageInput

# pages: a list of raw HTML strings, one per page
pipeline = Pipeline(model="orange-small")
results = pipeline.extract_batch(
    [PageInput(html=h, page_id=i) for i, h in enumerate(pages)]
)
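
The idea of overlapping CPU preprocessing with GPU inference can be illustrated with a generic producer/consumer sketch. This is not Pulpie's internal implementation: `preprocess` and `infer_batch` are hypothetical stand-ins for CPU-side HTML simplification and a GPU forward pass, and the queue size and batch size are arbitrary.

```python
import queue
import threading

def preprocess(html):
    # Hypothetical stand-in for CPU-side work (simplify + tokenize).
    return html.strip().lower()

def infer_batch(batch):
    # Hypothetical stand-in for one GPU forward pass over a batch.
    return [len(item) for item in batch]

def run_overlapped(pages, batch_size=2, max_pending=8):
    """Preprocess on a worker thread while the calling thread runs inference."""
    pending = queue.Queue(maxsize=max_pending)  # bounded, so the CPU cannot run far ahead
    done = object()                             # sentinel marking the end of input

    def producer():
        for page in pages:
            pending.put(preprocess(page))
        pending.put(done)

    threading.Thread(target=producer, daemon=True).start()

    results, batch = [], []
    while True:
        item = pending.get()
        if item is done:
            break
        batch.append(item)
        if len(batch) == batch_size:
            results.extend(infer_batch(batch))
            batch = []
    if batch:  # flush the final partial batch
        results.extend(infer_batch(batch))
    return results
```

The bounded queue is the design choice that matters: it lets preprocessing stay a few pages ahead of inference without buffering the whole corpus in memory.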

Models

All three models are built on EuroBERT, share a tokenizer, and use the same <|sep|> block-marker architecture. Large is the teacher; Base and Small are distilled from it.
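
A minimal sketch of how a block-marker layout can yield one label per block. It assumes each block's text is followed by one `<|sep|>` marker and that the model emits one label per marker (1 for main content, 0 for boilerplate). The README does not spell out the exact layout or label encoding, so `mark_blocks` and `keep_main` are hypothetical helpers, not part of the pulpie API.

```python
SEP = "<|sep|>"

def mark_blocks(blocks):
    # Assumed layout: every block is followed by exactly one marker.
    return "".join(text + SEP for text in blocks)

def keep_main(blocks, marker_labels):
    # One label per marker, in block order (assumed: 1 = main, 0 = boilerplate).
    if len(marker_labels) != len(blocks):
        raise ValueError("expected exactly one label per block")
    return [text for text, label in zip(blocks, marker_labels) if label == 1]

blocks = ["Home | About", "The article body.", "Copyright 2026"]
marked = mark_blocks(blocks)
assert marked.count(SEP) == len(blocks)  # one marker, hence one label, per block
print(keep_main(blocks, [0, 1, 0]))
```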

Model	Hugging Face	Params	ROUGE-5 F1	Notes
Orange Small	`feyninc/pulpie-orange-small`	210M	0.862	Recommended, best size-to-quality ratio
Orange Base	`feyninc/pulpie-orange-base`	610M	0.863	Distilled from Large
Orange Large	`feyninc/pulpie-orange-large`	2.1B	0.873	Teacher (highest quality)

orange-small is the default. At roughly a third the size of Dripper (0.6B parameters, the leading extractor), it comes within 0.002 ROUGE-5 F1 of Dripper's quality (0.862 vs 0.864) while running about 20x faster on an L4.

How it works

Pulpie keeps the "read the page" approach of model-based extractors but moves the bottleneck from memory bandwidth to compute by using an encoder instead of a decoder. The pipeline runs in four stages:

Simplify. Remove scripts, styles, and formatting noise; tag each content block with a unique ID.
Chunk. Split, tokenize, and pack blocks into chunks of up to 8,192 tokens (≈80% of pages fit in one chunk).
Classify. A single encoder forward pass labels every block as content or boilerplate.
Reconstruct. Return the kept blocks as HTML, or convert them to Markdown.
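
The Chunk stage can be pictured as greedy packing under a token budget. The sketch below is an assumption about how such packing might look, not Pulpie's actual chunker: `pack_blocks` is a hypothetical helper that takes per-block token counts and ignores details such as marker tokens or splitting oversized blocks.

```python
MAX_TOKENS = 8192  # chunk budget stated in the Chunk stage above

def pack_blocks(block_token_counts, max_tokens=MAX_TOKENS):
    """Greedily pack consecutive blocks into chunks of at most max_tokens.

    Returns a list of chunks, each a list of block indices. A block larger
    than the budget ends up alone in its chunk; real code would need to
    split or truncate it.
    """
    chunks, current, used = [], [], 0
    for index, count in enumerate(block_token_counts):
        if current and used + count > max_tokens:
            chunks.append(current)  # close the chunk before it overflows
            current, used = [], 0
        current.append(index)
        used += count
    if current:
        chunks.append(current)
    return chunks

print(pack_blocks([4000, 4000, 500, 8192, 10]))
```

Keeping blocks in document order means the kept blocks can be concatenated directly in the Reconstruct stage.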

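A decoder emits labels one token at a time and re-reads the full model weights from GPU memory at every step, so its speed is bound by memory bandwidth. An encoder runs one dense forward pass over the whole input, which is bound by compute instead. The gap therefore widens on bandwidth-limited GPUs: Pulpie is 7x faster than Dripper on an A100 and 20x faster on an L4.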
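
A rough back-of-the-envelope calculation shows why decoding is tied to memory bandwidth. It assumes 16-bit weights and approximate datasheet bandwidths (about 300 GB/s for an L4, about 1,555 GB/s for a 40 GB A100); none of these figures come from the README, and the result is only a lower bound per generated token, not a reproduction of the measured throughputs.

```python
def min_decode_step_ms(params, bytes_per_param, bandwidth_gb_s):
    """Lower bound on one decode step: time to stream every weight once."""
    model_bytes = params * bytes_per_param
    return model_bytes / (bandwidth_gb_s * 1e9) * 1e3

# A 0.6B-parameter decoder with 16-bit weights (assumed precision).
l4_ms = min_decode_step_ms(0.6e9, 2, 300)      # assumed L4 bandwidth
a100_ms = min_decode_step_ms(0.6e9, 2, 1555)   # assumed A100 40 GB bandwidth

print(f"L4:   at least {l4_ms:.2f} ms per generated token")
print(f"A100: at least {a100_ms:.2f} ms per generated token")
```

Under these assumptions the per-token floor is several times higher on the L4, which points in the same direction as the larger speedup reported there.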

Benchmarks

Quality on the English subset of WebMainBench (6,647 pages), ROUGE-5 F1:

Method	Params	ROUGE-5 F1	Empty pages
Pulpie Orange Large	2.1B	0.873	21
Dripper	0.6B	0.864	135
Pulpie Orange Base	610M	0.863	36
Pulpie Orange Small	210M	0.862	45
magic-html	-	0.700	384
Trafilatura	-	0.619	16

Speed and cost (Pulpie Orange Small vs Dripper, 1 billion pages):

Metric	Pulpie Orange Small	Dripper
Throughput (L4)	13.7 pages/sec	0.68 pages/sec
Cost / 1B pages (L4)	~$7,900	~$159,000

Pulpie Orange Small matches Dripper's quality at 20x the throughput and 20x lower cost on an L4. See BENCHMARKS.md for the full comparison, per-difficulty breakdown, and reproduction command.
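
The throughput and cost figures above can be cross-checked with simple arithmetic. The sketch below derives GPU-hours for 1 billion pages from the stated pages/sec, and the hourly L4 price that the stated costs imply. That hourly price is inferred here, not stated in the README, and a single GPU running continuously is assumed.

```python
PAGES = 1_000_000_000

def gpu_hours(pages, pages_per_sec):
    # Total wall-clock hours on one GPU at a constant throughput.
    return pages / pages_per_sec / 3600

pulpie_hours = gpu_hours(PAGES, 13.7)
dripper_hours = gpu_hours(PAGES, 0.68)

# Hourly price implied by the stated totals (derived, not from the README).
pulpie_rate = 7_900 / pulpie_hours
dripper_rate = 159_000 / dripper_hours

print(f"Pulpie:  {pulpie_hours:,.0f} GPU-hours, ${pulpie_rate:.2f}/hour implied")
print(f"Dripper: {dripper_hours:,.0f} GPU-hours, ${dripper_rate:.2f}/hour implied")
print(f"Throughput ratio: {13.7 / 0.68:.1f}x")
```

Both rows imply roughly the same hourly price, so the 20x cost difference follows directly from the 20x throughput difference.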

Acknowledgements

Pulpie builds directly on the work of the MinerU-HTML and Dripper team (Ma et al., 2025). Their simplify_html preprocessing, block-level annotation scheme, and the WebMainBench benchmark are foundational to this work. We also use their Dripper 0.6B model to cross-validate our training labels. We're grateful they released their tools and data.

Citation

If you use Pulpie in your research, please cite:

@misc{pulpie2026,
  title        = {Pulpie: Pareto-Optimal Models for Cleaning the Web},
  author       = {Minhas, Bhavnick and Nigam, Shreyash and {Feyn Research}},
  year         = {2026},
  howpublished = {Feyn Field Notes}
}

Built by Feyn.
