DX fixes: markdown cleanup, error propagation, extract_batch, dataclass by chonknick · Pull Request #18 · feyninc/pulpie

chonknick · 2026-06-30T06:44:24Z

A batch of developer-experience fixes, several found by actually running the library on Apple MPS and the GPU box.

Quick wins

- Markdown cleanup: new pulpie/markdown.py with a shared to_markdown() that strips tracking-pixel / spacer <img> tags (1x1, trans/blank/spacer, data-URI) before html2text. Deduplicates the html2text block that was copy-pasted into extractor + pipeline.
- Error propagation: Pipeline now surfaces per-page failures via PageResult.error (simplify failures and dropped pages were previously indistinguishable from genuinely empty pages).
- max_batch_tokens wired through: Pipeline(max_batch_tokens=...) was silently ignored (hardcoded 16384 in _infer_and_push); now respected.
- Docs: page_id is documented as an opaque echo (ordering is positional), and error is documented as set on failure.

Beyond the quick wins

- Extractor.extract_batch(htmls): convenience list-in/list-out API. Verified on the GPU box: 0/13 label mismatches vs per-page extract. Documented honestly as a convenience, not a speedup (with this model's eager attention, batching N sequences costs the same as N sequential passes; Pipeline remains the scale path).
- ExtractionResult is now a @dataclass (consistent with PageInput/PageResult), keeping the compact custom __repr__.
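
As an illustration of the dataclass change, the sketch below shows how a `@dataclass` can keep a hand-written compact `__repr__` while still getting the generated `__init__` and `__eq__`. The class name `ExtractionResultSketch` and its fields are hypothetical; the real ExtractionResult fields are not shown in this PR description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(repr=False)  # skip the generated __repr__; the class supplies its own
class ExtractionResultSketch:
    # Field names are assumptions for illustration only.
    markdown: str = ""
    labels: List[int] = field(default_factory=list)

    def __repr__(self) -> str:
        # Compact form: sizes plus a short one-line preview instead of the full text.
        preview = self.markdown[:30].replace("\n", " ")
        return (
            f"ExtractionResultSketch(chars={len(self.markdown)}, "
            f"labels={len(self.labels)}, preview={preview!r})"
        )
```

Passing `repr=False` makes the intent explicit, although a `__repr__` defined in the class body is also left untouched by the default `@dataclass` settings.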

Note: MPS auto-detection landed separately in #17. Verified on gym-pro6000 (CUDA): correct across all 13 fixtures, no OOM; 65 parity tests pass.

- Convert emoji feature list to plain markdown bullets
- Qualify the 20x faster/cheaper claim as 'on an L4 GPU' (it's 20.1x on L4, 7.1x on A100 per the blog) so it doesn't read as universal
- Remove all em/en dashes from prose per style preference

Add default_device() (cuda -> mps -> cpu) and use it in Extractor and Pipeline when device is unspecified. Previously device=None fell back straight to CPU on Macs even when MPS was available, silently leaving Apple acceleration unused.
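
The fallback order described above can be sketched as follows. This is an assumed shape, not the actual `default_device()` in src/pulpie/model_utils.py: `pick_device` and `default_device_sketch` are hypothetical names, and the sketch returns a plain string so it also runs where PyTorch is not installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Apply the cuda -> mps -> cpu preference order to two availability flags."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def default_device_sketch() -> str:
    """Probe PyTorch for accelerators; fall back to "cpu" if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older torch builds may lack the mps backend module, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = mps is not None and mps.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)
```

Keeping the preference order in a pure function makes the policy easy to test without any hardware present.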

…_id docs

- Add pulpie/markdown.py with shared to_markdown() that strips tracking-pixel / spacer <img> tags (1x1, trans/blank/spacer, data-URI) before html2text. Replaces the duplicated html2text block in extractor + pipeline.
- Pipeline now propagates per-page errors to PageResult.error (simplify failures and dropped pages were previously indistinguishable from empty pages).
- Wire Pipeline(max_batch_tokens=...) through to inference (was silently hardcoded to 16384 in _infer_and_push).
- Document page_id as an opaque echo (ordering is positional) and that error is set on failure.
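
A small sketch of the error-propagation idea: each page yields one result in input order, and a failure is recorded on the result instead of collapsing into an empty page. `PageResultSketch`, `run_pages`, and `toy_simplify` are hypothetical stand-ins; only the `page_id` and `error` field names come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class PageResultSketch:
    page_id: Optional[str]       # echoed back as-is; never used for ordering
    markdown: str = ""
    error: Optional[str] = None  # set only when processing this page failed


def run_pages(
    pages: Iterable[Tuple[Optional[str], str]],
    simplify: Callable[[str], str],
) -> List[PageResultSketch]:
    """Return one result per input page, in input order, recording failures."""
    results = []
    for page_id, html in pages:
        try:
            results.append(PageResultSketch(page_id, markdown=simplify(html)))
        except Exception as exc:  # keep going; the caller inspects .error
            results.append(PageResultSketch(page_id, error=f"{type(exc).__name__}: {exc}"))
    return results


def toy_simplify(html: str) -> str:
    """Stand-in for the real simplify step: rejects binary-looking input."""
    if html.startswith("\x00"):
        raise ValueError("binary payload")
    return html.strip()
```

With this shape, a caller can tell a genuinely empty page (`markdown == ""`, `error is None`) from a failed one (`error` set), and repeated or missing `page_id` values are harmless because results are matched to inputs by position.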

- extract_batch(htmls): convenience list-in/list-out API that packs chunks across pages into shared forward passes. Verified on GPU box: 0/13 label mismatches vs per-page extract. Documented as a convenience (not faster than a loop with this model's eager attention; use Pipeline for scale throughput).
- _classify stays simple/sequential (memory-safe); shared _chunk helper.
- ExtractionResult is now a @dataclass (consistent with PageInput/PageResult), keeping the compact custom __repr__.

Verified on gym-pro6000 (CUDA): correct across all 13 fixtures; 65 parity tests pass.
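
One way to picture packing chunks from several pages into shared forward passes under a token budget is the greedy sketch below. The PR does not show the actual packing rule, so the cost model here (rows times the longest row after padding) and the name `pack_chunks` are assumptions.

```python
from typing import List, Sequence, Tuple

Chunk = Tuple[int, Sequence[int]]  # (page index, token ids)


def pack_chunks(chunks: List[Chunk], max_batch_tokens: int) -> List[List[Chunk]]:
    """Greedily group chunks so that rows * longest_row stays within the budget."""
    batches: List[List[Chunk]] = []
    current: List[Chunk] = []
    longest = 0
    for chunk in chunks:
        new_longest = max(longest, len(chunk[1]))
        # Padded cost if this chunk joined the current batch.
        if current and (len(current) + 1) * new_longest > max_batch_tokens:
            batches.append(current)
            current, new_longest = [], len(chunk[1])
        current.append(chunk)
        longest = new_longest
    if current:
        batches.append(current)
    return batches
```

Each chunk carries its page index, so per-page results can be reassembled after inference even when one batch mixes pages. A chunk that exceeds the budget on its own still gets a batch to itself in this sketch.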

Copilot

Pull request overview

This PR focuses on developer-experience improvements across the extraction APIs by consolidating HTML→Markdown conversion, improving device auto-selection (incl. Apple MPS), surfacing per-page errors in Pipeline results, and adding a batched convenience API to Extractor.

Changes:

- Add shared to_markdown() (with spacer/tracking-pixel <img> stripping) and use it in both Extractor and Pipeline.
- Improve runtime behavior: propagate per-page failures via PageResult.error, and wire max_batch_tokens through the GPU batching path.
- Add Extractor.extract_batch(htmls) and convert ExtractionResult to a @dataclass while keeping the compact __repr__.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/pulpie/pipeline.py | Uses shared markdown conversion; propagates page-level errors; respects `max_batch_tokens` in `_infer_and_push`; adds `error` field to `PageResult`. |
| src/pulpie/model_utils.py | Adds `default_device()` (cuda → mps → cpu) for consistent device selection. |
| src/pulpie/markdown.py | New shared HTML→Markdown conversion with spacer-image stripping and optional `html2text` dependency. |
| src/pulpie/extractor.py | Uses shared markdown conversion; adds `extract_batch`; adds `max_batch_tokens`; converts `ExtractionResult` to `@dataclass`. |
| README.md | Markdown/formatting cleanup and copy edits. |

+        attention_mask = torch.zeros((len(batch), max_len), dtype=torch.long, device=self.device)
+        for row, (_page_idx, chunk_ids, _bi) in enumerate(batch):
+            input_ids[row, : len(chunk_ids)] = torch.tensor(chunk_ids, dtype=torch.long)
+            attention_mask[row, : len(chunk_ids)] = 1
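
The hunk above right-pads each chunk into a shared tensor and marks real tokens with 1 in the attention mask. A dependency-free sketch of the same layout, using plain lists instead of torch tensors, is shown here; `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption because the hunk does not show how `input_ids` is initialized.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    seqs: Sequence[Sequence[int]], pad_id: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad token id sequences to a common length and build a 0/1 mask."""
    max_len = max((len(s) for s in seqs), default=0)
    # Real tokens first, then padding up to the longest sequence in the batch.
    input_ids = [list(s) + [pad_id] * (max_len - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return input_ids, attention_mask
```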


chonknick added 6 commits June 29, 2026 21:54

README footer: built by Feyn

b1d55ad

README: static 'python 3.9+' badge instead of per-version list

7817f99

Auto-detect Apple MPS device

b2c6bd3

Copilot AI review requested due to automatic review settings June 30, 2026 06:44

Copilot started reviewing on behalf of chonknick June 30, 2026 06:44

chonknick merged commit 70a2174 into main Jun 30, 2026
8 checks passed

chonknick deleted the dx-fixes branch June 30, 2026 06:45

Copilot AI reviewed Jun 30, 2026

chonknick merged 6 commits into main from dx-fixes
