DX fixes: markdown cleanup, error propagation, extract_batch, dataclass #18

Merged: chonknick merged 6 commits into main from dx-fixes on Jun 30, 2026
@chonknick

Contributor

A batch of developer-experience fixes, several found by actually running the library on Apple MPS and the GPU box.

Quick wins

  • Markdown cleanup: new pulpie/markdown.py with a shared to_markdown() that strips tracking-pixel / spacer <img> tags (1x1, trans/blank/spacer, data-URI) before html2text. Deduplicates the html2text block that was copy-pasted into both extractor and pipeline.
  • Error propagation: Pipeline now surfaces per-page failures via PageResult.error (simplify failures and dropped pages were previously indistinguishable from genuinely-empty pages).
  • max_batch_tokens wired through: Pipeline(max_batch_tokens=...) was silently ignored (hardcoded 16384 in _infer_and_push); now respected.
  • Docs: page_id is documented as an opaque echo (ordering is positional), and error is documented as being set on failure.
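
The tracking-pixel filter described in the first bullet can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the code in pulpie/markdown.py: the name strip_spacer_images and the exact regex heuristics are hypothetical, and a substring match on trans/blank/spacer may also catch legitimate images.

```python
import re

# Hypothetical sketch: heuristics for spotting spacer / tracking-pixel images.
_IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
_WIDTH_1 = re.compile(r"""\bwidth\s*=\s*["']?1(?:px)?["']?(?=[\s/>])""", re.IGNORECASE)
_HEIGHT_1 = re.compile(r"""\bheight\s*=\s*["']?1(?:px)?["']?(?=[\s/>])""", re.IGNORECASE)
_SPACER_SRC = re.compile(
    r"""\bsrc\s*=\s*["']?[^"'\s>]*(?:trans|blank|spacer)[^"'\s>]*""", re.IGNORECASE
)
_DATA_URI = re.compile(r"""\bsrc\s*=\s*["']?data:""", re.IGNORECASE)


def _is_spacer(tag: str) -> bool:
    """Return True when an <img> tag looks like a 1x1, spacer, or data-URI image."""
    one_by_one = bool(_WIDTH_1.search(tag)) and bool(_HEIGHT_1.search(tag))
    return one_by_one or bool(_SPACER_SRC.search(tag)) or bool(_DATA_URI.search(tag))


def strip_spacer_images(html: str) -> str:
    """Drop spacer-like <img> tags and leave every other tag untouched."""
    return _IMG_TAG.sub(lambda m: "" if _is_spacer(m.group(0)) else m.group(0), html)
```

Running a filter like this before html2text keeps empty image references out of the generated Markdown.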

Beyond the quick wins

  • Extractor.extract_batch(htmls): convenience list-in/list-out API. Verified on the GPU box: 0/13 label mismatches vs per-page extract. Documented honestly as a convenience, not a speedup (with this model's eager attention, batching N sequences costs the same as N sequential passes; Pipeline remains the scale path).
  • ExtractionResult is now a @dataclass (consistent with PageInput/PageResult), keeping the compact custom __repr__.
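
As a rough illustration of the dataclass change, a dataclass can keep a hand-written compact __repr__ while still getting the generated __init__ and __eq__. The class name and fields below are assumptions for the sketch, not pulpie's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-in for ExtractionResult; field names are illustrative only.
@dataclass(repr=False)  # skip the generated repr so the compact one below is used
class ExtractionResultSketch:
    markdown: str = ""
    labels: List[str] = field(default_factory=list)

    def __repr__(self) -> str:
        # Compact summary instead of dumping the full Markdown body.
        return (
            f"ExtractionResultSketch(chars={len(self.markdown)}, "
            f"labels={len(self.labels)})"
        )
```

Passing repr=False is just explicit here: dataclasses leave an explicitly defined __repr__ in place either way.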

Note: MPS auto-detection landed separately in #17. Verified on gym-pro6000 (CUDA): correct across all 13 fixtures, no OOM; 65 parity tests pass.

- Convert emoji feature list to plain markdown bullets
- Qualify the 20x faster/cheaper claim as 'on an L4 GPU' (it's 20.1x on L4,
  7.1x on A100 per the blog) so it doesn't read as universal
- Remove all em/en dashes from prose per style preference

Add default_device() (cuda -> mps -> cpu) and use it in Extractor and Pipeline
when device is unspecified. Previously device=None fell back straight to CPU on
Macs even when MPS was available, silently leaving Apple acceleration unused.
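
A minimal sketch of the cuda -> mps -> cpu fallback order, assuming PyTorch's standard availability checks. The helper pick_device is hypothetical and exists only so the ordering can be exercised without a GPU; the real default_device() in pulpie may differ.

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Apply the preference order cuda -> mps -> cpu to two availability flags."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def default_device() -> str:
    """Probe PyTorch for available backends; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)
```
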
…_id docs

- Add pulpie/markdown.py with shared to_markdown() that strips tracking-pixel /
  spacer <img> tags (1x1, trans/blank/spacer, data-URI) before html2text.
  Replaces the duplicated html2text block in extractor + pipeline.
- Pipeline now propagates per-page errors to PageResult.error (simplify
  failures and dropped pages were previously indistinguishable from empty pages).
- Wire Pipeline(max_batch_tokens=...) through to inference (was silently
  hardcoded to 16384 in _infer_and_push).
- Document page_id as an opaque echo (ordering is positional) and that error is
  set on failure.
- extract_batch(htmls): convenience list-in/list-out API that packs chunks
  across pages into shared forward passes. Verified on GPU box: 0/13 label
  mismatches vs per-page extract. Documented as a convenience (not faster than
  a loop with this model's eager attention; use Pipeline for scale throughput).
- _classify stays simple/sequential (memory-safe); shared _chunk helper.
- ExtractionResult is now a @dataclass (consistent with PageInput/PageResult),
  keeping the compact custom __repr__.

Verified on gym-pro6000 (CUDA): correct across all 13 fixtures; 65 parity tests pass.
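
The error-propagation item above can be illustrated as follows: each input yields exactly one result in the same position, and a failure is recorded in an error field instead of looking like an empty page. PageResultSketch, run_pages, and toy_convert are hypothetical names for this sketch; only page_id and error come from the PR text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


# Hypothetical stand-in for PageResult; only page_id and error are named in the PR.
@dataclass
class PageResultSketch:
    page_id: str                 # opaque echo of the caller's id, never used for ordering
    markdown: str = ""
    error: Optional[str] = None  # set only when processing this page failed


def run_pages(
    pages: List[Tuple[str, str]], convert: Callable[[str], str]
) -> List[PageResultSketch]:
    """Return one result per input page, in input order, recording failures."""
    results = []
    for page_id, html in pages:
        try:
            results.append(PageResultSketch(page_id, convert(html)))
        except Exception as exc:
            results.append(PageResultSketch(page_id, error=f"{type(exc).__name__}: {exc}"))
    return results


def toy_convert(html: str) -> str:
    """Toy converter: rejects non-HTML input, otherwise strips <p> tags."""
    if html and "<" not in html:
        raise ValueError("input does not look like HTML")
    return html.replace("<p>", "").replace("</p>", "")
```

With this shape, a caller can tell a genuinely empty page (markdown == "" and error is None) from a failed one (error is set).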
Copilot AI review requested due to automatic review settings June 30, 2026 06:44
@chonknick chonknick merged commit 70a2174 into main Jun 30, 2026
8 checks passed
@chonknick chonknick deleted the dx-fixes branch June 30, 2026 06:45

Copilot AI left a comment

Pull request overview

This PR focuses on developer-experience improvements across the extraction APIs by consolidating HTML→Markdown conversion, improving device auto-selection (incl. Apple MPS), surfacing per-page errors in Pipeline results, and adding a batched convenience API to Extractor.

Changes:

  • Add shared to_markdown() (with spacer/tracking-pixel <img> stripping) and use it in both Extractor and Pipeline.
  • Improve runtime behavior: propagate per-page failures via PageResult.error, and wire max_batch_tokens through the GPU batching path.
  • Add Extractor.extract_batch(htmls) and convert ExtractionResult to a @dataclass while keeping the compact __repr__.
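
One way a max_batch_tokens budget can bound a padded batch is sketched below. It assumes the budget applies to padded tokens (rows times the longest row), which the PR does not spell out; batch_by_token_budget is a hypothetical helper and not the logic in _infer_and_push.

```python
from typing import List


def batch_by_token_budget(
    chunk_lengths: List[int], max_batch_tokens: int = 16384
) -> List[List[int]]:
    """Greedily group chunk indices so rows * longest_row stays within the budget.

    A chunk longer than the budget still gets a batch of its own.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    longest = 0
    for idx, length in enumerate(chunk_lengths):
        new_longest = max(longest, length)
        # Padded cost if this chunk joined the current batch.
        if current and (len(current) + 1) * new_longest > max_batch_tokens:
            batches.append(current)
            current, new_longest = [], length
        current.append(idx)
        longest = new_longest
    if current:
        batches.append(current)
    return batches
```

The default of 16384 mirrors the value that was previously hardcoded; passing a smaller budget yields more, smaller batches.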

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/pulpie/pipeline.py | Uses shared markdown conversion; propagates page-level errors; respects max_batch_tokens in _infer_and_push; adds error field to PageResult. |
| src/pulpie/model_utils.py | Adds default_device() (cuda → mps → cpu) for consistent device selection. |
| src/pulpie/markdown.py | New shared HTML→Markdown conversion with spacer-image stripping and optional html2text dependency. |
| src/pulpie/extractor.py | Uses shared markdown conversion; adds extract_batch; adds max_batch_tokens; converts ExtractionResult to @dataclass. |
| README.md | Markdown/formatting cleanup and copy edits. |

Comment thread src/pulpie/extractor.py
Comment on lines +176 to +179
attention_mask = torch.zeros((len(batch), max_len), dtype=torch.long, device=self.device)
for row, (_page_idx, chunk_ids, _bi) in enumerate(batch):
    input_ids[row, : len(chunk_ids)] = torch.tensor(chunk_ids, dtype=torch.long)
    attention_mask[row, : len(chunk_ids)] = 1
2 participants