Conversation
- Convert emoji feature list to plain markdown bullets - Qualify the 20x faster/cheaper claim as 'on an L4 GPU' (it's 20.1x on L4, 7.1x on A100 per the blog) so it doesn't read as universal - Remove all em/en dashes from prose per style preference
Add default_device() (cuda -> mps -> cpu) and use it in Extractor and Pipeline when device is unspecified. Previously device=None fell back straight to CPU on Macs even when MPS was available, silently leaving Apple acceleration unused.
…_id docs - Add pulpie/markdown.py with shared to_markdown() that strips tracking-pixel / spacer <img> tags (1x1, trans/blank/spacer, data-URI) before html2text. Replaces the duplicated html2text block in extractor + pipeline. - Pipeline now propagates per-page errors to PageResult.error (simplify failures and dropped pages were previously indistinguishable from empty pages). - Wire Pipeline(max_batch_tokens=...) through to inference (was silently hardcoded to 16384 in _infer_and_push). - Document page_id as an opaque echo (ordering is positional) and that error is set on failure.
- extract_batch(htmls): convenience list-in/list-out API that packs chunks across pages into shared forward passes. Verified on GPU box: 0/13 label mismatches vs per-page extract. Documented as a convenience (not faster than a loop with this model's eager attention; use Pipeline for scale throughput). - _classify stays simple/sequential (memory-safe); shared _chunk helper. - ExtractionResult is now a @DataClass (consistent with PageInput/PageResult), keeping the compact custom __repr__. Verified on gym-pro6000 (CUDA): correct across all 13 fixtures; 65 parity tests pass.
There was a problem hiding this comment.
Pull request overview
This PR focuses on developer-experience improvements across the extraction APIs by consolidating HTML→Markdown conversion, improving device auto-selection (incl. Apple MPS), surfacing per-page errors in Pipeline results, and adding a batched convenience API to Extractor.
Changes:
- Add shared
to_markdown()(with spacer/tracking-pixel<img>stripping) and use it in bothExtractorandPipeline. - Improve runtime behavior: propagate per-page failures via
PageResult.error, and wiremax_batch_tokensthrough the GPU batching path. - Add
Extractor.extract_batch(htmls)and convertExtractionResultto a@dataclasswhile keeping the compact__repr__.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/pulpie/pipeline.py | Uses shared markdown conversion; propagates page-level errors; respects max_batch_tokens in _infer_and_push; adds error field to PageResult. |
| src/pulpie/model_utils.py | Adds default_device() (cuda → mps → cpu) for consistent device selection. |
| src/pulpie/markdown.py | New shared HTML→Markdown conversion with spacer-image stripping and optional html2text dependency. |
| src/pulpie/extractor.py | Uses shared markdown conversion; adds extract_batch; adds max_batch_tokens; converts ExtractionResult to @dataclass. |
| README.md | Markdown/formatting cleanup and copy edits. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+176
to
+179
| attention_mask = torch.zeros((len(batch), max_len), dtype=torch.long, device=self.device) | ||
| for row, (_page_idx, chunk_ids, _bi) in enumerate(batch): | ||
| input_ids[row, : len(chunk_ids)] = torch.tensor(chunk_ids, dtype=torch.long) | ||
| attention_mask[row, : len(chunk_ids)] = 1 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A batch of developer-experience fixes, several found by actually running the library on Apple MPS and the GPU box.
Quick wins
pulpie/markdown.pywith a sharedto_markdown()that strips tracking-pixel / spacer<img>tags (1x1, trans/blank/spacer, data-URI) before html2text. Dedups the html2text block that was copy-pasted in extractor + pipeline.Pipelinenow surfaces per-page failures viaPageResult.error(simplify failures and dropped pages were previously indistinguishable from genuinely-empty pages).max_batch_tokenswired through:Pipeline(max_batch_tokens=...)was silently ignored (hardcoded 16384 in_infer_and_push); now respected.page_iddocumented as an opaque echo (ordering is positional);errordocumented.Beyond the quick wins
Extractor.extract_batch(htmls): convenience list-in/list-out API. Verified on the GPU box: 0/13 label mismatches vs per-pageextract. Documented honestly as a convenience, not a speedup (with this model's eager attention, batching N sequences costs the same as N sequential passes;Pipelineremains the scale path).ExtractionResultis now a@dataclass(consistent withPageInput/PageResult), keeping the compact custom__repr__.Note: MPS auto-detection landed separately in #17. Verified on gym-pro6000 (CUDA): correct across all 13 fixtures, no OOM; 65 parity tests pass.