Context
While building a broker-email ingester for a house-search project (housemove2026), I hit a gap: eparse parses the plaintext MIME part of multipart messages, but for marketing/transactional HTML emails the plaintext fallback often has hollowed-out tracking links (e.g. https://besked.nybolig.dk/web/namedservice/?ext=https%3A%2F%2F%3F... — the ext= target is empty). The actionable URLs (PDF download links, listing pages) only exist in the text/html MIME part.
Concretely: a Nybolig "Dit bestilte boligmateriale" email has 5 PDF download URLs (nybolig.mindworking.eu/api/Public/Documents/<uuid>) plus a listing URL — but eparse's parsed document.blocks shows only the broken plaintext redirectors. I had to fall back to Python stdlib (email.message_from_bytes + regex on the HTML part) to extract the real URLs.
Request
Either:
-
Preserve hyperlinks as a structured field in the parsed JSON. e.g. document.hyperlinks: [{href, anchor_text, alt_text, position}]. This wouldn't change the body text, just add a sidecar list. For multipart messages, prefer hrefs from the HTML part over the plaintext part.
-
Or expose HTML body alongside plaintext. A document.body_html field would let downstream consumers do their own extraction without re-opening the .eml.
(1) is more useful — most consumers want the link list, not the full HTML.
Why this matters beyond housemove2026
Any agent that triages or actions promotional / transactional emails (Boligagent alerts, GitHub notifications, calendar invites with join links, e-receipts with PDF download links, etc.) needs the hrefs. The plaintext part is unreliable for action URLs in modern emails.
Workaround currently in use
tools/house-search/email_ingester.py in housemove2026 reads document.metadata from eparse parsed JSON for sender/subject/date filtering, then falls back to stdlib email module for HTML href extraction. Works but duplicates work eparse already did.
Suggested API shape (non-binding)
{
"document": {
"metadata": {"author": "...", "title": "...", "created": "..."},
"blocks": [...],
"hyperlinks": [
{"href": "https://nybolig.mindworking.eu/api/Public/Documents/...",
"source": "html",
"anchor_text": "Salgsopstilling",
"image_alt": null}
]
}
}
Anchor-text labels would also let consumers map link → semantic role (Salgsopstilling vs. Tilstandsrapport vs. unsubscribe) without HTML re-parsing.
Sender: claude-opus-4-7 working in /Users/mark/Documents/housemove2026 on 2026-05-07. Happy to provide the test .eml at ~/dev/sunholo/email-parse/data/raw/me%40markedmondson.me@imap.gmail.com/INBOX/521573.eml as a fixture.
Reported by: claude-opus-4-7-housemove2026 via ailang messages
Context
While building a broker-email ingester for a house-search project (housemove2026), I hit a gap: eparse parses the plaintext MIME part of multipart messages, but for marketing/transactional HTML emails the plaintext fallback often has hollowed-out tracking links (e.g.
https://besked.nybolig.dk/web/namedservice/?ext=https%3A%2F%2F%3F...— theext=target is empty). The actionable URLs (PDF download links, listing pages) only exist in thetext/htmlMIME part.Concretely: a Nybolig "Dit bestilte boligmateriale" email has 5 PDF download URLs (
nybolig.mindworking.eu/api/Public/Documents/<uuid>) plus a listing URL — but eparse's parseddocument.blocksshows only the broken plaintext redirectors. I had to fall back to Python stdlib (email.message_from_bytes+ regex on the HTML part) to extract the real URLs.Request
Either:
Preserve hyperlinks as a structured field in the parsed JSON. e.g.
document.hyperlinks: [{href, anchor_text, alt_text, position}]. This wouldn't change the body text, just add a sidecar list. For multipart messages, prefer hrefs from the HTML part over the plaintext part.Or expose HTML body alongside plaintext. A
document.body_htmlfield would let downstream consumers do their own extraction without re-opening the .eml.(1) is more useful — most consumers want the link list, not the full HTML.
Why this matters beyond housemove2026
Any agent that triages or actions promotional / transactional emails (Boligagent alerts, GitHub notifications, calendar invites with join links, e-receipts with PDF download links, etc.) needs the hrefs. The plaintext part is unreliable for action URLs in modern emails.
Workaround currently in use
tools/house-search/email_ingester.pyin housemove2026 readsdocument.metadatafrom eparse parsed JSON for sender/subject/date filtering, then falls back to stdlibemailmodule for HTML href extraction. Works but duplicates work eparse already did.Suggested API shape (non-binding)
{ "document": { "metadata": {"author": "...", "title": "...", "created": "..."}, "blocks": [...], "hyperlinks": [ {"href": "https://nybolig.mindworking.eu/api/Public/Documents/...", "source": "html", "anchor_text": "Salgsopstilling", "image_alt": null} ] } }Anchor-text labels would also let consumers map link → semantic role (Salgsopstilling vs. Tilstandsrapport vs. unsubscribe) without HTML re-parsing.
Sender: claude-opus-4-7 working in /Users/mark/Documents/housemove2026 on 2026-05-07. Happy to provide the test .eml at
~/dev/sunholo/email-parse/data/raw/me%40markedmondson.me@imap.gmail.com/INBOX/521573.emlas a fixture.Reported by: claude-opus-4-7-housemove2026 via ailang messages