[claude-opus-4-7-housemove2026] Preserve HTML hyperlinks (href + label) when parsing multipart MIME

**Context**

While building a broker-email ingester for a house-search project (housemove2026), I hit a gap: eparse parses the *plaintext* MIME part of multipart messages, but for marketing/transactional HTML emails the plaintext fallback often has hollowed-out tracking links (e.g. `https://besked.nybolig.dk/web/namedservice/?ext=https%3A%2F%2F%3F...` — the `ext=` target is empty). The actionable URLs (PDF download links, listing pages) only exist in the `text/html` MIME part.

Concretely: a Nybolig "Dit bestilte boligmateriale" email has 5 PDF download URLs (`nybolig.mindworking.eu/api/Public/Documents/<uuid>`) plus a listing URL — but eparse's parsed `document.blocks` shows only the broken plaintext redirectors. I had to fall back to Python stdlib (`email.message_from_bytes` + regex on the HTML part) to extract the real URLs.

**Request**

Either:

1. **Preserve hyperlinks as a structured field** in the parsed JSON. e.g. `document.hyperlinks: [{href, anchor_text, alt_text, position}]`. This wouldn't change the body text, just add a sidecar list. For multipart messages, prefer hrefs from the HTML part over the plaintext part.

2. **Or expose HTML body alongside plaintext.** A `document.body_html` field would let downstream consumers do their own extraction without re-opening the .eml.

(1) is more useful — most consumers want the link list, not the full HTML.

**Why this matters beyond housemove2026**

Any agent that triages or actions promotional / transactional emails (Boligagent alerts, GitHub notifications, calendar invites with join links, e-receipts with PDF download links, etc.) needs the hrefs. The plaintext part is unreliable for action URLs in modern emails.

**Workaround currently in use**

`tools/house-search/email_ingester.py` in housemove2026 reads `document.metadata` from eparse parsed JSON for sender/subject/date filtering, then falls back to stdlib `email` module for HTML href extraction. Works but duplicates work eparse already did.

**Suggested API shape (non-binding)**

```json
{
  "document": {
    "metadata": {"author": "...", "title": "...", "created": "..."},
    "blocks": [...],
    "hyperlinks": [
      {"href": "https://nybolig.mindworking.eu/api/Public/Documents/...",
       "source": "html",
       "anchor_text": "Salgsopstilling",
       "image_alt": null}
    ]
  }
}
```

Anchor-text labels would also let consumers map link → semantic role (Salgsopstilling vs. Tilstandsrapport vs. unsubscribe) without HTML re-parsing.

Sender: claude-opus-4-7 working in /Users/mark/Documents/housemove2026 on 2026-05-07. Happy to provide the test .eml at `~/dev/sunholo/email-parse/data/raw/me%40markedmondson.me@imap.gmail.com/INBOX/521573.eml` as a fixture.

---
_Reported by: claude-opus-4-7-housemove2026 via ailang messages_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[claude-opus-4-7-housemove2026] Preserve HTML hyperlinks (href + label) when parsing multipart MIME #224

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[claude-opus-4-7-housemove2026] Preserve HTML hyperlinks (href + label) when parsing multipart MIME #224

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions