Metadata recall + content quality improvements (defuddle-inspired) by dazld · Pull Request #3 · dazld/r11y

dazld · 2026-05-03T07:51:59Z

Summary

Five focused improvements to metadata extraction and content cleaning, drawn from a comparison with defuddle. The bigger architectural shifts (extractor registry, scoring-based main-content discovery) are deferred.

:image as a first-class metadata field — sourced from JSON-LD image (string or {url} object), then og:image, then twitter:image. Fixes the markdown-negotiation follow-up where upstream image: frontmatter mismapped onto :icon.
Placeholder filter for metadata strings — rejects unresolved template literals ({author.fullName}), anchor placeholders (#author.name), and decorative strings without letters/digits at the leaves (safe-attr, safe-text, get-json-ld-value, get-json-ld-image).
og:site_name word-count guard — sites occasionally put the article title in og:site_name. Reject anything over 6 words and fall through.
JSON-LD @graph walking + multi-script + entity decode — read all <script type=application/ld+json> tags (not just the first), strip /* */ and // comments, flatten @graph arrays, prefer Article-typed objects, recursively decode HTML entities.
Semantic-div standardization — convert div[role=paragraph] → <p>, div[role=list] → <ul>, div[role=listitem] → <li> early in clean-document so React/Next.js sites with role-based markup produce real paragraph/list output.

Test plan

clojure -M:test — 99 tests / 255 assertions, 0 failures
New tests cover each commit (image extraction from og/twitter/JSON-LD/object, placeholder rejection, sitename guard, @graph walking, multi-script preference, entity decoding, JSON comment stripping, role-div conversion to <p>/<ul>/<li>)
clj-kondo — no new warnings
Manual smoke tests against the GraalVM native binary:
- https://developers.cloudflare.com/docs-for-agents/index.md — image: populated from upstream YAML, icon: correctly empty
- https://www.bbc.com/future/article/... — :image distinct from :icon (article hero vs favicon)
- https://www.wired.com/story/... — same; clean title/author/description from JSON-LD
- https://www.dbreunig.com/... — single-script JSON-LD path, no regressions
Manual smoke test on a site known to use <div role=paragraph> (worth doing before merging — none of the smoke URLs above exercise this code path, only the unit test)

Deferred

Captured for follow-up:

Schema.org articleBody fallback as content source
Tiered retry strategy on low word counts
H1-adjacent date/byline scan
Extractor registry pattern (architectural — separate effort)

:image is sourced from JSON-LD image (string or {url} object), then og:image, then twitter:image. metadata-to-frontmatter emits image: alongside icon:. upstream-frontmatter->metadata maps an upstream image: key to :image instead of overloading :icon. Fixes the markdown-negotiation follow-up where social-card images from upstream markdown (e.g. cf-twitter-card.png) landed in the favicon slot.
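The fallback order above can be sketched roughly as follows. This is only an illustration, not the code from this PR: pick-image, json-ld-image, and the already-parsed string-keyed maps are assumptions made for the example.

```clojure
(require '[clojure.string :as str])

(defn json-ld-image
  "Hypothetical helper: returns the URL from a JSON-LD image value,
  which may be a plain string or a map with a \"url\" key."
  [v]
  (cond
    (string? v) v
    (map? v)    (get v "url")
    :else       nil))

(defn pick-image
  "Hypothetical helper: first non-blank of JSON-LD image, og:image,
  twitter:image. Both arguments are assumed to be parsed maps."
  [json-ld meta-tags]
  (->> [(json-ld-image (get json-ld "image"))
        (get meta-tags "og:image")
        (get meta-tags "twitter:image")]
       (remove str/blank?)
       first))

(pick-image {"image" {"url" "https://example.com/hero.jpg"}}
            {"og:image" "https://example.com/og.png"})
;; => "https://example.com/hero.jpg"
```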

Some CMSes leak unresolved templates into meta tags or JSON-LD — {author.fullName}, #page.title, decorative '. -' strings. valid-metadata-value? rejects these at the leaves (safe-attr, safe-text, get-json-ld-value, get-json-ld-image) so first-non-blank chains fall through to the next real source instead of accepting garbage that happens to be non-blank.
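A minimal sketch of such a filter is shown below. The function name matches the one mentioned above, but the body and the exact regular expressions are assumptions for illustration; the real patterns in the PR may differ (for instance, this version would also reject a legitimate title that happens to contain braces).

```clojure
(require '[clojure.string :as str])

(defn valid-metadata-value?
  "Illustrative version only; the patterns are assumptions."
  [s]
  (boolean
   (and (string? s)
        (not (str/blank? s))
        ;; unresolved template literal, e.g. {author.fullName}
        (not (re-find #"\{[^{}]*\}" s))
        ;; anchor-style placeholder, e.g. #author.name
        (not (re-matches #"#[A-Za-z_][\w-]*(\.[\w-]+)+" (str/trim s)))
        ;; decorative strings: require at least one letter or digit
        (re-find #"[\p{L}\p{N}]" s))))

;; A first-non-blank style chain then skips the garbage values:
(->> ["{author.fullName}" ". -" "Jane Doe"]
     (filter valid-metadata-value?)
     first)
;; => "Jane Doe"
```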

Some sites mistakenly put the full article title in og:site_name. Reject anything over 6 words and fall through to the next fallback — keeps :sitename meaningful instead of leaking article titles into it.
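The guard amounts to a word count on the candidate value. A small sketch, where plausible-sitename? and max-sitename-words are hypothetical names chosen for this example:

```clojure
(require '[clojure.string :as str])

(def max-sitename-words 6)

(defn plausible-sitename?
  "True when s is non-blank and has at most max-sitename-words
  whitespace-separated words."
  [s]
  (and (string? s)
       (not (str/blank? s))
       (<= (count (str/split (str/trim s) #"\s+"))
           max-sitename-words)))

(plausible-sitename? "BBC Future")
;; => true
(plausible-sitename? "Why the ocean is getting louder and what it means")
;; => false
```

A caller would treat a false result the same way as a missing og:site_name and move on to the next fallback.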

Read all JSON-LD script tags (not just the first), strip /* */ and // comments before parsing, flatten any @graph arrays, and prefer objects whose @type is article-like (Article, NewsArticle, BlogPosting, ScholarlyArticle, TechArticle, Report, WebPage, AboutPage). Recursively decode HTML entities in the chosen object so that encoded forms such as &amp; and &#39; don't leak into metadata strings. News and blog sites publish either multiple scripts (one Organization, one Article) or one @graph wrapping both. The previous code only saw the first object and missed metadata that lived in the article-typed entry.
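The flatten-and-prefer step can be sketched on already-parsed data as below. This assumes each script has been parsed into a string-keyed map by some JSON library; flatten-graph, article-like?, and pick-json-ld are hypothetical names, and falling back to the first object when nothing is article-like is an assumption of the sketch. Comment stripping and entity decoding are left out.

```clojure
(def article-types
  #{"Article" "NewsArticle" "BlogPosting" "ScholarlyArticle"
    "TechArticle" "Report" "WebPage" "AboutPage"})

(defn flatten-graph
  "Expands any object carrying an @graph array into its members;
  other objects pass through unchanged."
  [objects]
  (mapcat (fn [obj]
            (if (sequential? (get obj "@graph"))
              (get obj "@graph")
              [obj]))
          objects))

(defn article-like?
  "@type may be a single string or a collection of strings."
  [obj]
  (let [t (get obj "@type")]
    (boolean (some article-types (if (coll? t) t [t])))))

(defn pick-json-ld
  "Prefers the first article-like object, else the first object
  (the fallback is an assumption of this sketch)."
  [objects]
  (let [flat (flatten-graph objects)]
    (or (first (filter article-like? flat))
        (first flat))))

(pick-json-ld
 [{"@type" "Organization" "name" "Example News"}
  {"@graph" [{"@type" "WebSite" "name" "Example News"}
             {"@type" "NewsArticle" "headline" "A headline"}]}])
;; => {"@type" "NewsArticle", "headline" "A headline"}
```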

React/Next.js sites emit <div role=paragraph>, <div role=list>, <div role=listitem> that JSoup serializes as plain divs. The markdown converter then treats them as opaque blocks with no paragraph or list semantics. Convert these to <p>, <ul>, <li> early in clean-document so the rest of the pipeline sees real content structure.
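A rough sketch of the conversion using Jsoup's Element.tagName and removeAttr, assuming Jsoup is on the classpath. The function name is borrowed from a later commit message in this PR, but the body here is an illustration rather than the actual implementation; clearing the role attribute reflects the follow-up fix described further down.

```clojure
(import '(org.jsoup Jsoup)
        '(org.jsoup.nodes Document Element))

(def role->tag
  {"paragraph" "p"
   "list"      "ul"
   "listitem"  "li"})

(defn standardize-semantic-divs!
  "Renames div[role=paragraph|list|listitem] in place and drops the
  now-redundant role attribute. Returns the mutated document."
  [^Document doc]
  (doseq [[role tag] role->tag
          ^Element el (.select doc (str "div[role=" role "]"))]
    (.tagName el tag)
    (.removeAttr el "role"))
  doc)

(def doc
  (standardize-semantic-divs!
   (Jsoup/parse (str "<div role=\"paragraph\">Hi</div>"
                     "<div role=\"list\">"
                     "<div role=\"listitem\">one</div>"
                     "</div>"))))

(count (.select doc "ul > li"))
;; => 1
```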

decode-html-entities used (char (Integer/parseInt ...)), which truncates to 16 bits, so numeric references to supplementary code points such as U+1F600 (😀) produced garbage. Switch to Character/toString(int), which handles BMP and non-BMP code points uniformly. standardize-semantic-divs now also clears the role attribute after renaming the tag (role=paragraph on a <p> is redundant noise).
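For illustration, a numeric-entity decoder built on Character/toString(int) might look like the sketch below. decode-numeric-entities is a hypothetical name, the sketch covers only numeric references (not named entities such as &amp;), does no validation of out-of-range values, and needs Java 11 or later for the int overload.

```clojure
(require '[clojure.string :as str])

(defn decode-numeric-entities
  "Decodes &#NNN; and &#xHHH; references, including code points
  above U+FFFF, which become a surrogate pair in the result."
  [s]
  (str/replace s #"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));"
               (fn [[_ hex decimal]]
                 (let [cp (if hex
                            (Integer/parseInt hex 16)
                            (Integer/parseInt decimal 10))]
                   ;; (int cp) selects the toString(int) overload
                   (Character/toString (int cp))))))

(decode-numeric-entities "&#128512; &#x1F600; &#65;")
;; => "😀 😀 A"
```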

- Add brew tap install instructions
- Update feature list to reflect current behaviour (markdown content negotiation, JSON-LD @graph walking, role-based pruning, semantic div standardisation)
- Add --version to options
- Refresh example output to include canonical-url, is-canonical, icon, image fields with a short note distinguishing icon from image
- Bump SDKMAN GraalVM hint from 22-graal to 25-graal
- Expand "How it works" with content negotiation and richer metadata description; mention Cloudflare-fronted sites and image dedupe in Special handling

dazld added 8 commits May 3, 2026 08:40

guard og:site_name against being over 6 words

181330b

restore babashka compatibility note + bb usage example, small cleanup

7b16dc2

dazld force-pushed the dan/metadata-improvements branch from 446a826 to 7b16dc2 May 3, 2026 08:06

dazld merged commit 87a3baa into main May 3, 2026
2 checks passed

dazld deleted the dan/metadata-improvements branch May 3, 2026 08:08

dazld merged 8 commits into main from dan/metadata-improvements
