Metadata recall + content quality improvements (defuddle-inspired) #3

Merged
dazld merged 8 commits into main from dan/metadata-improvements on May 3, 2026

dazld commented May 3, 2026

Summary

Five focused improvements to metadata extraction and content cleaning, drawn from a comparison with defuddle. The bigger architectural shifts (extractor registry, scoring-based main-content discovery) are deferred.

  • :image as a first-class metadata field: sourced from JSON-LD image (string or {url} object), then og:image, then twitter:image. Fixes the markdown-negotiation follow-up where upstream image: frontmatter was mismapped onto :icon.
  • Placeholder filter for metadata strings: rejects unresolved template literals ({author.fullName}), anchor placeholders (#author.name), and decorative strings with no letters or digits. The check runs at the leaf accessors (safe-attr, safe-text, get-json-ld-value, get-json-ld-image).
  • og:site_name word-count guard: sites occasionally put the article title in og:site_name. Reject anything over 6 words and fall through.
  • JSON-LD @graph walking + multi-script + entity decode: read all <script type=application/ld+json> tags (not just the first), strip /* */ and // comments, flatten @graph arrays, prefer Article-typed objects, recursively decode HTML entities.
  • Semantic-div standardization: convert div[role=paragraph] → <p>, div[role=list] → <ul>, div[role=listitem] → <li> early in clean-document so React/Next.js sites with role-based markup produce real paragraph/list output.

Test plan

  • clojure -M:test — 99 tests / 255 assertions, 0 failures
  • New tests cover each commit (image extraction from og/twitter/JSON-LD/object, placeholder rejection, sitename guard, @graph walking, multi-script preference, entity decoding, JSON comment stripping, role-div conversion to <p>/<ul>/<li>)
  • clj-kondo — no new warnings
  • Manual smoke tests against the GraalVM native binary:
    • https://developers.cloudflare.com/docs-for-agents/index.md → image: populated from upstream YAML, icon: correctly empty
    • https://www.bbc.com/future/article/... → :image distinct from :icon (article hero vs favicon)
    • https://www.wired.com/story/... — same; clean title/author/description from JSON-LD
    • https://www.dbreunig.com/... — single-script JSON-LD path, no regressions
  • Manual smoke test on a site known to use <div role=paragraph> (worth doing before merging — none of the smoke URLs above exercise this code path, only the unit test)

Deferred

Captured for follow-up:

  • Schema.org articleBody fallback as content source
  • Tiered retry strategy on low word counts
  • H1-adjacent date/byline scan
  • Extractor registry pattern (architectural — separate effort)

dazld added 8 commits May 3, 2026 08:40
Sourced from JSON-LD image (string or {url} object), then og:image,
then twitter:image. metadata-to-frontmatter emits image: alongside
icon:. upstream-frontmatter->metadata maps an upstream image: key to
:image instead of overloading :icon.

Fixes the markdown-negotiation follow-up where social-card images from
upstream markdown (e.g. cf-twitter-card.png) landed in the favicon slot.
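
The precedence described in this commit can be sketched as follows. This is a minimal Python illustration, not the project's Clojure code: `json_ld_image` and `pick_image` are hypothetical names, the flat `meta` dict stands in for already-parsed meta tags, and the handling of list-valued JSON-LD `image` is an assumption the commit does not mention.

```python
from typing import Any, Optional


def json_ld_image(value: Any) -> Optional[str]:
    """Return an image URL from a JSON-LD `image` value.

    Handles a plain string and a {"url": ...} object, as the commit describes.
    List handling is an assumption added for this sketch.
    """
    if isinstance(value, str):
        return value.strip() or None
    if isinstance(value, dict):
        return json_ld_image(value.get("url"))
    if isinstance(value, list):
        for item in value:
            found = json_ld_image(item)
            if found:
                return found
    return None


def pick_image(json_ld: dict, meta: dict) -> Optional[str]:
    """Apply the precedence JSON-LD image, then og:image, then twitter:image."""
    candidates = (
        json_ld_image(json_ld.get("image")),
        (meta.get("og:image") or "").strip() or None,
        (meta.get("twitter:image") or "").strip() or None,
    )
    # First non-blank candidate wins; None when every source is missing.
    return next((c for c in candidates if c), None)
```

Keeping the result in its own field, separate from the favicon, is what lets a social-card image and an icon coexist in the emitted frontmatter.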

Some CMSes leak unresolved templates into meta tags or JSON-LD:
{author.fullName}, #page.title, decorative '. -' strings. valid-metadata-value?
rejects these at the leaves (safe-attr, safe-text, get-json-ld-value,
get-json-ld-image) so first-non-blank chains fall through to the next
real source instead of accepting garbage that happens to be non-blank.
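
A rough sketch of such a filter is shown below, in Python for illustration rather than the project's Clojure. The function names and both regular expressions are assumptions made for this example; in particular, requiring a dot in the anchor pattern (so that an ordinary `#hashtag` is not rejected) is a guess, not something the commit states.

```python
import re

# Unresolved template literal anywhere in the value, e.g. "{author.fullName}".
TEMPLATE_RE = re.compile(r"\{[^{}]*\}")
# Whole-value anchor placeholder with a dotted path, e.g. "#page.title".
ANCHOR_RE = re.compile(r"^#\w+(?:\.\w+)+$")


def valid_metadata_value(value):
    """Return True when the value looks like real metadata, not a placeholder."""
    if not isinstance(value, str):
        return False
    text = value.strip()
    if not text:
        return False
    if TEMPLATE_RE.search(text) or ANCHOR_RE.match(text):
        return False
    # Decorative strings such as ". -" contain no letters or digits.
    return any(ch.isalnum() for ch in text)


def first_valid(*candidates):
    """First candidate that passes the filter, stripped, or None."""
    return next((c.strip() for c in candidates if valid_metadata_value(c)), None)
```

Because the filter returns a plain boolean, a first-non-blank chain can treat a rejected placeholder exactly like a missing value and move on to the next source.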

Some sites mistakenly put the full article title in og:site_name.
Reject anything over 6 words and fall through to the next fallback —
keeps :sitename meaningful instead of leaking article titles into it.
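
The guard amounts to a word count on the candidate value. The Python sketch below is illustrative only: the function names and the generic `fallbacks` argument are hypothetical, since the commit does not say which source comes next in the chain.

```python
MAX_SITE_NAME_WORDS = 6  # threshold taken from the commit message


def plausible_site_name(value):
    """Return the stripped value if it has at most six words, else None."""
    if not value:
        return None
    text = value.strip()
    if not text or len(text.split()) > MAX_SITE_NAME_WORDS:
        return None
    return text


def pick_site_name(og_site_name, *fallbacks):
    """Prefer a plausible og:site_name, otherwise the first non-blank fallback."""
    guarded = plausible_site_name(og_site_name)
    if guarded:
        return guarded
    return next((f.strip() for f in fallbacks if f and f.strip()), None)
```

A value of exactly six words is still accepted here, matching "anything over 6 words" in the description.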

Read all JSON-LD script tags (not just the first), strip /* */ and
// comments before parsing, flatten any @graph arrays, and prefer
objects whose @type is article-like (Article, NewsArticle, BlogPosting,
ScholarlyArticle, TechArticle, Report, WebPage, AboutPage). Recursively
decode HTML entities in the chosen object so &amp; and &#39; don't leak
into metadata strings.

News and blog sites publish either multiple scripts (one Organization,
one Article) or one @graph wrapping both. Previous code only saw the
first object and missed metadata that lived in the article-typed entry.
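
The steps in this commit can be sketched end to end as below. This is a Python illustration under stated assumptions, not the project's Clojure implementation: all function names are hypothetical, the comment stripper is one string-aware approach chosen so that `//` inside URLs survives, and "prefer" is interpreted as "first article-like object, else the first object", which the commit does not spell out.

```python
import html
import json
import re

ARTICLE_TYPES = {
    "Article", "NewsArticle", "BlogPosting", "ScholarlyArticle",
    "TechArticle", "Report", "WebPage", "AboutPage",
}

# Match string literals first so comment markers inside strings are kept.
TOKEN_RE = re.compile(r'"(?:\\.|[^"\\])*"|/\*.*?\*/|//[^\n]*', re.DOTALL)


def strip_json_comments(text):
    """Remove /* */ and // comments that sit outside string literals."""
    return TOKEN_RE.sub(lambda m: m.group(0) if m.group(0).startswith('"') else "", text)


def flatten(parsed):
    """Yield candidate objects from top-level lists and @graph arrays."""
    if isinstance(parsed, list):
        for item in parsed:
            yield from flatten(item)
    elif isinstance(parsed, dict):
        graph = parsed.get("@graph")
        if isinstance(graph, list):
            yield from flatten(graph)
        else:
            yield parsed


def decode_entities(value):
    """Recursively decode HTML entities in every string of the structure."""
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, list):
        return [decode_entities(v) for v in value]
    if isinstance(value, dict):
        return {k: decode_entities(v) for k, v in value.items()}
    return value


def is_article_like(obj):
    types = obj.get("@type")
    types = types if isinstance(types, list) else [types]
    return any(isinstance(t, str) and t in ARTICLE_TYPES for t in types)


def pick_json_ld(script_bodies):
    """Choose one JSON-LD object from the text of every ld+json script tag."""
    objects = []
    for body in script_bodies:
        try:
            objects.extend(flatten(json.loads(strip_json_comments(body))))
        except json.JSONDecodeError:
            continue  # skip malformed scripts instead of failing the page
    if not objects:
        return None
    chosen = next((o for o in objects if is_article_like(o)), objects[0])
    return decode_entities(chosen)
```

With one Organization script and one `@graph` script that wraps a NewsArticle, `pick_json_ld` returns the NewsArticle with its strings entity-decoded, which is the case the previous first-object-only behaviour missed.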

React/Next.js sites emit <div role=paragraph>, <div role=list>,
<div role=listitem> that JSoup serializes as plain divs. The markdown
converter then treats them as opaque blocks with no paragraph or list
semantics. Convert these to <p>, <ul>, <li> early in clean-document
so the rest of the pipeline sees real content structure.
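
The conversion is essentially a tag rename keyed on the role attribute. The sketch below uses Python's `xml.etree.ElementTree` on a well-formed fragment purely as a stand-in for the JSoup document the project works with; `ROLE_TO_TAG` and `standardize_semantic_divs` are names chosen for this example, and dropping the `role` attribute after the rename follows the later fix-up commit in this PR.

```python
import xml.etree.ElementTree as ET

# Role value on a <div> mapped to the semantic tag it should become.
ROLE_TO_TAG = {"paragraph": "p", "list": "ul", "listitem": "li"}


def standardize_semantic_divs(root):
    """Rename role-based divs to <p>/<ul>/<li> in place and drop the role."""
    # Snapshot the matches first so renaming does not disturb iteration.
    for el in list(root.iter("div")):
        new_tag = ROLE_TO_TAG.get(el.get("role", "").strip().lower())
        if new_tag:
            el.tag = new_tag
            del el.attrib["role"]
    return root


fragment = ET.fromstring(
    '<article>'
    '<div role="list"><div role="listitem">One</div></div>'
    '<div role="paragraph">Hi</div>'
    '<div class="x">keep</div>'
    '</article>'
)
converted = ET.tostring(standardize_semantic_divs(fragment), encoding="unicode")
```

Divs without one of the three roles are left untouched, so ordinary layout wrappers pass through unchanged.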

decode-html-entities used (char (Integer/parseInt ...)) which truncates
to 16 bits — &#128512; (U+1F600 😀) and other supplementary code points
produced garbage. Switch to Character/toString(int) which handles BMP
and non-BMP code points uniformly.
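
The difference between the two behaviours can be illustrated in Python (not the project's Clojure). `truncate_to_utf16_unit` models the 16-bit truncation as the commit message describes it, and `decode_numeric_entities` is a hypothetical decoder that works on full code points; hex entity support and the out-of-range guard are assumptions of this sketch.

```python
import re

# Decimal (&#128512;) or hexadecimal (&#x1F600;) numeric character reference.
NUMERIC_ENTITY_RE = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def truncate_to_utf16_unit(code_point):
    """Model a 16-bit char conversion: only the low 16 bits survive."""
    return chr(code_point & 0xFFFF)


def decode_numeric_entities(text):
    """Decode numeric entities using the full code point, BMP or not."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code_point > 0x10FFFF:
            return match.group(0)  # not a valid code point: leave as-is
        return chr(code_point)

    return NUMERIC_ENTITY_RE.sub(replace, text)
```

For U+1F600, the truncated form keeps only 0xF600, a private-use code point unrelated to the intended emoji, while the full-code-point decoder returns the emoji itself.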

standardize-semantic-divs now also clears the role attribute after
renaming the tag (role=paragraph on a <p> is redundant noise).

- Add brew tap install instructions
- Update feature list to reflect current behaviour (markdown content
  negotiation, JSON-LD @graph walking, role-based pruning, semantic
  div standardisation)
- Add --version to options
- Refresh example output to include canonical-url, is-canonical, icon,
  image fields with a short note distinguishing icon from image
- Bump SDKMAN GraalVM hint from 22-graal to 25-graal
- Expand "How it works" with content negotiation and richer metadata
  description; mention Cloudflare-fronted sites and image dedupe in
  Special handling

dazld force-pushed the dan/metadata-improvements branch from 446a826 to 7b16dc2 on May 3, 2026 08:06
dazld merged commit 87a3baa into main on May 3, 2026
2 checks passed
dazld deleted the dan/metadata-improvements branch on May 3, 2026 08:08