Metadata recall + content quality improvements (defuddle-inspired) #3
Merged
:image is sourced from JSON-LD image (string or {url} object), then og:image,
then twitter:image. metadata-to-frontmatter emits image: alongside
icon:. upstream-frontmatter->metadata maps an upstream image: key to
:image instead of overloading :icon.
Fixes the markdown-negotiation follow-up where social-card images from
upstream markdown (e.g. cf-twitter-card.png) landed in the favicon slot.
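The fallback order described above can be sketched roughly as follows. This is an illustrative Python port, not the project's Clojure code: `json_ld_image` and `pick_image` are hypothetical names, and the inputs are assumed to be an already-parsed JSON-LD object plus a flat dict of meta-tag values.

```python
from typing import Any, Optional


def _clean(value: Any) -> Optional[str]:
    """Return a stripped non-blank string, or None for anything else."""
    if isinstance(value, str):
        stripped = value.strip()
        return stripped or None
    return None


def json_ld_image(value: Any) -> Optional[str]:
    """Accept a JSON-LD `image` given as a plain string or as a {"url": ...} object."""
    if isinstance(value, dict):
        return _clean(value.get("url"))
    return _clean(value)


def pick_image(json_ld: dict, meta: dict) -> Optional[str]:
    """First non-blank of: JSON-LD image, og:image, twitter:image (hypothetical helper)."""
    candidates = (
        json_ld_image(json_ld.get("image")),
        _clean(meta.get("og:image")),
        _clean(meta.get("twitter:image")),
    )
    return next((c for c in candidates if c is not None), None)
```

Keeping the image in its own slot means a favicon lookup never has to guess whether the value it received is a social card.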
Some CMSes leak unresolved templates into meta tags or JSON-LD —
{author.fullName}, #page.title, decorative '. -' strings. valid-metadata-value?
rejects these at the leaves (safe-attr, safe-text, get-json-ld-value,
get-json-ld-image) so first-non-blank chains fall through to the next
real source instead of accepting garbage that happens to be non-blank.
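A minimal sketch of the leaf-level check, written in Python for illustration. The regular expressions are assumptions about what counts as a template or anchor placeholder; the real `valid-metadata-value?` may use different rules. `first_valid` is a hypothetical stand-in for a first-non-blank chain.

```python
import re
from typing import Iterable, Optional

# Assumed patterns: a brace-delimited template anywhere in the value,
# or a value that consists solely of a dotted anchor such as "#page.title".
_TEMPLATE = re.compile(r"\{[^{}]*\}")
_ANCHOR = re.compile(r"#\w+(?:\.\w+)+")


def valid_metadata_value(value: Optional[str]) -> bool:
    """Reject blanks, unresolved templates, anchor placeholders, and decoration."""
    if value is None:
        return False
    text = value.strip()
    if not text:
        return False
    if _TEMPLATE.search(text) or _ANCHOR.fullmatch(text):
        return False
    # Decorative strings such as ". -" contain no letters or digits.
    return any(ch.isalnum() for ch in text)


def first_valid(candidates: Iterable[Optional[str]]) -> Optional[str]:
    """Return the first candidate that passes the check, stripped."""
    for candidate in candidates:
        if candidate is not None and valid_metadata_value(candidate):
            return candidate.strip()
    return None
```

With the check at the leaves, a chain such as `first_valid(["{author.fullName}", ". -", " Jane Doe "])` skips the two garbage values and yields the real one.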
Some sites mistakenly put the full article title in og:site_name. Reject anything over 6 words and fall through to the next fallback — keeps :sitename meaningful instead of leaking article titles into it.
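The guard amounts to a word count on the candidate before it is accepted. A rough Python sketch follows; `plausible_site_name` and `pick_site_name` are hypothetical names, and applying the limit only to `og:site_name` (not to the fallbacks) is an assumption.

```python
from typing import Optional

MAX_SITE_NAME_WORDS = 6  # threshold stated in the commit message


def plausible_site_name(value: Optional[str]) -> Optional[str]:
    """Return a whitespace-normalised site name, or None if blank or over the limit."""
    if not value:
        return None
    words = value.split()
    if not words or len(words) > MAX_SITE_NAME_WORDS:
        return None
    return " ".join(words)


def pick_site_name(og_site_name: Optional[str], *fallbacks: Optional[str]) -> Optional[str]:
    """Use og:site_name when it looks like a site name, else the first non-blank fallback."""
    guarded = plausible_site_name(og_site_name)
    if guarded is not None:
        return guarded
    for fallback in fallbacks:
        if fallback and fallback.strip():
            return fallback.strip()
    return None
```

A six-word name such as "The New York Review of Books" still passes; only values over six words are rejected.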
Read all JSON-LD script tags (not just the first), strip /* */ and // comments before parsing, flatten any @graph arrays, and prefer objects whose @type is article-like (Article, NewsArticle, BlogPosting, ScholarlyArticle, TechArticle, Report, WebPage, AboutPage). Recursively decode HTML entities in the chosen object so entities such as &amp; and &#39; don't leak into metadata strings. News and blog sites publish either multiple scripts (one Organization, one Article) or one @graph wrapping both. Previous code only saw the first object and missed metadata that lived in the article-typed entry.
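The pipeline described above could look roughly like this in Python. It is a sketch under assumptions, not the project's implementation. `choose_json_ld` is a hypothetical name, the input is assumed to be the raw text of every `application/ld+json` script on the page, and falling back to the first object when nothing is article-like is a guess. Comment stripping is string-aware so that `//` inside URLs survives.

```python
import html
import json
import re
from typing import Any, Iterator, List, Optional

ARTICLE_TYPES = {
    "Article", "NewsArticle", "BlogPosting", "ScholarlyArticle",
    "TechArticle", "Report", "WebPage", "AboutPage",
}

# Match a JSON string first so comment markers inside strings are left alone.
_STRING_OR_COMMENT = re.compile(r'"(?:\\.|[^"\\])*"|/\*.*?\*/|//[^\n]*', re.S)


def strip_comments(text: str) -> str:
    """Remove /* */ and // comments that appear outside JSON strings."""
    return _STRING_OR_COMMENT.sub(
        lambda m: m.group(0) if m.group(0).startswith('"') else "", text
    )


def flatten(node: Any) -> Iterator[dict]:
    """Yield every object, descending into top-level arrays and @graph arrays."""
    if isinstance(node, list):
        for item in node:
            yield from flatten(item)
    elif isinstance(node, dict):
        if isinstance(node.get("@graph"), list):
            yield from flatten(node["@graph"])
        else:
            yield node


def is_article_like(obj: dict) -> bool:
    types = obj.get("@type")
    if isinstance(types, str):
        types = [types]
    if not isinstance(types, list):
        return False
    return any(isinstance(t, str) and t in ARTICLE_TYPES for t in types)


def decode_entities(node: Any) -> Any:
    """Recursively decode HTML entities in every string value."""
    if isinstance(node, str):
        return html.unescape(node)
    if isinstance(node, list):
        return [decode_entities(item) for item in node]
    if isinstance(node, dict):
        return {key: decode_entities(value) for key, value in node.items()}
    return node


def choose_json_ld(scripts: List[str]) -> Optional[dict]:
    """Parse every script, then prefer the first article-like object."""
    objects: List[dict] = []
    for raw in scripts:
        try:
            objects.extend(flatten(json.loads(strip_comments(raw))))
        except json.JSONDecodeError:
            continue  # skip malformed scripts rather than failing the page
    if not objects:
        return None
    chosen = next((o for o in objects if is_article_like(o)), objects[0])
    return decode_entities(chosen)
```

Given one script holding an Organization and a second holding a `@graph` with a NewsArticle, the NewsArticle is returned even though it is neither first nor top-level.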
React/Next.js sites emit <div role=paragraph>, <div role=list>, <div role=listitem> that JSoup serializes as plain divs. The markdown converter then treats them as opaque blocks with no paragraph or list semantics. Convert these to <p>, <ul>, <li> early in clean-document so the rest of the pipeline sees real content structure.
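For illustration, the tag rewrite can be sketched with Python's standard-library `HTMLParser` in place of JSoup. `RoleDivRewriter` and this `standardize_semantic_divs` are hypothetical and only approximate the behaviour described. The sketch drops the `role` attribute on renamed tags, re-serializes self-closing tags as open/close pairs, and ignores comments and doctypes.

```python
import html
from html.parser import HTMLParser

ROLE_TO_TAG = {"paragraph": "p", "list": "ul", "listitem": "li"}


class RoleDivRewriter(HTMLParser):
    """Re-emit markup, renaming <div role=...> to the matching semantic tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []
        self.div_stack = []  # replacement tag name for each currently open <div>

    @staticmethod
    def _attrs(attrs) -> str:
        return "".join(
            f" {name}" if value is None
            else f' {name}="{html.escape(value, quote=True)}"'
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            role = (dict(attrs).get("role") or "").lower()
            new_tag = ROLE_TO_TAG.get(role, "div")
            if new_tag != "div":
                # The role is redundant once the tag itself carries the meaning.
                attrs = [(n, v) for n, v in attrs if n != "role"]
            self.div_stack.append(new_tag)
            tag = new_tag
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            tag = self.div_stack.pop()  # close with the name the div was opened as
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(html.escape(data, quote=False))


def standardize_semantic_divs(markup: str) -> str:
    parser = RoleDivRewriter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

The stack is what keeps nesting correct: a plain `<div>` inside a `role=list` div still closes as `</div>`, while the outer one closes as `</ul>`.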
decode-html-entities used (char (Integer/parseInt ...)) which truncates to 16 bits — &#128512; (U+1F600 😀) and other supplementary code points produced garbage. Switch to Character/toString(int) which handles BMP and non-BMP code points uniformly. standardize-semantic-divs now also clears the role attribute after renaming the tag (role=paragraph on a <p> is redundant noise).
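The difference is easy to see with U+1F600. The Python sketch below is illustrative only: `decode_numeric_entities` is a hypothetical helper, and Python's `chr` plays the role of `Character/toString(int)`, accepting any code point up to U+10FFFF. Masking to 16 bits first, as a `char` conversion effectively would, yields a private-use character instead of the emoji.

```python
import re

_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric_entities(text: str) -> str:
    """Decode decimal and hex numeric character references, including non-BMP ones."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave NUL, surrogates, and out-of-range values untouched.
        if code == 0 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)

    return _NUMERIC_ENTITY.sub(replace, text)


def truncated_to_16_bits(code: int) -> str:
    """What a 16-bit conversion would produce; shown only for comparison."""
    return chr(code & 0xFFFF)
```

`decode_numeric_entities("&#128512;")` returns the single code point U+1F600, whereas `truncated_to_16_bits(128512)` returns U+F600.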
- Add brew tap install instructions
- Update feature list to reflect current behaviour (markdown content negotiation, JSON-LD @graph walking, role-based pruning, semantic div standardisation)
- Add --version to options
- Refresh example output to include canonical-url, is-canonical, icon, image fields with a short note distinguishing icon from image
- Bump SDKMAN GraalVM hint from 22-graal to 25-graal
- Expand "How it works" with content negotiation and richer metadata description; mention Cloudflare-fronted sites and image dedupe in Special handling
446a826 to 7b16dc2
Summary
Five focused improvements to metadata extraction and content cleaning, drawn from a comparison with defuddle. The bigger architectural shifts (extractor registry, scoring-based main-content discovery) are deferred.
- :image as a first-class metadata field: sourced from JSON-LD image (string or {url} object), then og:image, then twitter:image. Fixes the markdown-negotiation follow-up where upstream image: frontmatter was mismapped onto :icon.
- valid-metadata-value? rejects unresolved templates ({author.fullName}), anchor placeholders (#author.name), and decorative strings without letters/digits at the leaves (safe-attr, safe-text, get-json-ld-value, get-json-ld-image).
- og:site_name word-count guard: sites occasionally put the article title in og:site_name. Reject anything over 6 words and fall through.
- @graph walking + multi-script + entity decode: read all <script type=application/ld+json> tags (not just the first), strip /* */ and // comments, flatten @graph arrays, prefer Article-typed objects, recursively decode HTML entities.
- Convert div[role=paragraph] → <p>, div[role=list] → <ul>, div[role=listitem] → <li> early in clean-document so React/Next.js sites with role-based markup produce real paragraph/list output.
Test plan
- clojure -M:test (99 tests / 255 assertions, 0 failures)
- … <p>/<ul>/<li>)
- clj-kondo (no new warnings)
- https://developers.cloudflare.com/docs-for-agents/index.md (image: populated from upstream YAML, icon: correctly empty)
- https://www.bbc.com/future/article/... (:image distinct from :icon: article hero vs favicon)
- https://www.wired.com/story/... (same; clean title/author/description from JSON-LD)
- https://www.dbreunig.com/... (single-script JSON-LD path, no regressions)
- Smoke test against a real page that uses <div role=paragraph> (worth doing before merging: none of the smoke URLs above exercise this code path, only the unit test)
Deferred
Captured for follow-up:
- articleBody fallback as content source