diff --git a/README.md b/README.md
index 0f4f4f4..2d5401e 100644
--- a/README.md
+++ b/README.md
@@ -1,30 +1,35 @@
 # r11y
 
-A fast, GraalVM-compiled CLI tool for extracting readable content from web pages as Markdown.
+A lightning fast, GraalVM-compiled CLI tool for extracting readable content from web pages as Markdown.
+
+`r11y` as in `readability` - or "oh rlly?" if you're ancient and remember the terrible owl meme.
 
 ## Features
 
 - Extract main content from any URL as clean Markdown
-- Preserves whitespace in preformatted blocks
-- Rich metadata extraction with YAML frontmatter (title, author, date, description)
-- JSON-LD structured data support
+- **Preserves whitespace** in preformatted blocks
+- Rich metadata extraction with YAML frontmatter (title, author, date, description, canonical URL, hero image, favicon, sitename)
+- JSON-LD structured data support, including `@graph` walking and multi-script preference for article-typed objects
+- Markdown content negotiation — sends `Accept: text/markdown` and recognises markdown bodies even when servers mis-label them as `text/html` (e.g. Cloudflare-fronted docs)
+- Standardises React/Next.js semantic divs (`role=paragraph`, `role=list`) into proper HTML so content structure survives extraction
+- Removes decorative SVGs, spacer images, layout tables, and duplicated UI chrome
 - GitHub-optimized extraction (README files, blob content)
 - Configurable link density threshold for content filtering
+- Babashka-compatible — usable from `bb` scripts via `:git/tag` deps, no GraalVM required
 - Fast startup with GraalVM native compilation (~40ms)
 
-## Notes on repo
-
-This is a personal tool I've been using in my own projects - I specifically wanted a way to get URLs without clobbering the whitespace,
-and I couldn't find a tool that did that. I've used and recommend trafilatura before - but given it collapsed whitespace, and was very much
-a python project, I wanted to explore building a Clojure & Graal tool to do similar, and here we go.
+## Installation
 
-It's not as battle-tested as other more mature extraction tools, but PRs are welcome to improve this.
+### Homebrew (macOS arm64, Linux x86_64)
 
-## Installation
+```bash
+brew tap dazld/tap
+brew install r11y
+```
 
-### Prebuilt Binary (Linux x86_64)
+### Prebuilt Binary
 
-Download the latest binary from [GitHub Releases](https://github.com/dazld/r11y/releases).
+Download the latest binary for macOS (arm64) or Linux (x86_64) from [GitHub Releases](https://github.com/dazld/r11y/releases).
 
 ### Quick Build
 
@@ -70,8 +75,8 @@ brew install --cask graalvm-jdk
 
 **Option 2: Using SDKMAN:**
 ```bash
-sdk install java 22-graal
-sdk use java 22-graal
+sdk install java 25-graal
+sdk use java 25-graal
 ```
 
 #### Building
@@ -109,6 +114,7 @@ r11y --help
 
 - `-m, --with-metadata` - Include YAML frontmatter with metadata (title, author, date, description, etc.)
 - `-l, --link-density N` - Link density threshold 0-1 (default: 0.5). Lower values are more aggressive at filtering link-heavy content.
+- `-v, --version` - Show version
 - `-h, --help` - Show help message
 
 ### Example Output with Metadata
@@ -117,16 +123,22 @@ r11y --help
 ---
 title: Intelligence on Earth Evolved Independently at Least Twice
 author: Yasemin Saplakoglu
-url: https://www.wired.com/story/intelligence-evolved...
+url: https://www.wired.com/story/intelligence-evolved-at-least-twice-in-vertebrate-animals/
+canonical-url: https://www.wired.com/story/intelligence-evolved-at-least-twice-in-vertebrate-animals/
+is-canonical: true
 hostname: www.wired.com
-description: Complex neural circuits likely arose independently...
+description: Complex neural circuits likely arose independently in birds and mammals...
 sitename: WIRED
 date: 2025-05-11T07:00:00.000-04:00
+icon: https://www.wired.com/verso/static/wired-us/assets/favicon.ico
+image: https://media.wired.com/photos/.../NeuralIntelligence-crSamanthaMash-Lede.jpeg
 ---
 
 # Article content here...
 ```
 
+`icon` is the site favicon (largest available, Apple touch icon preferred). `image` is the article hero / social-card image (`og:image` / `twitter:image` / JSON-LD `image`).
+
 ## Development
 
 ### Run with Clojure CLI
@@ -141,23 +153,35 @@ clj -M -m r11y.core https://example.com
 clj -e "(require '[r11y.lib.html :as html]) (println (html/extract-content-from-url \"https://clojure.org\" :format :markdown))"
 ```
 
+### Use from a babashka script
+
+```bash
+bb -Sdeps '{:deps {io.github.dazld/r11y {:git/tag "v1.0.5" :git/sha "aabc910"}}}' \
+  -e '(require (quote [r11y.lib.html :as html]))
+      (println (:markdown (html/extract-content-from-url "https://example.com" :format :markdown)))'
+```
+
+No GraalVM required — bb resolves the dep, downloads JSoup transitively, and runs the extractor. Useful for one-off scripts where you don't want to install the native binary.
+
 ## How it works
 
 r11y uses content extraction algorithms inspired by Mozilla's Readability and trafilatura to identify and extract the main content from web pages:
 
-1. **Metadata extraction**: Pulls structured data from JSON-LD, OpenGraph tags, meta tags, `