AGENTS.md - Readability JS Server

Project Overview

Readability JS Server is an HTTP microservice that wraps Mozilla's Readability.js library to extract clean, readable content from web pages. It provides a simple REST API endpoint that accepts a URL and returns the parsed article content with metadata.

Purpose

Extract readable content from web pages
Remove ads, navigation, and other clutter
Return structured article data (title, content, excerpt, etc.)
Deploy as a containerized service

Architecture

Technology Stack

Runtime: Node.js 20 (Alpine Linux)
Framework: Express.js 5.x
Core Library: @mozilla/readability 0.6.0
DOM Processing: jsdom 27.4.0
Content Sanitization: DOMPurify 3.3.1
HTTP Client: axios 1.13.2
Process Manager: PM2 (5 instances in production)
Logging: log-timestamp

Service Architecture

Single Express application (src/app.js)
One POST endpoint at root (/)
Stateless service (no session/state management)
Runs multiple PM2 instances for load distribution

API Specification

Endpoint

POST /

Request Format

Content-Type: application/json

Body:

{
  "url": "https://example.com/article"
}

Required Fields:

url (string): The URL of the web page to extract content from

Response Format

Success Response (HTTP 200):

{
  "url": "https://example.com/article",
  "title": "Article Title",
  "byline": "Author Name",
  "dir": "ltr",
  "content": "<div>...sanitized HTML content...</div>",
  "length": 12345,
  "excerpt": "Article excerpt...",
  "siteName": "Site Name"
}

Error Responses:

400 Bad Request: Invalid or missing URL

{
  "error": "Send JSON, like so: {\"url\": \"https://url/to/whatever\"}"
}

500 Internal Server Error: Failed to fetch or parse content

{
  "error": "Some weird error fetching the content",
  "details": { ...error object... }
}
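
A client can turn the two documented error statuses into distinct exceptions. The following is a minimal sketch, assuming the body shapes shown above; the exception classes and the `check_response` helper are hypothetical names chosen for illustration and are not part of the service.

```python
class BadRequestError(Exception):
    """Raised for HTTP 400: the request body was invalid or had no url."""


class ExtractionError(Exception):
    """Raised for HTTP 500: the service failed to fetch or parse the page."""

    def __init__(self, message, details=None):
        super().__init__(message)
        # The shape of "details" is not specified by the service; keep it opaque.
        self.details = details


def check_response(status, body):
    """Return body on HTTP 200, otherwise raise for the documented errors."""
    if status == 200:
        return body
    is_dict = isinstance(body, dict)
    message = body.get("error", "unknown error") if is_dict else str(body)
    if status == 400:
        raise BadRequestError(message)
    if status == 500:
        raise ExtractionError(message, body.get("details") if is_dict else None)
    # Any other status is undocumented; surface it without interpretation.
    raise RuntimeError(f"unexpected status {status}: {message}")
```

Keeping the two cases separate lets a caller fix its own request on a 400 and decide separately whether a 500 is worth retrying.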

Response Properties

Apart from url, which is echoed back from the request, the returned properties match Mozilla's Readability.js parse output:

url: The requested URL (echoed back)
title: Article title
byline: Author information (may be null)
dir: Text direction ("ltr" or "rtl")
content: Sanitized HTML content (allows iframe and video tags for media)
length: Character count of the content
excerpt: Article excerpt/summary
siteName: Site name (may be null)
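
The property list above maps naturally onto a small typed record on the client side. This sketch assumes the field names shown in the success response; the `Article` class, its `from_response` method, and the choice of which fields are treated as optional are illustrative assumptions rather than anything the service defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Article:
    url: str
    title: str
    content: str
    length: int
    # Documented as possibly null, or assumed optional for this sketch.
    excerpt: Optional[str] = None
    byline: Optional[str] = None
    dir: Optional[str] = None
    site_name: Optional[str] = None

    @classmethod
    def from_response(cls, payload: dict) -> "Article":
        """Build an Article from the JSON body of a successful response."""
        return cls(
            url=payload["url"],
            title=payload["title"],
            content=payload["content"],
            length=payload["length"],
            excerpt=payload.get("excerpt"),
            byline=payload.get("byline"),
            dir=payload.get("dir"),
            site_name=payload.get("siteName"),  # camelCase in the JSON
        )
```

A missing required key raises `KeyError`, which makes a malformed response fail at the boundary instead of deeper in the calling code.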

Usage Examples

cURL Example

curl -XPOST http://localhost:3000/ \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/Firefox"}'

JavaScript/Node.js Example

const axios = require('axios');

async function extractContent(url) {
  try {
    const response = await axios.post('http://localhost:3000/', {
      url: url
    });
    return response.data;
  } catch (error) {
    console.error('Error:', error.response?.data || error.message);
    throw error;
  }
}

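// Usage
extractContent('https://example.com/article')
  .then(data => {
    console.log('Title:', data.title);
    console.log('Content length:', data.length);
    console.log('Excerpt:', data.excerpt);
  })
  .catch(() => {
    // The error was already logged in extractContent; mark the run as failed.
    process.exitCode = 1;
  });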

Python Example

import requests

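def extract_content(url):
    # json= serializes the body and sets the Content-Type header itself.
    response = requests.post(
        'http://localhost:3000/',
        json={'url': url},
        timeout=30  # seconds; avoids hanging on a slow source page
    )
    # Raise requests.HTTPError on 4xx/5xx responses.
    response.raise_for_status()
    return response.json()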

# Usage
data = extract_content('https://example.com/article')
print(f"Title: {data['title']}")
print(f"Content length: {data['length']}")
print(f"Excerpt: {data['excerpt']}")

Development

Prerequisites

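Node.js >= 10 is the stated minimum, but Node.js 20 (as used by the Docker image) is the safer choice, since Express 5.x requires Node.js 18 or later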
Yarn package manager

Local Setup

# Install dependencies
yarn install

# Start development server (with nodemon)
yarn start

Available Commands

yarn start: Start development server with auto-reload
yarn prettier -c src/: Check code formatting
yarn prettier -w src/: Fix code formatting

Makefile Targets

make install: Install dependencies
make start: Start the server
make lint: Check code formatting
make lint-fix: Fix code formatting
make build-container: Build Docker image
make run-container: Run Docker container
make example-request: Test the API with example request

Deployment

Docker

Image: phpdockerio/readability-js-server

Supported Architectures:

linux/amd64
linux/arm64
linux/arm/v7 (up to version 1.5.0)

Versioning: Uses semantic versioning with tags:

latest: Latest version
x.x.x: Specific version (e.g., 1.7.2)
x.x: Minor version (e.g., 1.7)
x: Major version (e.g., 1)
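
The tag scheme above can be derived mechanically from a release version. The helper below is a small sketch of that derivation; `image_tags` and its `is_latest` flag are hypothetical names, and whether a given release also carries the latest tag is an assumption the caller has to supply.

```python
def image_tags(version: str, is_latest: bool = False) -> list[str]:
    """Return the tags a release like '1.7.2' would carry under the scheme above."""
    parts = version.split(".")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected a version of the form x.x.x, got {version!r}")
    major, minor, _patch = parts
    # Specific version, minor version, major version, in that order.
    tags = [version, f"{major}.{minor}", major]
    if is_latest:
        tags.insert(0, "latest")
    return tags
```

Pinning to the minor tag, for example `phpdockerio/readability-js-server:1.7`, picks up patch releases without moving to a new minor version.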

Run Container:

docker run -p 3000:3000 phpdockerio/readability-js-server

Production Configuration:

Runs on port 3000
Uses PM2 with 5 instances
Node.js 20 Alpine base image
Non-root user (readability)

Environment Variables

The service currently exposes no configuration options and runs with these fixed defaults:

Port: 3000
PM2 instances: 5
Environment: production (when using Docker)

Technical Details

Content Sanitization

Uses DOMPurify to sanitize fetched HTML
Allows iframe and video tags (for YouTube videos and media content)
Removes potentially dangerous scripts and elements

Processing Flow

Receive POST request with URL
Fetch HTML content from URL using axios
Sanitize HTML with DOMPurify
Parse HTML with jsdom
Extract readable content with Readability.js
Return structured JSON response

Error Handling

Validates URL presence in request body
Handles HTTP fetch errors
Handles parsing errors
Returns appropriate HTTP status codes
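
The processing flow and error handling described above can be modelled in a few lines. This is a language-neutral sketch written in Python, not the service's actual code (which is JavaScript in src/app.js): the `handle_request` function and its `fetch`, `sanitize`, and `extract` parameters are stand-ins for axios, DOMPurify, and jsdom plus Readability.js, and the shape of `details` is an assumption.

```python
from typing import Callable


def handle_request(
    body: object,
    fetch: Callable[[str], str],
    sanitize: Callable[[str], str],
    extract: Callable[[str], dict],
) -> tuple[int, dict]:
    """Model the documented flow; returns (HTTP status, JSON payload)."""
    # Step 1: validate that the request body carries a url.
    url = body.get("url") if isinstance(body, dict) else None
    if not isinstance(url, str) or not url:
        return 400, {"error": 'Send JSON, like so: {"url": "https://url/to/whatever"}'}
    try:
        html = fetch(url)         # Step 2: fetch the page
        clean = sanitize(html)    # Step 3: sanitize the HTML
        article = extract(clean)  # Steps 4-5: parse the DOM and extract the article
    except Exception as exc:
        # Fetch and parse failures both map to a 500 response.
        return 500, {
            "error": "Some weird error fetching the content",
            "details": {"message": str(exc)},  # assumed shape
        }
    # Step 6: return the extracted fields with the url echoed back.
    return 200, {"url": url, **article}
```

Passing the three stages in as callables keeps the sketch testable without any network access.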

Logging

Uses log-timestamp for timestamped console logs
Logs fetch operations and success/failure
Logs server startup with version information

Limitations & Considerations

No Authentication: The service has no built-in authentication or rate limiting
Single Endpoint: Only one endpoint (POST /) is available
No Caching: Each request fetches content fresh from the source
Content Sanitization: iframe and video tags pass through sanitization, so the returned HTML can embed third-party content and should be treated as untrusted when rendered
Error Messages: Generic error messages may not provide detailed debugging information
No Configuration: Hardcoded port (3000) and PM2 instances (5)
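
Because the service does no caching, a client that requests the same URL repeatedly may want a small cache of its own. The sketch below is one possible approach, assuming an in-memory cache with a fixed time-to-live is acceptable; the `TTLCache` class and its parameters are hypothetical and not provided by the service.

```python
import time
from typing import Callable


class TTLCache:
    """Cache extraction results per URL for a fixed number of seconds."""

    def __init__(
        self,
        extract: Callable[[str], dict],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._extract = extract  # function that actually calls the service
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so tests can control time
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, url: str) -> dict:
        now = self._clock()
        hit = self._entries.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the network call
        result = self._extract(url)  # miss or expired entry: call the service
        self._entries[url] = (now, result)
        return result
```

Failed extractions are not cached, since an exception from `extract` propagates before anything is stored.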

Integration Guidelines for AI Agents

When to Use This Service

Extract readable content from web pages
Remove navigation, ads, and boilerplate
Get structured article metadata
Process web content for analysis or storage

Best Practices

Error Handling: Always handle HTTP errors and check response status
URL Validation: Validate URLs before sending requests
Timeout Handling: Implement request timeouts for reliability
Rate Limiting: Implement client-side rate limiting if needed
Content Validation: Verify response structure before processing
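
Several of the practices above (URL validation, timeouts, and response validation) fit into one small client. This is a sketch using only the Python standard library, assuming the service is reachable at the default address from the earlier examples; `is_http_url`, `missing_fields`, `extract`, and the `EXPECTED_FIELDS` list are illustrative names and choices, not part of the service.

```python
import json
import urllib.request
from urllib.parse import urlparse

SERVICE_URL = "http://localhost:3000/"  # default address used in the examples
# Assumed minimum set of fields a caller relies on.
EXPECTED_FIELDS = ("url", "title", "content", "length", "excerpt")


def is_http_url(url: str) -> bool:
    """Accept only absolute http(s) URLs with a host part."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def missing_fields(payload: object) -> list[str]:
    """List the expected fields that are absent from a response payload."""
    if not isinstance(payload, dict):
        return list(EXPECTED_FIELDS)
    return [name for name in EXPECTED_FIELDS if name not in payload]


def extract(url: str, timeout: float = 30.0) -> dict:
    """Validate the URL, call the service with a timeout, and check the result."""
    if not is_http_url(url):
        raise ValueError(f"not an http(s) URL: {url!r}")
    request = urllib.request.Request(
        SERVICE_URL,
        data=json.dumps({"url": url}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    missing = missing_fields(payload)
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    return payload
```

Rejecting bad URLs before the request saves a round trip that would otherwise end in a 400 or 500 from the service.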

Common Use Cases

Content aggregation services
Article readers and parsers
Content analysis tools
Web scraping pipelines
RSS feed enhancement

Version Information

Current version: 1.7.2

Version is stored in the release file and displayed on server startup.

License

Apache-2.0

Repository

https://github.com/phpdocker-io/readability-js-server
