Readability JS Server is an HTTP microservice that wraps Mozilla's Readability.js library to extract clean, readable content from web pages. It provides a simple REST API endpoint that accepts a URL and returns the parsed article content with metadata.
- Extract readable content from web pages
- Remove ads, navigation, and other clutter
- Return structured article data (title, content, excerpt, etc.)
- Deploy as a containerized service
- Runtime: Node.js 20 (Alpine Linux)
- Framework: Express.js 5.x
- Core Library: @mozilla/readability 0.6.0
- DOM Processing: jsdom 27.4.0
- Content Sanitization: DOMPurify 3.3.1
- HTTP Client: axios 1.13.2
- Process Manager: PM2 (5 instances in production)
- Logging: log-timestamp
- Single Express application (src/app.js)
- One POST endpoint at root (/)
- Stateless service (no session/state management)
- Runs multiple PM2 instances for load distribution
    POST /
    Content-Type: application/json

Body:

    {
      "url": "https://example.com/article"
    }

Required Fields:
- url (string): The URL of the web page to extract content from
Success Response (HTTP 200):

    {
      "url": "https://example.com/article",
      "title": "Article Title",
      "byline": "Author Name",
      "dir": "ltr",
      "content": "<div>...sanitized HTML content...</div>",
      "length": 12345,
      "excerpt": "Article excerpt...",
      "siteName": "Site Name"
    }

Error Responses:
- 400 Bad Request: Invalid or missing URL

      { "error": "Send JSON, like so: {\"url\": \"https://url/to/whatever\"}" }

- 500 Internal Server Error: Failed to fetch or parse content

      { "error": "Some weird error fetching the content", "details": { ...error object... } }
Apart from url, which the service adds, the returned properties correspond to fields of Mozilla's Readability.js parse output:
- url: The requested URL (echoed back)
- title: Article title
- byline: Author information (may be null)
- dir: Text direction ("ltr" or "rtl")
- content: Sanitized HTML content (allows iframe and video tags for media)
- length: Character count of the content
- excerpt: Article excerpt/summary
- siteName: Site name (may be null)
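A client may want to check the shape of a successful response before using it. The following is a minimal Python sketch under stated assumptions: the Article class and the parse_article helper are hypothetical names that are not part of the service, and the choice to require only url, content, and length while treating every other field as optional is an assumption, made because the property list only marks byline and siteName as nullable.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Article:
    """Typed view of a successful response (hypothetical client-side type)."""
    url: str
    content: str
    length: int
    title: Optional[str] = None
    byline: Optional[str] = None
    dir: Optional[str] = None
    excerpt: Optional[str] = None
    site_name: Optional[str] = None


def parse_article(payload: Mapping[str, Any]) -> Article:
    """Validate a decoded JSON response body and convert it to an Article."""
    # Assumption: only these three fields are treated as mandatory.
    for key, expected in (("url", str), ("content", str), ("length", int)):
        value = payload.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            raise ValueError(f"field {key!r} is missing or not {expected.__name__}")
    return Article(
        url=payload["url"],
        content=payload["content"],
        length=payload["length"],
        title=payload.get("title"),
        byline=payload.get("byline"),
        dir=payload.get("dir"),
        excerpt=payload.get("excerpt"),
        site_name=payload.get("siteName"),
    )
```

Calling parse_article on the decoded JSON of a 200 response yields an Article; a body that lacks content, for example, raises ValueError instead of failing later in the pipeline.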
cURL:

    curl -XPOST http://localhost:3000/ \
      -H "Content-Type: application/json" \
      -d '{"url": "https://en.wikipedia.org/wiki/Firefox"}'

JavaScript:

    const axios = require('axios');

    // POST the target URL to the service and return the parsed article
    async function extractContent(url) {
      try {
        const response = await axios.post('http://localhost:3000/', { url });
        return response.data;
      } catch (error) {
        // error.response is set when the service replied with an error status
        console.error('Error:', error.response?.data || error.message);
        throw error;
      }
    }

    // Usage
    extractContent('https://example.com/article')
      .then(data => {
        console.log('Title:', data.title);
        console.log('Content length:', data.length);
        console.log('Excerpt:', data.excerpt);
      })
      .catch(() => {
        // Already logged in extractContent; avoid an unhandled rejection
        process.exitCode = 1;
      });

Python:

    import requests

    def extract_content(url):
        """POST the target URL to the service and return the parsed article."""
        response = requests.post(
            'http://localhost:3000/',
            json={'url': url},
            headers={'Content-Type': 'application/json'}
        )
        # Raise requests.HTTPError on 400/500 responses
        response.raise_for_status()
        return response.json()

    # Usage
    data = extract_content('https://example.com/article')
    print(f"Title: {data['title']}")
    print(f"Content length: {data['length']}")
    print(f"Excerpt: {data['excerpt']}")

- Node.js >= 10
- Yarn package manager
    # Install dependencies
    yarn install
    # Start development server (with nodemon)
    yarn start

- yarn start: Start development server with auto-reload
- yarn prettier -c src/: Check code formatting
- yarn prettier -w src/: Fix code formatting

- make install: Install dependencies
- make start: Start the server
- make lint: Check code formatting
- make lint-fix: Fix code formatting
- make build-container: Build Docker image
- make run-container: Run Docker container
- make example-request: Test the API with example request
Image: phpdockerio/readability-js-server
Supported Architectures:
- linux/amd64
- linux/arm64
- linux/arm/v7 (up to version 1.5.0)

Versioning: Uses semantic versioning with tags:
- latest: Latest version
- x.x.x: Specific version (e.g., 1.7.2)
- x.x: Minor version (e.g., 1.7)
- x: Major version (e.g., 1)
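The tag scheme implies that each release is reachable under four tags. As an illustration only, the tags for a given version could be derived as follows; image_tags is a hypothetical helper and not part of the project's release tooling.

```python
from typing import List


def image_tags(version: str) -> List[str]:
    """Return the image tags implied by a semantic version such as '1.7.2'."""
    # Unpacking raises ValueError unless there are exactly three parts.
    major, minor, patch = version.split(".")
    if not all(part.isdigit() for part in (major, minor, patch)):
        raise ValueError(f"not a numeric x.x.x version: {version!r}")
    return ["latest", f"{major}.{minor}.{patch}", f"{major}.{minor}", major]


print(image_tags("1.7.2"))  # ['latest', '1.7.2', '1.7', '1']
```

Pinning to the full x.x.x tag gives reproducible deployments, while the x.x and x tags follow newer releases within that minor or major line.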
Run Container:
    docker run -p 3000:3000 phpdockerio/readability-js-server

Production Configuration:
- Runs on port 3000
- Uses PM2 with 5 instances
- Node.js 20 Alpine base image
- Non-root user (readability)
Currently, no configuration is required. The service runs with defaults:
- Port: 3000
- PM2 instances: 5
- Environment: production (when using Docker)
- Uses DOMPurify to sanitize fetched HTML
- Allows iframe and video tags (for YouTube videos and media content)
- Removes potentially dangerous scripts and elements
- Receive POST request with URL
- Fetch HTML content from URL using axios
- Sanitize HTML with DOMPurify
- Parse HTML with jsdom
- Extract readable content with Readability.js
- Return structured JSON response
- Validates URL presence in request body
- Handles HTTP fetch errors
- Handles parsing errors
- Returns appropriate HTTP status codes
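The processing steps and the error handling listed above can be modelled schematically. The sketch below is written in Python for illustration and is not the service's code in src/app.js: handle_request is a hypothetical name, the fetch, sanitize, and extract callables stand in for axios, DOMPurify, and jsdom plus Readability.js, and the shape of the details object is an assumption because the document only describes it as an error object.

```python
from typing import Any, Callable, Dict, Tuple

# Error messages as documented for the 400 and 500 responses.
BAD_REQUEST_MESSAGE = 'Send JSON, like so: {"url": "https://url/to/whatever"}'
SERVER_ERROR_MESSAGE = "Some weird error fetching the content"


def handle_request(
    body: Any,
    fetch: Callable[[str], str],
    sanitize: Callable[[str], str],
    extract: Callable[[str, str], Dict[str, Any]],
) -> Tuple[int, Dict[str, Any]]:
    """Return (status, json_body) for one request, following the listed steps."""
    # Validate URL presence in the request body.
    url = body.get("url") if isinstance(body, dict) else None
    if not isinstance(url, str) or not url:
        return 400, {"error": BAD_REQUEST_MESSAGE}
    try:
        html = fetch(url)              # fetch HTML content from the URL
        clean = sanitize(html)         # sanitize the fetched HTML
        article = extract(clean, url)  # parse the DOM and extract the article
    except Exception as exc:
        # Assumption: details carries only the error message in this model.
        details = {"message": str(exc)}
        return 500, {"error": SERVER_ERROR_MESSAGE, "details": details}
    # Echo the requested URL back alongside the extracted fields.
    return 200, {"url": url, **article}
```

Passing the stages in as callables keeps the model independent of any HTTP or DOM library, which makes the mapping from failure mode to status code easy to see.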
- Uses log-timestamp for timestamped console logs
- Logs fetch operations and success/failure
- Logs server startup with version information
- No Authentication: The service has no built-in authentication or rate limiting
- Single Endpoint: Only one endpoint (POST /) is available
- No Caching: Each request fetches content fresh from the source
- Content Sanitization: Allows iframes and videos, which may have security implications
- Error Messages: Generic error messages may not provide detailed debugging information
- No Configuration: Hardcoded port (3000) and PM2 instances (5)
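Because the service does no caching, a client that requests the same URL repeatedly may want to cache results itself. The following is a minimal sketch, assuming an in-memory cache with a fixed time-to-live is acceptable. CachedExtractor and its parameters are hypothetical, and the extract argument stands for any function that calls the service for one URL.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CachedExtractor:
    """Wrap an extract function with a per-URL, time-limited in-memory cache."""

    def __init__(
        self,
        extract: Callable[[str], Any],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._extract = extract
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, url: str) -> Any:
        """Return a cached result if it is younger than the TTL, else refetch."""
        now = self._clock()
        hit = self._entries.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        result = self._extract(url)
        self._entries[url] = (now, result)
        return result
```

A failed call raises before anything is stored, so errors are never cached. The cache grows without bound in this sketch; a long-running client would need eviction.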
- Extract readable content from web pages
- Remove navigation, ads, and boilerplate
- Get structured article metadata
- Process web content for analysis or storage
- Error Handling: Always handle HTTP errors and check response status
- URL Validation: Validate URLs before sending requests
- Timeout Handling: Implement request timeouts for reliability
- Rate Limiting: Implement client-side rate limiting if needed
- Content Validation: Verify response structure before processing
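Several of these practices can be combined in one small client. The sketch below uses only the Python standard library so that it is self-contained. SERVICE_URL reflects the default port given in this document, while is_http_url, MinIntervalLimiter, the 30-second timeout, and this version of extract_content are illustrative assumptions rather than anything the service prescribes.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Optional
from urllib.parse import urlparse

SERVICE_URL = "http://localhost:3000/"  # default port from this document


def is_http_url(candidate: str) -> bool:
    """URL validation: accept only absolute http(s) URLs with a host."""
    parts = urlparse(candidate)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


class MinIntervalLimiter:
    """Client-side rate limiting: enforce a minimum gap between calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def extract_content(
    url: str,
    limiter: Optional[MinIntervalLimiter] = None,
    timeout: float = 30.0,
) -> Dict[str, Any]:
    """Validate the URL, respect the limiter, and POST with a timeout."""
    if not is_http_url(url):
        raise ValueError(f"not an http(s) URL: {url!r}")
    if limiter is not None:
        limiter.wait()
    request = urllib.request.Request(
        SERVICE_URL,
        data=json.dumps({"url": url}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for the 400 and 500 responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A caller would create one limiter, for example MinIntervalLimiter(1.0), and pass it to every extract_content call so that requests to the service are spaced at least one second apart.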
- Content aggregation services
- Article readers and parsers
- Content analysis tools
- Web scraping pipelines
- RSS feed enhancement
Current version: 1.7.2
Version is stored in the release file and displayed on server startup.
Apache-2.0