📚 ClawDocs

ClawDocs is an async documentation scraper that fetches, parses, and merges the official OpenClaw documentation into clean, category-organized Markdown files — optimized for AI ingestion (NotebookLM, RAG pipelines, LLM context windows).

✨ Features

Fetches all URLs from the OpenClaw sitemap automatically
Filters out irrelevant locales and platform-specific pages
Extracts main prose content (skips navbars, sidebars)
Groups output by documentation category into single merged .md files
Fetches the live CHANGELOG.md directly from the OpenClaw GitHub repo
Sends async HTTP requests in batches to stay polite to the remote server
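As a rough illustration of the sitemap step in the list above, the sketch below pulls every `<loc>` entry out of a sitemap document using only the standard library. The function name `extract_urls` and the sample page paths are hypothetical; the actual parsing code in scrape_docs.py may look different.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample data, not the real OpenClaw sitemap.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.openclaw.ai/setup/install</loc></url>
  <url><loc>https://docs.openclaw.ai/plugins/overview</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_urls(SAMPLE):
        print(url)
```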

🚀 Quick Start

Prerequisites

Python ≥ 3.11
uv (recommended) or pip

Run with `uv` (zero install friction)

# Standard run
uv run scrape_docs.py

# Clean output dir first (removes stale files)
uv run scrape_docs.py --clean

# Preview what would be scraped without writing files
uv run scrape_docs.py --dry-run

# Custom output dir, larger batches, longer timeout
uv run scrape_docs.py --output-dir ./my-docs --batch-size 20 --timeout 60

uv automatically resolves the inline PEP 723 dependencies (httpx, beautifulsoup4, markdownify).
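For readers unfamiliar with PEP 723, an inline metadata header of the following shape at the top of a script is what lets uv resolve dependencies without a separate install step. This is a sketch assuming unpinned dependencies; the real header in scrape_docs.py may pin versions or list additional fields.

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "httpx",
#     "beautifulsoup4",
#     "markdownify",
# ]
# ///
```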

Run with `pip`

pip install httpx beautifulsoup4 markdownify
python scrape_docs.py [flags]

All CLI flags

--output-dir PATH   Output directory (default: openclaw-docs-merged)
--batch-size N      Concurrent requests per batch (default: 15)
--timeout SECS      HTTP timeout in seconds (default: 30)
--clean             Delete output dir before run (no stale files)
--dry-run           Print URLs that would be scraped; write nothing

📂 Output Structure

The script writes merged Markdown files to ./openclaw-docs-merged/:

openclaw-docs-merged/
├── index.md          # Root-level docs
├── setup.md          # Setup & installation pages
├── configuration.md  # Configuration reference
├── plugins.md        # Plugin system docs
├── changelog.md      # Live CHANGELOG from GitHub
└── ...               # One file per top-level doc category
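One plausible way to arrive at this layout is to group pages by the first segment of their URL path. The sketch below assumes that rule, and also assumes that pages sitting directly under the site root belong in index.md. The helper names `category_for` and `group_by_category` are hypothetical and not taken from scrape_docs.py.

```python
from collections import defaultdict
from urllib.parse import urlparse


def category_for(url: str) -> str:
    """Map a docs URL to a category name based on its first path segment."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Assumption: pages with zero or one path segment are root-level docs.
    return segments[0] if len(segments) > 1 else "index"


def group_by_category(urls: list[str]) -> dict[str, list[str]]:
    """Collect URLs under the category that names their output file."""
    groups: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        groups[category_for(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical page paths, for illustration only.
    sample = [
        "https://docs.openclaw.ai/",
        "https://docs.openclaw.ai/setup/install",
        "https://docs.openclaw.ai/plugins/overview",
    ]
    for category, members in group_by_category(sample).items():
        print(f"{category}.md <- {len(members)} page(s)")
```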

⚙️ Configuration

Tunable values can be set via CLI flags or environment variables. When both are given, the CLI flag takes precedence. `--clean` and `--dry-run` are available as flags only.

| Env var | CLI flag | Default | Purpose |
| --- | --- | --- | --- |
| `CLAWD_OUTPUT_DIR` | `--output-dir` | `openclaw-docs-merged` | Output directory |
| `CLAWD_BATCH_SIZE` | `--batch-size` | `15` | Concurrent requests per batch |
| `CLAWD_TIMEOUT` | `--timeout` | `30` | HTTP timeout (seconds) |
| (none) | `--clean` | off | Wipe output dir before run |
| (none) | `--dry-run` | off | Preview only, no writes |
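The precedence rule can be sketched with argparse by using each environment variable as the default for its flag, so that an explicit flag always wins. This is a minimal sketch, not the script's actual parser: `build_parser` is a hypothetical name, and treating the timeout as a float is an assumption.

```python
import argparse
import os
from collections.abc import Mapping


def build_parser(env: Mapping[str, str] = os.environ) -> argparse.ArgumentParser:
    """Build a parser whose defaults come from CLAWD_* environment variables."""
    parser = argparse.ArgumentParser(prog="scrape_docs.py")
    parser.add_argument(
        "--output-dir",
        default=env.get("CLAWD_OUTPUT_DIR", "openclaw-docs-merged"),
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=int(env.get("CLAWD_BATCH_SIZE", "15")),
    )
    parser.add_argument(
        "--timeout",
        type=float,  # assumption: fractional seconds are acceptable
        default=float(env.get("CLAWD_TIMEOUT", "30")),
    )
    # These two switches have no environment variable equivalent.
    parser.add_argument("--clean", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With `CLAWD_BATCH_SIZE=5` exported, running with `--batch-size 20` yields a batch size of 20, while running without the flag yields 5.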

Hardcoded values (edit scrape_docs.py directly):

| Constant | Default | Purpose |
| --- | --- | --- |
| `SITEMAP_URL` | `https://docs.openclaw.ai/sitemap.xml` | Source sitemap |
| `CHANGELOG_URL` | GitHub raw CHANGELOG.md URL | Live changelog source |
| `EXCLUDE_PATTERNS` | `/ja-JP`, `/zh-CN`, iOS, Mac paths | URL patterns to skip |
| `BATCH_DELAY` | `0.5` | Sleep between batches (seconds) |
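To show how `EXCLUDE_PATTERNS` and `BATCH_DELAY` might interact with the batch size, here is a small offline sketch. It assumes the patterns are matched as plain substrings and lists only the two locale patterns named above, since the exact iOS and Mac patterns are not given here. `keep_url`, `fetch_in_batches`, and `fake_fetch` are hypothetical names; the real script uses httpx for the requests.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Only the two locale patterns named in the table; the iOS and Mac
# patterns are left out because their exact form is not shown there.
EXCLUDE_PATTERNS = ("/ja-JP", "/zh-CN")
BATCH_DELAY = 0.5  # seconds to sleep between batches


def keep_url(url: str) -> bool:
    """Return True when the URL contains none of the exclude patterns."""
    return not any(pattern in url for pattern in EXCLUDE_PATTERNS)


async def fetch_in_batches(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    batch_size: int = 15,
    delay: float = BATCH_DELAY,
) -> list[str]:
    """Run `fetch` concurrently within each batch, pausing between batches."""
    results: list[str] = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start : start + batch_size]
        # gather() returns results in the same order as its arguments.
        results.extend(await asyncio.gather(*(fetch(u) for u in batch)))
        # Skip the pause after the final batch.
        if start + batch_size < len(urls):
            await asyncio.sleep(delay)
    return results


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP GET so the sketch runs offline."""
    await asyncio.sleep(0)
    return f"content of {url}"


if __name__ == "__main__":
    candidates = [
        "https://docs.openclaw.ai/setup/install",
        "https://docs.openclaw.ai/ja-JP/setup/install",
    ]
    kept = [u for u in candidates if keep_url(u)]
    print(asyncio.run(fetch_in_batches(kept, fake_fetch, batch_size=2)))
```

Pausing between batches rather than between individual requests keeps throughput reasonable while still capping the number of requests in flight at the batch size.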

🗺️ Roadmap

NotebookLM Sync: Direct integration via the NotebookLM API (or via Perplexity MCP pipelines) to autonomously create and update notebooks when upstream OpenClaw changes are detected.

🤝 Contributing

See CONTRIBUTING.md.

🔒 Security

See SECURITY.md.

📄 License

MIT © Johan

🤖 For AI Agents

If you are an AI agent analyzing or interacting with this repository, welcome! Here is some context to help you operate flawlessly here:

Deterministic Output: scrape_docs.py sorts pages by URL before writing to files. This minimizes diff noise, making PRs and historical tracking highly stable.
Issue Tracking (Linear): We use Linear for project management with bi-directional GitHub sync. To link your work to an issue, prefix your branch name or PR title with the Linear issue key (e.g., CV-101). Do not close issues manually; merging the PR that fixes an issue transitions it automatically.
CI/CD Restrictions: The main branch is strictly protected. Do not attempt direct pushes.
- Before creating a PR, run `uv run scrape_docs.py --dry-run` and `uv tool run ruff check .` locally to confirm the CI (Lint & Test workflow) will pass.
Semantic Releases: We use automated release-please. Always use Conventional Commits (feat:, fix:, docs:, chore:, etc.) so that changelog generation works automatically.
Context Window Friendly: The merged Markdown files in ./openclaw-docs-merged/ are designed specifically to be ingested back into RAG, NotebookLM, or standard LLM context windows. They omit noisy HTML like footers and navigation sidebars.
Legal Authority: The script respects robots.txt automation permissions and attributes MIT licensing in every generated output block to remain entirely legally compliant. You are safe to trigger full runs and analyze the subsequent data.
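
The deterministic-output point above can be illustrated with a short sketch: if pages are always merged in sorted URL order, the resulting file is identical regardless of the order in which responses arrive. The function name `merge_pages`, the source comment, and the `---` separator are hypothetical and only stand in for whatever layout scrape_docs.py actually writes.

```python
def merge_pages(pages: dict[str, str]) -> str:
    """Join page bodies into one Markdown document, ordered by URL."""
    sections = []
    for url in sorted(pages):
        # Hypothetical per-page header; the real output format may differ.
        sections.append(f"<!-- source: {url} -->\n\n{pages[url].strip()}\n")
    return "\n---\n\n".join(sections)


if __name__ == "__main__":
    # Hypothetical page bodies, inserted out of order on purpose.
    fetched = {
        "https://docs.openclaw.ai/setup/upgrade": "# Upgrade\n",
        "https://docs.openclaw.ai/setup/install": "# Install\n",
    }
    print(merge_pages(fetched))
```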
