SiteOne Crawler

SiteOne Crawler is a powerful and easy-to-use website analyzer, cloner, and converter designed for developers seeking security and performance insights, SEO specialists identifying optimization opportunities, and website owners needing reliable backups and offline versions.

SiteOne Crawler is now rewritten in Rust for maximum performance, minimal resource usage, and zero runtime dependencies. The transition from PHP+Swoole to Rust made execution 25% faster and cut memory consumption by 30%, while producing identical output.

Discover the SiteOne Crawler advantage:

Run Anywhere: Single native binary for 🪟 Windows, 🍎 macOS, and 🐧 Linux (x64 & arm64). No runtime dependencies.
Work Your Way: Launch the binary without arguments for an interactive wizard 🧙 with 10 preset modes, use the extensive command-line interface 📟 (releases, ▶️ video) for automation and power, or enjoy the intuitive desktop GUI application 💻 (GUI app, ▶️ video) for visual control.
Rich Output Formats: Interactive HTML audit report 📊 with sortable tables and quality scoring (0.0-10.0) (see nextjs.org sample), detailed JSON for programmatic consumption, and human-readable text for terminal. Send HTML reports directly to your inbox via built-in SMTP mailer 📧.
CI/CD Integration: Built-in quality gate (--ci) with configurable thresholds — exit code 10 on failure enables automated deployment blocking. Also useful for cache warming — crawling the entire site after deployment populates your reverse proxy/CDN cache.
Offline & Markdown Power: Create complete offline clones 💾 for browsing without a server (nextjs.org clone) or convert entire websites into clean Markdown 📝 — perfect for backups, documentation, or feeding content to AI models (examples).
Deep Crawling & Analysis: Thoroughly crawl every page and asset, identify errors (404s, redirects), generate sitemaps 🗺️, and even get email summaries 📧 (watch ▶️ video example).
Learn More: Dive into the 🌐 Project Website, explore the detailed Documentation, or check the JSON/Text output specs.

GIF animation of the crawler in action (also available as a ▶️ video):

✨ Features

In short, the main benefits can be summarized in these points:

🕷️ Crawler - very powerful crawler of the entire website reporting useful information about each URL (status code, response time, size, custom headers, titles, etc.)
🛠️ Dev/DevOps assistant - offers stress/load testing with configurable concurrent workers (--workers) and request rate (--max-reqs-per-sec), cache warming, localhost testing, and rich URL/content-type filtering
📊 Analyzer - analyzes all webpages and reports unusual or erroneous behaviour and useful statistics (404, redirects, bad practices, SEO and security issues, heading structures, etc.)
📧 Reporter - interactive HTML audit report, structured JSON, and colored text output; built-in SMTP mailer sends HTML reports directly to your inbox
💾 Offline website generator - clone entire websites to browsable local HTML files (no server needed) including all assets. Supports multi-domain clones — include subdomains or external domains with intelligent cross-linking.
📝 Website to markdown converter - export the entire website to browsable text markdown (viewable on GitHub or any text editor), or generate a single-file markdown with smart header/footer deduplication — ideal for feeding to AI tools. Includes a built-in web server that renders markdown exports as styled HTML pages. See markdown examples.
🗺️ Sitemap generator - allows you to generate sitemap.xml and sitemap.txt files with a list of all pages on your website
🏆 Quality scoring - automatic quality scoring (0.0-10.0) across 5 categories: Performance, SEO, Security, Accessibility, Best Practices
🔄 CI/CD quality gate - configurable thresholds with exit code 10 on failure for automated pipelines; also useful as a post-deployment cache warmer for reverse proxies and CDNs

The features are described in greater detail below:

🕷️ Crawler

all major platforms supported without dependencies (🐧 Linux, 🪟 Windows, 🍎 macOS, arm64) — single native binary
has incredible 🚀 native Rust performance with async I/O and multi-threaded crawling
provides simulation of different device types (desktop/mobile/tablet) thanks to predefined User-Agents
will crawl all files, styles, scripts, fonts, images, documents, etc. on your website
will respect the robots.txt file and will not crawl the pages that are not allowed
has a beautiful interactive and 🎨 colourful output
clearly warns you ⚠️ about incorrect use of the tool (e.g. invalid input parameters or wrong permissions)
the --url parameter also accepts a sitemap.xml file (or sitemap index), which is processed as a list of URLs. In sitemap-only mode, the crawler follows only URLs from the sitemap and does not discover additional links from HTML pages. Gzip-compressed sitemaps (*.xml.gz) are fully supported, both as direct URLs and when referenced from sitemap index files.
respects the HTML <base href> tag when resolving relative URLs on pages that use it.

🛠️ Dev/DevOps assistant

allows testing public and local projects on specific ports (e.g. http://localhost:3000/)
works as a stress/load tester — configure the number of concurrent workers (--workers) and the maximum requests per second (--max-reqs-per-sec) to simulate various traffic levels and test your infrastructure's resilience against high load or DoS scenarios
combine with rich filtering options — include/ignore URLs by regex (--include-regex, --ignore-regex), disable specific asset types (--disable-javascript, --disable-images, etc.), or limit crawl depth (--max-depth) to focus the load on specific parts of your website
will help you warm up the application cache or the cache on the reverse proxy of the entire website
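
The cache-warming and load-testing bullets above can be sketched as two small shell wrappers. This is only an illustration: the function names, the `CRAWLER` variable, and the worker and rate values are hypothetical, and `--no-cache` is added on the assumption that you want every request to reach the server rather than the crawler's own HTTP response cache. Only run the heavier variant against infrastructure you own.

```shell
#!/bin/sh
# Path or name of the crawler binary (hypothetical variable for this sketch).
CRAWLER="${CRAWLER:-siteone-crawler}"

# Gentle post-deployment crawl: few workers, low request rate.
# $1 = start URL
warm_cache() {
  "$CRAWLER" --url="$1" --workers=2 --max-reqs-per-sec=5 --no-cache
}

# Heavier crawl focused on one part of the site.
# $1 = start URL, $2 = regex selecting the URLs to hit
stress_section() {
  "$CRAWLER" --url="$1" --include-regex="$2" \
    --workers=10 --max-reqs-per-sec=50 \
    --add-random-query-params --no-cache
}

# Example calls (commented out so that sourcing this file sends no traffic):
# warm_cache https://my.domain.tld/
# stress_section https://my.domain.tld/ '/^\/public\//'
```

Keeping the two profiles in separate functions makes it harder to point the aggressive settings at a production site by accident.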

📊 Analyzer

will find the weak points or strange behavior of your website
built-in analyzers cover SEO, security headers, accessibility, best practices, performance, SSL/TLS, caching, and more

📧 Reporter

Three output formats:

Interactive HTML report — a self-contained .html file with sortable tables, quality scores, color-coded findings, and sections for SEO, security, accessibility, performance, headers, redirects, 404s, and more. Open it in any browser — no server needed.
JSON output — structured data with all crawled URLs, response details, analysis findings, scores, and CI/CD gate results. Ideal for programmatic consumption, dashboards, and integrations.
Text output — human-readable colored terminal output with tables, progress bars, and summaries.

Additional reporting features:

Built-in SMTP mailer — send the HTML audit report directly to one or more email addresses via your own SMTP server. Configure sender, recipients, subject template, and SMTP credentials via CLI options.
will provide you with data for SEO analysis; just add the Title, Keywords and Description extra columns
will provide useful summaries and statistics at the end of the processing

💾 Offline website generator

will help you export the entire website to offline form, where it is possible to browse the site through local HTML files (without HTTP server) including all documents, images, styles, scripts, fonts, etc.
supports multi-domain clones — include subdomains (*.mysite.tld) or entirely different domains in a single offline export. All URLs across included domains are intelligently rewritten to relative paths, so the resulting offline version cross-links pages between domains seamlessly — you get one unified browsable clone.
you can limit which assets are downloaded and exported (see the --disable-* directives). For some types of websites, the best result is achieved with the --disable-javascript option.
you can use --allowed-domain-for-external-files (short -adf) to specify the external domains from which assets (JS, CSS, fonts, images, documents) may be downloaded; use * to allow all domains.
you can use --allowed-domain-for-crawling (short -adc) to specify which other domains are included in the crawl when links point to them. For example, mysite.* exports all language mutations that use a different TLD, and *.mysite.tld exports all subdomains.
you can use --single-page to export only the page at the given URL (and its assets) without following links to other pages.
you can use --single-foreign-page to export only the linked page from another domain (if allowed by --allowed-domain-for-crawling) without following its links to other pages.
you can use --replace-content to replace content in HTML/JS/CSS, either as a plain foo -> bar replacement or as a regexp in PCRE format, e.g. /card[0-9]/i -> card. Can be specified multiple times.
you can use --replace-query-string to replace characters of the query string in exported filenames.
you can use --max-depth to set the maximum crawling depth (for pages, not assets). 1 means /about or /about/, 2 means /about/contacts etc.
you can use it to export your website to a static form and host it on GitHub Pages, Netlify, Vercel, etc. as a static backup and part of your disaster recovery plan or archival/legal needs
works great with older conventional websites but also modern ones, built on frameworks like Next.js, Nuxt.js, SvelteKit, Astro, Gatsby, etc. When a JS framework is detected, the export also performs some framework-specific code modifications for optimal results.
try it for your website, and you will be very pleasantly surprised :-)
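
As a rough sketch of how the options above combine, the wrapper below clones a site together with its subdomains and any external assets. The function name and the `CRAWLER` variable are hypothetical; all flags are taken from the list above, and `--disable-javascript` is included only because it is suggested there for some kinds of sites.

```shell
#!/bin/sh
# Path or name of the crawler binary (hypothetical variable for this sketch).
CRAWLER="${CRAWLER:-siteone-crawler}"

# Clone a website into a local directory.
# $1 = start URL
# $2 = export directory
# $3 = additional domains to crawl, e.g. '*.mysite.tld'
clone_site() {
  "$CRAWLER" --url="$1" \
    --offline-export-dir="$2" \
    --allowed-domain-for-crawling="$3" \
    --allowed-domain-for-external-files='*' \
    --disable-javascript
}

# Example call (commented out so that sourcing this file starts no crawl):
# clone_site https://mysite.tld/ tmp/mysite.tld '*.mysite.tld'
```

The wildcard values are single-quoted so the shell passes them to the crawler unchanged instead of expanding them against local filenames.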

📝 Website to markdown converter

Two export modes:

Multi-file markdown — exports the entire website with all subpages to a directory of browsable .md files. The markdown renders nicely when uploaded to GitHub, viewed in VS Code, or any text editor. Links between pages are converted to relative .md links so you can navigate between files. Optionally includes images and other files (PDF, etc.).
Single-file markdown — combines all pages into one large markdown file with smart removal of duplicate website headers and footers across pages. Ideal for feeding entire website content to AI tools (ChatGPT, Claude, etc.) that process markdown more effectively than raw HTML.

Smart conversion features:

collapsible accordions — large link lists (menus, navigation, footer links with 8+ items) are automatically collapsed into <details> accordions with contextual labels ("Menu", "Links") for better readability
content before the main heading (typically h1) — such as the site header and navigation — is moved to the end of the page below a --- separator, so the actual page content comes first
you can set multiple selectors (CSS-like) to remove unwanted elements from the exported markdown
code block detection and syntax highlighting for popular programming languages
HTML tables are converted to proper markdown tables

Built-in web server:

use --serve-markdown=<dir> to start a built-in HTTP server that renders your markdown export as styled HTML pages with tables, dark/light mode, breadcrumb navigation, and accordion support — perfect for browsing and sharing the export locally or on a network
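
A minimal sketch of the export-then-browse workflow follows. The function names and the `CRAWLER` variable are hypothetical, and the sketch assumes that `--serve-markdown` can be passed on its own, without `--url`.

```shell
#!/bin/sh
# Path or name of the crawler binary (hypothetical variable for this sketch).
CRAWLER="${CRAWLER:-siteone-crawler}"

# Export a site as multi-file markdown plus one combined file.
# $1 = start URL, $2 = export directory, $3 = combined markdown file
export_markdown() {
  "$CRAWLER" --url="$1" \
    --markdown-export-dir="$2" \
    --markdown-export-single-file="$3" \
    --markdown-move-content-before-h1-to-end
}

# Browse a previous export through the built-in HTTP server.
# $1 = export directory
serve_markdown() {
  "$CRAWLER" --serve-markdown="$1"
}

# Example calls (commented out so that sourcing this file starts nothing):
# export_markdown https://mydomain.tld/ tmp/mydomain.tld.md tmp/mydomain.tld.combined.md
# serve_markdown tmp/mydomain.tld.md
```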

💡 Tip: you can push the exported markdown folder to your GitHub repository, where it will be automatically rendered as browsable documentation. See the examples of websites converted to markdown.

See all available markdown exporter options.

🗺️ Sitemap generator

will help you create a sitemap.xml and sitemap.txt for your website
you can set the priority of individual pages based on the number of slashes in the URL

Don't hesitate to try it. You will love it as much as we do! ❤️

🚀 Installation

📦 Pre-built binaries

Download pre-built binaries from 🐙 GitHub releases for all major platforms (🐧 Linux, 🪟 Windows, 🍎 macOS, x64 & arm64).

The binary is self-contained — no runtime dependencies required.

# Linux / macOS — download, extract, run
./siteone-crawler --url=https://my.domain.tld

Note for macOS users: If macOS refuses to start the crawler from your Downloads folder, use the terminal to move the entire crawler folder to another location, for example your home folder (~).

🍺 Homebrew (macOS / Linux)

brew install janreges/tap/siteone-crawler
siteone-crawler --url=https://my.domain.tld

🐧 Debian / Ubuntu (apt)

curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.deb.sh' | sudo -E bash
sudo apt-get install siteone-crawler

🎩 Fedora / RHEL (dnf)

curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.rpm.sh' | sudo -E bash
sudo dnf install siteone-crawler

🦎 openSUSE / SLES (zypper)

curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.rpm.sh' | sudo -E bash
sudo zypper install siteone-crawler

🏔️ Alpine Linux (apk)

curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.alpine.sh' | sudo -E bash
sudo apk add siteone-crawler

🔨 Build from source

Requires Rust 1.85 or later.

git clone https://github.com/janreges/siteone-crawler.git
cd siteone-crawler

# Build optimized release binary
cargo build --release

# Run
./target/release/siteone-crawler --url=https://my.domain.tld
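
Before building, it can be handy to check the installed toolchain against the Rust 1.85 minimum. The helper below is a hypothetical sketch that compares only the major and minor version numbers; the function name is not part of the project.

```shell
#!/bin/sh
# Succeeds when the given Rust version (e.g. "1.85.0") is 1.85 or later.
rust_is_new_enough() {
  major=${1%%.*}     # text before the first dot
  rest=${1#*.}       # text after the first dot
  minor=${rest%%.*}  # text up to the next dot
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 85 ]; }
}

# Example call: `rustc --version` prints e.g. "rustc 1.85.0 (...)",
# so the second field is the version string.
# rust_is_new_enough "$(rustc --version | awk '{print $2}')" \
#   || echo "Rust 1.85 or later is required" >&2
```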

▶️ Usage

Interactive wizard

Run the binary without any arguments and an interactive wizard will guide you through the configuration. Choose from 10 preset modes, enter the target URL, fine-tune settings with arrow keys, and the crawler starts immediately — no need to remember CLI flags.

? Choose a crawl mode:
❯ Quick Audit               Fast site health overview — crawls all pages and assets
  SEO Analysis               Extract titles, descriptions, keywords, and OpenGraph tags
  Performance Test           Measure response times with cache disabled — find bottlenecks
  Security Check             Check SSL/TLS, security headers, and redirects site-wide
  Offline Clone              Download entire website with all assets for offline browsing
  Markdown Export            Convert pages to Markdown for AI models or documentation
  Stress Test                High-concurrency load test with cache-busting random params
  Single Page                Deep analysis of a single URL — SEO, security, performance
  Large Site Crawl           High-throughput HTML-only crawl for large sites (100k+ pages)
  Custom                     Start from defaults and configure every option manually
  ──────────────────────────────────────
  Browse offline export      Serve a previously exported offline site via HTTP
  Browse markdown export     Serve a previously exported markdown site via HTTP
[↑↓ to move, enter to select, type to filter]

After selecting a preset and entering the URL, the wizard shows a settings form where you can adjust workers, timeout, content types, export options, and more. A configuration summary with the equivalent CLI command is displayed before the crawl starts — copy it for future use without the wizard.

If existing offline or markdown exports are detected in ./tmp/, the wizard also offers to serve them via the built-in HTTP server directly from the menu.

Basic example

To run the crawler from the command line, provide the required arguments:

./siteone-crawler --url=https://mydomain.tld/ --device=mobile

CI/CD example

# Fail deployment if quality score < 7.0 or any 5xx errors
./siteone-crawler --url=https://mydomain.tld/ --ci --ci-min-score=7.0 --ci-max-5xx=0
echo $?  # 0 = pass, 10 = fail
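
In a pipeline script, the exit status can be turned into an explicit decision. The sketch below relies only on the two documented codes (0 and 10); treating every other status as an unrelated crawler error is an assumption of this example, and the function name is hypothetical.

```shell
#!/bin/sh
# Translate the crawler's exit status into a pipeline decision.
# 0 and 10 are documented; the catch-all branch is an assumption.
gate_decision() {
  case "$1" in
    0)  echo "pass" ;;
    10) echo "quality-gate-failed" ;;
    *)  echo "crawler-error" ;;
  esac
}

# Example use after a crawl:
# ./siteone-crawler --url=https://mydomain.tld/ --ci --ci-min-score=7.0
# gate_decision "$?"
```

Separating "the gate failed" from "the crawl itself broke" lets a pipeline block the deployment in the first case and retry or alert in the second.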

Fully-featured example

./siteone-crawler --url=https://mydomain.tld/ \
  --output=text \
  --workers=2 \
  --max-reqs-per-sec=10 \
  --memory-limit=2048M \
  --resolve='mydomain.tld:443:127.0.0.1' \
  --timeout=5 \
  --proxy=proxy.mydomain.tld:8080 \
  --http-auth=myuser:secretPassword123 \
  --user-agent="My User-Agent String" \
  --extra-columns='DOM,X-Cache(10),Title(40),Keywords(50),Description(50>),Heading1=xpath://h1/text()(20>),ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)' \
  --accept-encoding="gzip, deflate" \
  --url-column-size=100 \
  --max-queue-length=3000 \
  --max-visited-urls=10000 \
  --max-url-length=5000 \
  --max-non200-responses-per-basename=10 \
  --include-regex="/^.*\/technologies.*/" \
  --include-regex="/^.*\/fashion.*/" \
  --ignore-regex="/^.*\/downloads\/.*\.pdf$/i" \
  --analyzer-filter-regex="/^.*$/i" \
  --remove-query-params \
  --add-random-query-params \
  --transform-url="live-site.com -> local-site.local" \
  --transform-url="/cdn\.live-site\.com/ -> local-site.local/cdn" \
  --show-scheme-and-host \
  --do-not-truncate-url \
  --output-html-report=tmp/myreport.html \
  --html-report-options="summary,seo-opengraph,visited-urls,security,redirects" \
  --output-json-file=/dir/report.json \
  --output-text-file=/dir/report.txt \
  --add-timestamp-to-output-file \
  --add-host-to-output-file \
  --offline-export-dir=tmp/mydomain.tld \
  --replace-content='/<foo[^>]+>/ -> <bar>' \
  --ignore-store-file-error \
  --sitemap-xml-file=/dir/sitemap.xml \
  --sitemap-txt-file=/dir/sitemap.txt \
  --sitemap-base-priority=0.5 \
  --sitemap-priority-increase=0.1 \
  --markdown-export-dir=tmp/mydomain.tld.md \
  --markdown-export-single-file=tmp/mydomain.tld.combined.md \
  --markdown-move-content-before-h1-to-end \
  --markdown-disable-images \
  --markdown-disable-files \
  --markdown-remove-links-and-images-from-single-file \
  --markdown-exclude-selector='.exclude-me' \
  --markdown-replace-content='/<foo[^>]+>/ -> <bar>' \
  --markdown-replace-query-string='/[a-z]+=[^&]*(&|$)/i -> $1__$2' \
  --mail-to=your.name@my-mail.tld \
  --mail-to=your.friend.name@my-mail.tld \
  --mail-from=crawler@my-mail.tld \
  --mail-from-name="SiteOne Crawler" \
  --mail-subject-template="Crawler Report for %domain% (%date%)" \
  --mail-smtp-host=smtp.my-mail.tld \
  --mail-smtp-port=25 \
  --mail-smtp-user=smtp.user \
  --mail-smtp-pass=secretPassword123 \
  --ci --ci-min-score=7.0 --ci-min-security=8.0

⚙️ Arguments

For a clearer overview, see the documentation: 🌐 https://crawler.siteone.io/configuration/command-line-options/

Basic settings

Parameter	Description
`--url=<url>`	Required. HTTP or HTTPS URL address of the website or sitemap xml to be crawled. Use quotation marks `''` if the URL contains query parameters.
`--single-page`	Load only one page to which the URL is given (and its assets), but do not follow other pages.
`--max-depth=<int>`	Maximum crawling depth (for pages, not assets). Default is `0` (no limit). `1` means `/about` or `/about/`, `2` means `/about/contacts` etc.
`--device=<val>`	Device type for choosing a predefined User-Agent. Ignored when `--user-agent` is defined. Supported values: `desktop`, `mobile`, `tablet`. Default is `desktop`.
`--user-agent=<val>`	Custom User-Agent header. Use quotation marks. If specified, it takes precedence over the device parameter. If you add `!` at the end, the siteone-crawler/version will not be added as a signature at the end of the final user-agent.
`--timeout=<int>`	Request timeout in seconds. Default is `5`.
`--proxy=<host:port>`	HTTP proxy to use in `host:port` format. Host can be hostname, IPv4 or IPv6.
`--http-auth=<user:pass>`	Basic HTTP authentication in `username:password` format.
`--config-file=<file>`	Load CLI options from a config file. One option per line, `#` comments allowed. Without this flag, auto-discovers `~/.siteone-crawler.conf` or `/etc/siteone-crawler.conf`. CLI arguments override config file values.
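
A config file for `--config-file` might be created as sketched below. The exact line syntax is an assumption here: each line is written the same way the option would appear on the command line. The file path and the option values are hypothetical.

```shell
#!/bin/sh
# Hypothetical location; pass it explicitly via --config-file.
conf_file="${TMPDIR:-/tmp}/siteone-crawler.example.conf"

# One option per line, "#" starts a comment.
cat > "$conf_file" <<'EOF'
# Shared defaults for scheduled audits
--workers=2
--max-reqs-per-sec=5
--timeout=10
EOF

# Count the option lines in the file (comments excluded).
grep -c '^--' "$conf_file"   # prints 3

# Example use; the CLI value --workers=4 would override the file's --workers=2:
# siteone-crawler --url=https://my.domain.tld --config-file="$conf_file" --workers=4
```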

Output settings

Parameter	Description
`--output=<val>`	Output type. Supported values: `text`, `json`. Default is `text`.
`--extra-columns=<values>`	Comma delimited list of extra columns added to output table. You can specify HTTP headers (e.g. `X-Cache`), predefined values (`Title`, `Keywords`, `Description`, `DOM`), or custom extraction from text files (HTML, JS, CSS, TXT, JSON, XML, etc.) using XPath or regexp. For custom extraction, use the format `Custom_column_name=method:pattern#group(length)`, where `method` is `xpath` or `regexp`, `pattern` is the extraction pattern, an optional `#group` specifies the capturing group (or node index for XPath) to return (defaulting to the entire match or first node), and an optional `(length)` sets the maximum output length (append `>` to disable truncation). For example, use `Heading1=xpath://h1/text()(20>)` to extract the text of the first H1 element from the HTML document, and `ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)` to extract a numeric price (e.g., "29.99") from a string like "Price: $29.99".
`--url-column-size=<num>`	Basic URL column width. By default, it is calculated from the size of your terminal window.
`--rows-limit=<num>`	Max. number of rows to display in tables with analysis results. Default is `200`.
`--timezone=<val>`	Timezone for datetimes in HTML reports and timestamps in output folders/files, e.g. `Europe/Prague`. Default is `UTC`.
`--do-not-truncate-url`	In the text output, long URLs are truncated by default to `--url-column-size` so the table does not wrap due to long URLs. With this option, you can turn off the truncation.
`--show-scheme-and-host`	On text output, show scheme and host also for origin domain URLs.
`--hide-progress-bar`	Hide progress bar visible in text and JSON output for a more compact view.
`--hide-columns=<list>`	Hide specified columns from the progress table. Comma-separated list of column names: `type`, `time`, `size`, `cache`. Example: `--hide-columns=cache` or `--hide-columns=cache,type`.
`--no-color`	Disable colored output.
`--force-color`	Force colored output regardless of support detection.
`--show-inline-criticals`	Show criticals from the analyzer directly in the URL table.
`--show-inline-warnings`	Show warnings from the analyzer directly in the URL table.
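
The `ProductPrice` regexp from the `--extra-columns` description contains `\$`, so quoting matters when the value is typed into a POSIX shell. The sketch below shows the difference between single and double quotes, then approximates the extraction with `sed`. The `sed` pattern is a rough POSIX ERE translation for illustration only and does not involve the crawler's own regex engine; the function name is hypothetical.

```shell
#!/bin/sh
# Single quotes pass the pattern through untouched.
single='regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)'
# Inside double quotes the shell removes the backslash before "$".
double="regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)"

printf 'single: %s\n' "$single"   # keeps \$?
printf 'double: %s\n' "$double"   # contains $? with the backslash gone

# Approximate the capture of group 1 with POSIX ERE.
extract_price() {
  printf '%s\n' "$1" |
    sed -nE 's/.*[Pp]rice:[[:space:]]*\$?([0-9]+(\.[0-9]{2})?).*/\1/p'
}

extract_price 'Price: $29.99'   # prints 29.99
```

Single quotes are the safer default for any option value that carries a regular expression.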

Resource filtering

Parameter	Description
`--disable-all-assets`	Disables crawling of all assets and files and only crawls pages in href attributes. Shortcut for calling all other `--disable-*` flags.
`--disable-javascript`	Disables JavaScript downloading and removes all JavaScript code from HTML, including `onclick` and other `on*` handlers.
`--disable-styles`	Disables CSS file downloading and at the same time removes all style definitions by `<style>` tag or inline by style attributes.
`--disable-fonts`	Disables font downloading and also removes all font/font-face definitions from CSS.
`--disable-images`	Disables downloading of all images and replaces found images in HTML with placeholder image only.
`--disable-files`	Disables downloading of any files (typically downloadable documents) to which various links point.
`--remove-all-anchor-listeners`	Remove all event listeners from links on the page. Useful for sites built on modern JS frameworks that compose content dynamically (React, Svelte, Vue, Angular, etc.).

Advanced crawler settings

Parameter	Description
`--workers=<int>`	Maximum number of concurrent workers (threads). Crawler will not make more simultaneous requests to the server than this number. Use carefully! A high number of workers can cause a DoS attack. Default is `3`.
`--max-reqs-per-sec=<val>`	Max requests/s for whole crawler. Be careful not to cause a DoS attack. Default value is `10`.
`--memory-limit=<size>`	Memory limit in units `M` (Megabytes) or `G` (Gigabytes). Default is `2048M`.
`--resolve=<host:port:ip>`	Custom DNS resolution in `domain:port:ip` format. Same as curl --resolve. Can be specified multiple times.
`--allowed-domain-for-external-files=<domain>`	Enable loading of file content from another domain (e.g. CDN). Can be specified multiple times. Use `*` for all domains.
`--allowed-domain-for-crawling=<domain>`	Allow crawling of other listed domains — typically language mutations on other domains. Can be specified multiple times. Use wildcards like `*.mysite.tld`.
`--single-foreign-page`	When crawling of other domains is allowed, ensures that only the linked page and its assets are crawled from foreign domains.
`--include-regex=<regex>`	PCRE-compatible regular expression for URLs that should be included. Can be specified multiple times. Example: `--include-regex='/^\/public\//'`
`--ignore-regex=<regex>`	PCRE-compatible regular expression for URLs that should be ignored. Can be specified multiple times.
`--regex-filtering-only-for-pages`	Apply `*-regex` rules only to page URLs, not static assets.
`--analyzer-filter-regex`	PCRE-compatible regular expression for filtering analyzers by name.
`--accept-encoding=<val>`	Custom `Accept-Encoding` request header. Default is `gzip, deflate, br`.
`--remove-query-params`	Remove query parameters from found URLs.
`--add-random-query-params`	Add random query parameters to each URL to bypass caches.
`--transform-url=<from->to>`	Transform URLs before crawling. Use `from -> to` for simple replacement or `/regex/ -> replacement`. Can be specified multiple times.
`--force-relative-urls`	Normalize all discovered URLs matching the initial domain (incl. www variant and protocol differences) to canonical form. Prevents duplicate files in offline export when the site uses inconsistent URL formats (http/https, www/non-www).
`--ignore-robots-txt`	Ignore robots.txt content.
`--http-cache-dir=<dir>`	Cache dir for HTTP responses. Disable with `--http-cache-dir='off'` or `--no-cache`. Default is `~/.cache/siteone-crawler/http-cache` (XDG-compliant, respects `$XDG_CACHE_HOME`).
`--http-cache-compression`	Enable compression for HTTP cache storage.
`--http-cache-ttl=<val>`	TTL for HTTP cache entries (e.g. `1h`, `7d`, `30m`). Use `0` for infinite. Default is `24h`.
`--no-cache`	Disable HTTP cache completely. Shortcut for `--http-cache-dir='off'`.
`--max-queue-length=<num>`	Maximum length of the waiting URL queue. Default is `9000`.
`--max-visited-urls=<num>`	Maximum number of visited URLs. Default is `10000`.
`--max-skipped-urls=<num>`	Maximum number of skipped URLs. Default is `10000`.
`--max-url-length=<num>`	Maximum supported URL length in chars. Default is `2083`.
`--max-non200-responses-per-basename=<num>`	Protection against looping with dynamic non-200 URLs. Default is `5`.

File export settings

Parameter	Description
`--output-html-report=<file>`	Save HTML report into that file. Set to empty `''` to disable HTML report. By default saved into `tmp/%domain%.report.%datetime%.html`.
`--html-report-options=<sections>`	Comma-separated list of sections to include in HTML report. Available sections: `summary`, `seo-opengraph`, `image-gallery`, `video-gallery`, `visited-urls`, `dns-ssl`, `crawler-stats`, `crawler-info`, `headers`, `content-types`, `skipped-urls`, `external-links`, `caching`, `best-practices`, `accessibility`, `security`, `redirects`, `404-pages`, `slowest-urls`, `fastest-urls`, `source-domains`. Default: all sections.
`--output-json-file=<file>`	File path for JSON output. Set to empty `''` to disable JSON file. By default saved into `tmp/%domain%.output.%datetime%.json`. See JSON Output Documentation for format details.
`--output-text-file=<file>`	File path for TXT output. Set to empty `''` to disable TXT file. By default saved into `tmp/%domain%.output.%datetime%.txt`. See Text Output Documentation for format details.
`--add-timestamp-to-output-file`	Append timestamp to output filenames (HTML report, JSON, TXT) except sitemaps.
`--add-host-to-output-file`	Append initial URL host to output filenames (HTML report, JSON, TXT) except sitemaps.

Default output directory: Report files are saved into ./tmp/ in the current working directory. If ./tmp/ cannot be created (e.g. read-only filesystem), the crawler falls back to the platform's XDG data directory (~/.local/share/siteone-crawler/ on Linux, ~/Library/Application Support/siteone-crawler/ on macOS, %APPDATA%\siteone-crawler\ on Windows) and prints a notice to stderr.
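
The fallback described above can be pictured roughly as follows. This is a shell illustration of the documented behavior, not the crawler's actual Rust code, and it covers only the Linux fallback path; the function name is hypothetical.

```shell
#!/bin/sh
# Print the directory that report files would end up in.
# $1 = working directory (defaults to the current one)
resolve_output_dir() {
  base="${1:-.}"
  if mkdir -p "$base/tmp" 2>/dev/null && [ -w "$base/tmp" ]; then
    printf '%s\n' "$base/tmp"
  else
    # Notice goes to stderr so stdout carries only the resulting path.
    echo "notice: $base/tmp is not writable, using fallback directory" >&2
    printf '%s\n' "$HOME/.local/share/siteone-crawler"
  fi
}

# Example call:
# resolve_output_dir .
```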

Mailer options

Parameter	Description
`--mail-to=<email>`	Recipients of HTML e-mail reports. Required for mailer activation. You can specify multiple emails separated by commas.
`--mail-from=<email>`	E-mail sender address. Default is `siteone-crawler@your-hostname.com`.
`--mail-from-name=<val>`	E-mail sender name. Default is `SiteOne Crawler`.
`--mail-subject-template=<val>`	E-mail subject template. You can use `%domain%`, `%date%` and `%datetime%`. Default is `Crawler Report for %domain% (%date%)`.
`--mail-smtp-host=<host>`	SMTP host for sending emails. Default is `localhost`.
`--mail-smtp-port=<port>`	SMTP port for sending emails. Default is `25`.
`--mail-smtp-user=<user>`	SMTP user, if your SMTP server requires authentication.
`--mail-smtp-pass=<pass>`	SMTP password, if your SMTP server requires authentication.

Upload options

Parameter	Description
`--upload`	Enable HTML report upload to `--upload-to`.
`--upload-to=<url>`	URL of the endpoint where to send the HTML report. Default is `https://crawler.siteone.io/up`.
`--upload-retention=<val>`	How long should the HTML report be kept in the online version? Values: 1h / 4h / 12h / 24h / 3d / 7d / 30d / 365d / forever. Default is `30d`.
`--upload-password=<val>`	Optional password (user will be 'crawler') to display the online HTML report.
`--upload-timeout=<int>`	Upload timeout in seconds. Default is `3600`.

Offline exporter options

Parameter	Description
`--offline-export-dir=<dir>`	Path to directory where to save the offline version of the website.
`--offline-export-store-only-url-regex=<regex>`	Debug: store only URLs matching these PCRE regexes. Can be specified multiple times.
`--offline-export-remove-unwanted-code=<1/0>`	Remove unwanted code for offline mode (analytics, social networks, etc.). Default is `1`.
`--offline-export-no-auto-redirect-html`	Disable automatic creation of redirect HTML files for subfolders containing `index.html`.
`--offline-export-preserve-url-structure`	Preserve the original URL path structure. E.g. `/about` is stored as `about/index.html` instead of `about.html`. Useful for web server deployment where the clone should maintain the same URL hierarchy as the original site.
`--replace-content=<val>`	Replace content in HTML/JS/CSS with `foo -> bar` or PCRE regexp. Can be specified multiple times.
`--replace-query-string=<val>`	Replace characters in query string filenames. Can be specified multiple times.
`--offline-export-lowercase`	Convert all filenames to lowercase for offline export. Useful for case-insensitive filesystems.
`--ignore-store-file-error`	Ignore any file storing errors and continue.
`--disable-astro-inline-modules`	Disable inlining of Astro module scripts for offline export. Scripts will remain as external files with corrected relative paths.

Markdown exporter options

Parameter	Description
`--markdown-export-dir=<dir>`	Path to directory where to save the markdown version of the website.
`--markdown-export-single-file=<file>`	Path to a file for combined markdown. Requires `--markdown-export-dir`.
`--markdown-move-content-before-h1-to-end`	Move content before main H1 heading to the end of the markdown.
`--markdown-disable-images`	Do not export and show images in markdown files.
`--markdown-disable-files`	Do not export files other than HTML/CSS/JS/fonts/images (e.g. PDF, ZIP).
`--markdown-remove-links-and-images-from-single-file`	Remove links and images from combined single file.
`--markdown-exclude-selector=<val>`	Exclude DOM elements by CSS selector from markdown export. Can be specified multiple times.
`--markdown-replace-content=<val>`	Replace text content with `foo -> bar` or PCRE regexp. Can be specified multiple times.
`--markdown-replace-query-string=<val>`	Replace characters in query string filenames. Can be specified multiple times.
`--markdown-export-store-only-url-regex=<regex>`	Debug: store only URLs matching these PCRE regexes. Can be specified multiple times.
`--markdown-ignore-store-file-error`	Ignore any file storing errors and continue.

Sitemap options

Parameter	Description
`--sitemap-xml-file=<file>`	File path for generated XML Sitemap. Extension `.xml` added if not specified.
`--sitemap-txt-file=<file>`	File path for generated TXT Sitemap. Extension `.txt` added if not specified.
`--sitemap-base-priority=<num>`	Base priority for XML sitemap. Default is `0.5`.
`--sitemap-priority-increase=<num>`	Priority increase based on slashes in URL. Default is `0.1`.

Expert options

Parameter	Description
`--debug`	Activate debug mode.
`--debug-log-file=<file>`	Log file for debug messages. When set without `--debug`, logging is active without visible output.
`--debug-url-regex=<regex>`	Regex for URL(s) to debug. Can be specified multiple times.
`--result-storage=<val>`	Result storage type. Values: `memory` or `file`. Use `file` for large websites. Default is `memory`.
`--result-storage-dir=<dir>`	Directory for `--result-storage=file`. Default is `tmp/result-storage`.
`--result-storage-compression`	Enable compression for results storage.
`--http-cache-dir=<dir>`	Cache dir for HTTP responses. Disable with `--http-cache-dir='off'` or `--no-cache`. Default is `~/.cache/siteone-crawler/http-cache` (XDG-compliant, respects `$XDG_CACHE_HOME`).
`--http-cache-compression`	Enable compression for HTTP cache storage.
`--http-cache-ttl=<val>`	TTL for HTTP cache entries (e.g. `1h`, `7d`, `30m`). Use `0` for infinite. Default is `24h`.
`--websocket-server=<host:port>`	Start crawler with websocket server on given host:port.
`--console-width=<int>`	Enforce a fixed console width, disabling automatic detection.
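The `--http-cache-ttl` examples above (`1h`, `7d`, `30m`, `0`) follow a number-plus-unit pattern. The Python sketch below shows one way such values could be converted to seconds. It is an illustration only, not the crawler's own parser: the `parse_ttl` helper and the accepted unit set (`s`, `m`, `h`, `d`) are assumptions.

```python
import re
from typing import Optional

# Hypothetical unit table; the crawler may accept a different set of suffixes.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_ttl(value: str) -> Optional[int]:
    """Return the TTL in seconds, or None when the value is "0" (infinite)."""
    value = value.strip().lower()
    if value == "0":
        return None  # "0" means cache entries never expire
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match is None:
        raise ValueError(f"unsupported TTL format: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]
```

Under these assumptions, the documented default `24h` corresponds to 86400 seconds.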

Fastest URL analyzer

Parameter	Description
`--fastest-urls-top-limit=<int>`	Number of URLs in TOP fastest list. Default is `20`.
`--fastest-urls-max-time=<val>`	Maximum response time for a URL to be considered fast. Default is `1`.

SEO and OpenGraph analyzer

Parameter	Description
`--max-heading-level=<int>`	Max heading level from 1 to 6 for analysis. Default is `3`.

Slowest URL analyzer

Parameter	Description
`--slowest-urls-top-limit=<int>`	Number of URLs in TOP slowest list. Default is `20`.
`--slowest-urls-min-time=<val>`	Minimum response time threshold for slow URLs. Default is `0.01`.
`--slowest-urls-max-time=<val>`	Maximum response time for very slow evaluation. Default is `3`.

Built-in HTTP server

Browse exported markdown or offline HTML files through a local web server with a built-in viewer.

Parameter	Description
`--serve-markdown=<dir>`	Start built-in HTTP server for browsing a markdown export directory. Renders `.md` files as styled HTML with tables, accordions, dark/light mode, and breadcrumb navigation.
`--serve-offline=<dir>`	Start built-in HTTP server for browsing an offline HTML export directory. Serves static files with Content-Security-Policy restricting assets to the same origin.
`--serve-port=<int>`	Port for the built-in HTTP server. Default is `8321`.
`--serve-bind-address=<addr>`	Bind address for the built-in HTTP server. Default is `127.0.0.1` (localhost only). Use `0.0.0.0` to listen on all network interfaces.

Example:

```bash
# Browse markdown export
./siteone-crawler --serve-markdown=./exports/markdown

# Browse offline export on custom port, accessible from network
./siteone-crawler --serve-offline=./exports/offline --serve-port=9000 --serve-bind-address=0.0.0.0
```

CI/CD settings

Parameter	Description
`--ci`	Enable CI/CD quality gate. Crawler exits with code 10 if thresholds are not met. Default file outputs (HTML, JSON, TXT reports) are suppressed unless explicitly requested via `--output-*` options.
`--ci-min-score=<val>`	Minimum overall quality score (0.0-10.0). Default is `5.0`.
`--ci-min-performance=<val>`	Minimum Performance category score (0.0-10.0). Default is `5.0`.
`--ci-min-seo=<val>`	Minimum SEO category score (0.0-10.0). Default is `5.0`.
`--ci-min-security=<val>`	Minimum Security category score (0.0-10.0). Default is `5.0`.
`--ci-min-accessibility=<val>`	Minimum Accessibility category score (0.0-10.0). Default is `3.0`.
`--ci-min-best-practices=<val>`	Minimum Best Practices category score (0.0-10.0). Default is `5.0`.
`--ci-max-404=<int>`	Maximum number of 404 responses allowed. Default is `0`.
`--ci-max-5xx=<int>`	Maximum number of 5xx server error responses allowed. Default is `0`.
`--ci-max-criticals=<int>`	Maximum number of critical analysis findings allowed. Default is `0`.
`--ci-max-warnings=<int>`	Maximum number of warning analysis findings allowed. Not checked by default.
`--ci-max-avg-response=<val>`	Maximum average response time in seconds. Not checked by default.
`--ci-min-pages=<int>`	Minimum number of HTML pages that must be found. Default is `10`.
`--ci-min-assets=<int>`	Minimum number of assets (JS, CSS, images, fonts) that must be found. Default is `10`.
`--ci-min-documents=<int>`	Minimum number of documents (PDF, etc.) that must be found. Default is `0` (not checked).

Default behavior with --ci alone: overall score >= 5.0, each category score >= 5.0 (Performance, SEO, Security, Best Practices) and Accessibility >= 3.0, 404 errors <= 0, 5xx errors <= 0, critical findings <= 0, HTML pages >= 10, assets >= 10. File outputs (HTML, JSON, TXT reports) are not generated. To save reports in CI mode, specify the desired output explicitly, e.g. --ci --output-html-report=report.html.
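The default thresholds listed above can be pictured as a set of minimum and maximum checks. The Python sketch below is a simplified model of that idea, not the crawler's implementation: the `Check` class, the `evaluate_gate` and `gate_exit_code` helpers, and most metric names are hypothetical, while the default values are copied from the table.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Check:
    metric: str
    operator: str  # ">=" for minimums, "<=" for maximums
    threshold: float
    actual: float

    @property
    def passed(self) -> bool:
        if self.operator == ">=":
            return self.actual >= self.threshold
        return self.actual <= self.threshold


# Defaults taken from the CI/CD settings table; the key names are illustrative.
DEFAULT_MINIMUMS = {
    "Overall score": 5.0,
    "Performance": 5.0,
    "SEO": 5.0,
    "Security": 5.0,
    "Accessibility": 3.0,
    "Best Practices": 5.0,
    "HTML pages": 10,
    "Assets": 10,
}
DEFAULT_MAXIMUMS = {
    "404 errors": 0,
    "5xx errors": 0,
    "Critical findings": 0,
}


def evaluate_gate(actuals: Dict[str, float]) -> List[Check]:
    """Build one check per default threshold from the measured values."""
    checks = [Check(m, ">=", t, actuals[m]) for m, t in DEFAULT_MINIMUMS.items()]
    checks += [Check(m, "<=", t, actuals[m]) for m, t in DEFAULT_MAXIMUMS.items()]
    return checks


def gate_exit_code(checks: List[Check]) -> int:
    """Exit code 10 signals a failed gate, 0 a passed one."""
    return 0 if all(check.passed for check in checks) else 10
```

In this model a single failing check, such as two critical findings against a maximum of zero, is enough to produce exit code 10.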

🏆 Quality Scoring

The crawler automatically calculates a quality score (0.0-10.0) across 5 weighted categories:

Category	Weight	What it measures
Performance	20%	Response times, slow URLs
SEO	20%	Missing H1, title uniqueness, meta descriptions, 404s, redirects
Security	25%	SSL/TLS certificates, security headers, unsafe protocols
Accessibility	20%	Lang attribute, image alt text, form labels, ARIA, heading levels
Best Practices	15%	Duplicate/large SVGs, deep DOM, Brotli/WebP support

The overall score is a weighted average of all categories. Scores are displayed in a colored box in the console output and included in JSON and HTML report outputs.
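To make the weighting concrete, here is a small Python sketch that applies the weights from the table to a set of category scores. The `overall_score` helper is hypothetical, and rounding the result to one decimal place is an assumption based on how scores are displayed.

```python
from typing import Dict

# Category weights from the table above (they sum to 1.0).
WEIGHTS = {
    "Performance": 0.20,
    "SEO": 0.20,
    "Security": 0.25,
    "Accessibility": 0.20,
    "Best Practices": 0.15,
}


def overall_score(category_scores: Dict[str, float]) -> float:
    """Weighted average of the five category scores, rounded to one decimal."""
    total = sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
    return round(total, 1)
```

Because Security carries the largest weight, a change of one point in that category moves the overall score more than the same change in any other category.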

Score labels:

9.0-10.0 — Excellent (green)
7.0-8.9 — Good (blue)
5.0-6.9 — Fair (yellow)
3.0-4.9 — Poor (purple)
0.0-2.9 — Critical (red)
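The label ranges above can be expressed as a simple lookup. This Python sketch assumes scores are already rounded to one decimal place, so each range is tested by its lower bound; the `score_label` helper is hypothetical.

```python
def score_label(score: float) -> str:
    """Map a 0.0-10.0 quality score to its label from the list above."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 9.0:
        return "Excellent"
    if score >= 7.0:
        return "Good"
    if score >= 5.0:
        return "Fair"
    if score >= 3.0:
        return "Poor"
    return "Critical"
```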

🔄 CI/CD Integration

The --ci flag enables a quality gate that evaluates configurable thresholds after crawling completes. When any threshold is not met, the crawler exits with code 10 (distinct from exit code 1 for runtime errors). In CI mode, default file outputs (HTML, JSON, TXT reports) are automatically suppressed — only the console output and exit code matter. If you need report files in CI, specify them explicitly (e.g. --output-html-report=report.html).

Bonus: Cache warming — running the crawler as a post-deployment step in your CI/CD pipeline crawls every page and asset on your site, which populates the HTML/asset cache on your reverse proxy (Varnish, Nginx) or CDN (Cloudflare, CloudFront). This way, the first real visitors always hit a warm cache instead of cold origin requests.

Exit codes

Code	Meaning
`0`	Success (with `--ci` this also means all quality thresholds passed)
`1`	Runtime error
`2`	Help/version displayed
`3`	No pages crawled (e.g. DNS failure, timeout, connection refused)
`10`	CI/CD quality gate failed
`101`	Configuration error
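A wrapper script can use these exit codes to decide how to react, for example to distinguish a failed quality gate from a runtime error. The Python sketch below is one possible approach; the `describe_exit_code` and `run_crawler` helpers are hypothetical, and the `./siteone-crawler` path assumes the binary sits in the current directory as in the examples in this README.

```python
import subprocess
from typing import List

# Meanings copied from the exit code table above.
EXIT_CODES = {
    0: "Success",
    1: "Runtime error",
    2: "Help/version displayed",
    3: "No pages crawled",
    10: "CI/CD quality gate failed",
    101: "Configuration error",
}


def describe_exit_code(code: int) -> str:
    """Return the documented meaning of an exit code."""
    return EXIT_CODES.get(code, f"Unknown exit code {code}")


def run_crawler(args: List[str]) -> int:
    """Run the crawler with the given arguments and return its exit code."""
    completed = subprocess.run(["./siteone-crawler", *args], check=False)
    return completed.returncode
```

A pipeline step could call `run_crawler(["--url=https://staging.example.com", "--ci"])` and, for instance, block a deployment only when the result is `10` while reporting other non-zero codes as infrastructure problems.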

Example: GitHub Actions

```yaml
- name: Check website quality
  run: |
    ./siteone-crawler \
      --url=https://staging.example.com \
      --ci \
      --ci-min-score=7.0 \
      --ci-min-security=8.0 \
      --ci-max-404=0 \
      --ci-max-5xx=0
```

Example: GitLab CI

```yaml
quality_check:
  script:
    - ./siteone-crawler --url=$STAGING_URL --ci --ci-min-score=6.0
  allow_failure: false
```

Console output

When --ci is enabled, a quality gate box is displayed after the quality scores:

```text
╔══════════════════════════════════════════════════════════════╗
║                      CI/CD QUALITY GATE                      ║
╠══════════════════════════════════════════════════════════════╣
║  [PASS] Overall score: 7.2 >= 5                              ║
║  [PASS] 404 errors: 0 <= 0                                   ║
║  [PASS] 5xx errors: 0 <= 0                                   ║
║  [FAIL] Critical findings: 2 > 0 (max: 0)                    ║
╠══════════════════════════════════════════════════════════════╣
║  RESULT: FAIL (1 of 4 checks failed) — exit code 10          ║
╚══════════════════════════════════════════════════════════════╝
```

JSON output

When using --output=json --ci, the JSON includes a ciGate object:

```json
{
  "ciGate": {
    "passed": false,
    "exitCode": 10,
    "checks": [
      {"metric": "Overall score", "operator": ">=", "threshold": 5.0, "actual": 7.2, "passed": true},
      {"metric": "404 errors", "operator": "<=", "threshold": 0.0, "actual": 0.0, "passed": true},
      {"metric": "Critical findings", "operator": "<=", "threshold": 0.0, "actual": 2.0, "passed": false}
    ]
  }
}
```
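The `ciGate` object is convenient for custom reporting in a pipeline. The Python sketch below extracts the failed checks from a JSON report shaped like the sample above. The `failed_checks` and `format_check` helpers are hypothetical, and the code relies only on the fields shown in the sample.

```python
import json
from typing import Any, Dict, List


def failed_checks(report_json: str) -> List[Dict[str, Any]]:
    """Return the ciGate checks that did not pass (empty if there is no gate)."""
    report = json.loads(report_json)
    gate = report.get("ciGate")
    if gate is None:
        return []  # report was produced without --ci
    return [check for check in gate["checks"] if not check["passed"]]


def format_check(check: Dict[str, Any]) -> str:
    """Render one check as a single human-readable line."""
    return (
        f"{check['metric']}: actual {check['actual']}, "
        f"required {check['operator']} {check['threshold']}"
    )
```

Applied to the sample above, only the "Critical findings" check would be returned, which a pipeline could post as a comment or annotation.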

📄 Output Examples

To understand the richness of the data provided by the crawler, you can examine real output examples generated from crawling crawler.siteone.io:

- Text Output Example: docs/OUTPUT-crawler.siteone.io.txt
  - Provides a human-readable summary suitable for quick review.
  - See the detailed Text Output Documentation.
- JSON Output Example: docs/OUTPUT-crawler.siteone.io.json
  - Provides structured data ideal for programmatic consumption and detailed analysis.
  - See the detailed JSON Output Documentation.

These examples showcase the various tables and metrics generated, demonstrating the tool's capabilities in analyzing website structure, performance, SEO, security, and more.

🧪 Testing

```bash
# unit tests + offline integration tests
cargo test

# network integration tests (crawls crawler.siteone.io)
cargo test --test integration_crawl -- --ignored --test-threads=1
```

Unit tests live in each source file (#[cfg(test)] mod tests). Integration tests are in tests/integration_crawl.rs — network-dependent tests are #[ignore] by default so that cargo test stays fast and offline.

⚠️ Disclaimer

Please use responsibly and ensure that you have the necessary permissions when crawling websites. Some sites may have rules against automated access detailed in their robots.txt.

The author is not responsible for any consequences caused by inappropriate use or deliberate misuse of this tool.

📜 License

This work is licensed under the terms stated in the LICENSE file.

Powered by

Package repository hosting is graciously provided by Cloudsmith. Cloudsmith is the only fully hosted, cloud-native, universal package management solution, that enables your organization to create, store and share packages in any format, to any place, with total confidence.
