SiteOne Crawler: JSON Output Documentation


This document describes the structure and content of the JSON output file generated by the SiteOne Crawler. This JSON file contains detailed information about the crawled website, including metadata about the crawl process, results for each visited URL, quality scores, summary findings, and various analysis tables.

1. Introduction

The JSON output provides a comprehensive dataset about the crawled website. Key information includes:

  • Crawl Metadata: Details about the crawler execution, such as version, execution time, command used, hostname, and the final user agent.
  • Options: A complete record of all CLI configuration values used for the crawl.
  • Quality Scores: Overall and per-category quality scores (0-10) with deduction details.
  • Visited URL Results: For each URL visited during the crawl:
    • URL address
    • HTTP status code
    • Elapsed time for the request (performance)
    • Size of the response body
    • Content type (HTML, CSS, JS, Image, etc.)
    • Caching information (cache flags, lifetime)
    • Additional analysis results stored in the extras field.
  • Stats: Aggregate statistics about the crawl (total URLs, sizes, timings, status code counts).
  • Summary: A list of findings (OK, Warning, Critical, Info) that feed into quality scoring.
  • Analysis Tables: Aggregated data and specific findings presented in structured tables:
    • Skipped URLs: Reasons why certain URLs were not crawled (e.g., external domain, disallowed by robots.txt, specific rules).
    • Redirects: List of URLs that resulted in redirects (3xx status codes).
    • 404 Errors: List of URLs that resulted in a 404 Not Found status.
    • SSL/TLS Info: Details about the website's SSL certificate (issuer, subject, validity dates, supported protocols).
    • Performance: Tables listing the fastest and slowest URLs encountered during the crawl.
    • SEO & Content:
      • SEO metadata (title, description, keywords, H1, indexing directives) for HTML pages.
      • OpenGraph and Twitter Card metadata.
      • Heading structure analysis (correctness of H1-H6 hierarchy).
      • Analysis of non-unique titles and descriptions across pages.
    • Technical Details:
      • HTTP Headers: Summary of headers found, their occurrences, and unique values.
      • Caching Analysis: Breakdown of caching strategies by content type and domain.
      • DNS Information: DNS resolution details for the target domain.
      • Security Analysis: Evaluation of security-related HTTP headers.
      • External URLs: List of external URLs discovered during the crawl.
    • Crawler Statistics: Performance metrics for the crawler itself, individual analyzers, and content processors.

2. Potential Use Cases

The detailed data within the JSON output enables a wide variety of use cases:

  1. Comprehensive SEO Audits: Analyze titles, descriptions, heading structures, indexing status, and OpenGraph tags across the entire site.
  2. Performance Monitoring & Optimization: Identify the slowest pages and resources, analyze load times, and check caching headers.
  3. Broken Link Checking: Easily extract lists of all 404 errors and the pages where they were found.
  4. Redirect Chain Analysis: Identify and analyze redirect chains.
  5. Security Header Audits: Verify the implementation of crucial security headers (CSP, HSTS, X-Frame-Options, etc.) across the site.
  6. Content Inventory & Analysis: Get a list of all crawled resources, their types, sizes, and status codes. Analyze content type distribution.
  7. Website Archiving/Cloning: While the crawler has a dedicated offline export, the JSON contains the list of all discovered resources, which could inform a custom archiving process.
  8. Competitive Analysis: Run the crawler on competitor sites (respecting their robots.txt) to gather insights into their structure, performance, and technology.
  9. CI/CD Integration: Integrate the crawler into deployment pipelines to automatically check for new errors (404s, performance regressions) after deployments. Use quality scores and thresholds for automated pass/fail decisions.
  10. Technical Debt Assessment: Identify outdated practices, missing security headers, or performance issues that need addressing.

3. Detailed JSON Structure

The JSON output has 8 top-level keys:

3.1. crawler (Object)

Contains metadata about the crawler execution:

  • name (String): Name of the crawler software.
  • version (String): Version of the crawler.
  • executedAt (String): Timestamp when the crawl was executed, in the format "YYYY-MM-DD HH:MM:SS" (space separator, no timezone). Example: "2026-03-16 14:55:13".
  • command (String): The command-line arguments used to run the crawl.
  • hostname (String): The hostname where the crawler was run.
  • finalUserAgent (String): The User-Agent string used for the HTTP requests.
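
Because executedAt uses a space separator and carries no timezone, it does not parse as strict ISO 8601 in every tool. A minimal sketch of reading it in Python follows; the helper name parse_executed_at and the sample object are illustrative, and the result is a naive datetime whose timezone the consumer has to decide.

```python
from datetime import datetime

def parse_executed_at(crawler):
    """Parse crawler.executedAt ("YYYY-MM-DD HH:MM:SS") into a naive datetime."""
    return datetime.strptime(crawler["executedAt"], "%Y-%m-%d %H:%M:%S")

# Hypothetical sample object; only the timestamp format comes from the documentation.
sample_crawler = {"executedAt": "2026-03-16 14:55:13"}
executed_at = parse_executed_at(sample_crawler)
```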

3.2. extraColumnsFromAnalysis (Array)

An array of objects defining extra columns that might be added during specific analyses. These are primarily intended for augmenting report outputs. Each object contains:

  • name (String): The display name of the column.
  • length (Integer): Suggested display length/width.
  • truncate (Boolean): Whether the content should be truncated if it exceeds the length.
  • customMethod, customPattern, customGroup: Fields used for custom data extraction logic (null when not configured).

3.3. options (Object)

A flat object containing all 132 CLI configuration values used for the crawl. Every option from the command line (or its default value) is recorded here. Keys are the option names in camelCase (e.g., url, workers, maxReqsPerSec, timeout, outputType, userAgent, acceptEncoding). Values are strings, integers, booleans, or null, depending on the option type.

This is useful for reproducing a crawl or understanding the exact configuration that produced the results.
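
Since options is a flat object, two crawls can be compared key by key to see which settings changed. The sketch below assumes both options objects are already loaded as Python dicts; the function name diff_options and the sample values are illustrative, not part of the crawler.

```python
def diff_options(old, new):
    """Return {key: (old_value, new_value)} for every option whose value differs."""
    keys = old.keys() | new.keys()
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(keys)
        if old.get(key) != new.get(key)
    }

# Hypothetical option values, for illustration only.
previous_run = {"url": "https://example.com/", "workers": 3, "timeout": 5}
current_run = {"url": "https://example.com/", "workers": 5, "timeout": 5}
changed = diff_options(previous_run, current_run)
```

A key that is missing on one side is reported with None for that side, which cannot be told apart from an option explicitly set to null.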

3.4. qualityScores (Object)

Contains overall and per-category quality scores computed after analysis.

  • overall (Object): The aggregate quality score.

    • score (Float): Overall score from 0.0 to 10.0.
    • label (String): Human-readable label (e.g., "A+", "A", "B", "C", "D", "F").
    • weight (Float): Total weight (1.0 for overall).
    • deductions (Array): Array of objects, each with:
      • points (Float): Number of points deducted.
      • reason (String): Explanation for the deduction.
  • categories (Array): Array of 5 category objects, each with:

    • code (String): Category identifier. One of: "performance", "seo", "security", "accessibility", "bestPractices".
    • name (String): Human-readable category name.
    • score (Float): Category score from 0.0 to 10.0.
    • label (String): Human-readable label.
    • weight (Float): Weight of this category in the overall score (e.g., 0.20 for SEO, 0.25 for Security).
    • deductions (Array): Array of deduction objects (same structure as overall deductions).
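
For pass/fail decisions in a pipeline, the scores can be compared against thresholds chosen by the consumer. This is a sketch under that assumption: the thresholds, the function names, and the sample scores, labels, and deduction text are all hypothetical.

```python
def failing_categories(quality_scores, min_score):
    """Return the codes of categories whose score is below min_score."""
    return [c["code"] for c in quality_scores["categories"] if c["score"] < min_score]

def passes_gate(quality_scores, min_overall, min_category):
    """True when the overall score and every category score meet their thresholds."""
    overall_ok = quality_scores["overall"]["score"] >= min_overall
    return overall_ok and not failing_categories(quality_scores, min_category)

# Hypothetical sample; scores, labels, and the deduction reason are illustrative.
sample_scores = {
    "overall": {"score": 8.4, "label": "B", "weight": 1.0, "deductions": []},
    "categories": [
        {"code": "seo", "name": "SEO", "score": 9.1, "label": "A",
         "weight": 0.20, "deductions": []},
        {"code": "security", "name": "Security", "score": 6.5, "label": "C",
         "weight": 0.25,
         "deductions": [{"points": 3.5, "reason": "hypothetical deduction"}]},
    ],
}
```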

3.5. results (Array)

An array of objects, where each object represents a single visited URL.

  • url (String): The absolute URL that was visited.
  • status (String): The HTTP status code returned (e.g., "200", "404").
  • elapsedTime (Float): Time taken to fetch the URL in seconds (e.g., 0.005).
  • size (Integer): Size of the response body in bytes (e.g., 50961).
  • type (Integer): An enum representing the detected content type:
    • 1: HTML
    • 2: JavaScript
    • 3: CSS
    • 4: Image
    • 7: Document (e.g., robots.txt)
    • 8: JSON
    • Other types may exist (Audio, Font, Video, XML, Redirect, Other).
  • cacheTypeFlags (Integer): Bitmask representing detected caching mechanisms (e.g., Cache-Control, ETag, Last-Modified). For example, 31 typically means Cache-Control + ETag + Last-Modified are all present. 32768 might indicate no caching headers found.
  • cacheLifetime (Integer): Cache lifetime in seconds derived from Cache-Control: max-age or Expires header. 0 if no lifetime could be determined.
  • extras (Array): Contains additional data from specific analyzers run on this URL. Typically an empty array [].
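
The fields above can be combined to pick out slow HTML pages. The sketch below maps only the enum values listed in this section and labels anything else as unknown; the function names, threshold, and sample entries are illustrative.

```python
# Enum values documented above; integers for the other types are not listed there.
CONTENT_TYPE_NAMES = {1: "HTML", 2: "JavaScript", 3: "CSS", 4: "Image", 7: "Document", 8: "JSON"}

def content_type_name(type_id):
    """Translate the integer type enum into a readable name."""
    return CONTENT_TYPE_NAMES.get(type_id, f"Unknown({type_id})")

def slow_html_pages(results, threshold_seconds):
    """Return (url, elapsedTime) pairs for HTML results above the threshold, slowest first."""
    slow = [
        (r["url"], r["elapsedTime"])
        for r in results
        if r["type"] == 1 and r["elapsedTime"] > threshold_seconds
    ]
    return sorted(slow, key=lambda pair: pair[1], reverse=True)

# Hypothetical sample entries shaped like items of the results array.
sample_results = [
    {"url": "https://example.com/", "status": "200", "elapsedTime": 0.005,
     "size": 50961, "type": 1, "cacheTypeFlags": 31, "cacheLifetime": 3600, "extras": []},
    {"url": "https://example.com/slow", "status": "200", "elapsedTime": 1.2,
     "size": 1024, "type": 1, "cacheTypeFlags": 31, "cacheLifetime": 0, "extras": []},
    {"url": "https://example.com/app.js", "status": "200", "elapsedTime": 2.0,
     "size": 2048, "type": 2, "cacheTypeFlags": 31, "cacheLifetime": 0, "extras": []},
]
```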

3.6. stats (Object)

Aggregate statistics about the entire crawl:

  • totalUrls (Integer): Total number of URLs visited.
  • totalSize (Integer): Total size of all responses in bytes.
  • totalSizeFormatted (String): Human-readable formatted total size (e.g., "31.33 MB").
  • totalExecutionTime (Float): Total wall-clock execution time in seconds.
  • totalRequestsTimes (Float): Sum of all individual request times in seconds.
  • totalRequestsTimesAvg (Float): Average request time in seconds.
  • totalRequestsTimesMin (Float): Minimum request time in seconds.
  • totalRequestsTimesMax (Float): Maximum request time in seconds.
  • countByStatus (Object): An object mapping HTTP status codes to counts. Keys are status code strings (e.g., "200", "404", "429"), values are integers. Only status codes that were actually encountered appear as keys.
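
Because countByStatus keys are strings, grouping by status class means looking at the first character rather than comparing numbers. A minimal sketch, assuming the keys are ordinary three-digit HTTP codes; the function name and the counts are illustrative.

```python
def error_ratio(stats):
    """Share of visited URLs whose status code starts with 4 or 5."""
    counts = stats["countByStatus"]
    total = sum(counts.values())
    errors = sum(n for code, n in counts.items() if code[:1] in ("4", "5"))
    return errors / total if total else 0.0

# Hypothetical counts, for illustration only.
sample_stats = {"countByStatus": {"200": 95, "404": 4, "429": 1}}
ratio = error_ratio(sample_stats)
```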

3.7. summary (Object)

Contains a list of summary findings that feed into quality scoring.

  • items (Array): Array of finding objects, each with:
    • aplCode (String): A unique code identifying the finding (e.g., "s201", "s404", "s502").
    • status (String): Severity level. One of: "CRITICAL", "WARNING", "OK", "INFO".
    • text (String): Human-readable description of the finding (e.g., "Brotli is supported for HTML", "1 URL(s) returned a 404 status code").
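
The status field makes it straightforward to tally findings or derive an exit code for a pipeline. The sketch below treats CRITICAL as failing by default, which is a policy assumption rather than crawler behavior; the pairing of aplCode values with texts in the sample is also illustrative.

```python
from collections import Counter

def count_by_severity(summary):
    """Count summary findings per status value."""
    return Counter(item["status"] for item in summary["items"])

def ci_exit_code(summary, fail_on=("CRITICAL",)):
    """Return 1 if any finding has a status listed in fail_on, otherwise 0."""
    return 1 if any(item["status"] in fail_on for item in summary["items"]) else 0

# Hypothetical findings; the code/text pairings are not taken from real output.
sample_summary = {
    "items": [
        {"aplCode": "s201", "status": "OK", "text": "Brotli is supported for HTML"},
        {"aplCode": "s404", "status": "WARNING", "text": "1 URL(s) returned a 404 status code"},
    ]
}
```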

3.8. tables (Object)

An object where each key is a table identifier (e.g., skipped-summary, 404, seo) and the value is an object describing that table. Each table object contains:

  • aplCode (String): A unique code for the table.
  • title (String): A human-readable title for the table.
  • columns (Object): An object describing the columns of the table. Each key is a column identifier (e.g., reason, url, statusCode). The value is an object detailing the column:
    • aplCode (String): Unique code for the column.
    • name (String): Display name for the column header.
    • width (Integer): Suggested display width (-1 might mean auto).
    • formatter (Object | null): Defines how the data should be formatted (e.g., adding units like 'ms' or 'kB'). Empty object {} indicates default formatting.
    • renderer (Object | null): Defines how the data should be rendered (e.g., adding color or links). Empty object {} indicates default rendering.
    • truncateIfLonger (Boolean): Whether to truncate the value if it exceeds the width.
    • Other fields like formatterWillChangeValueLength, nonBreakingSpaces, escapeOutputHtml, getDataValueCallback, forcedDataType provide more hints for rendering.
  • rows (Array): An array of objects, where each object represents a row in the table. The keys in each row object correspond to the column identifiers defined in columns. Important: All values in all table rows are strings, regardless of whether the data represents a number, count, or other type. For example, a count of 51 appears as "51", a request time of 0.003 appears as "0.003", and an empty value appears as "". Rows may also contain extra keys beyond the declared columns (see individual table descriptions for details).
  • position (String): A hint about where this table should typically be positioned in a report (e.g., before-url-table, after-url-table).

Note: The specific content and structure within tables depend on the analyzers enabled during the crawl. The set of tables may vary depending on what data was encountered (e.g., certificate-info only appears for HTTPS sites).
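
Since every row value is a string, consumers usually convert selected columns before sorting or summing. A sketch of one way to do that, with the caller naming which columns are numeric; the function name typed_rows and the sample rows are illustrative.

```python
def typed_rows(rows, int_columns=(), float_columns=()):
    """Return copies of rows with the named columns converted; "" becomes None."""
    converters = [(int_columns, int), (float_columns, float)]
    typed = []
    for row in rows:
        out = dict(row)  # copy so the original row is left untouched
        for columns, convert in converters:
            for col in columns:
                if col in out:
                    out[col] = convert(out[col]) if out[col] != "" else None
        typed.append(out)
    return typed

# Hypothetical rows shaped like the fastest-urls table.
sample_rows = [
    {"requestTime": "0.003", "statusCode": "200", "url": "https://example.com/"},
    {"requestTime": "", "statusCode": "404", "url": "https://example.com/missing"},
]
converted = typed_rows(sample_rows, int_columns=("statusCode",), float_columns=("requestTime",))
```

Naming the numeric columns explicitly avoids guessing at values such as page titles that happen to look like numbers.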

4. JSON Schema (Draft)

This is a draft JSON schema based on the actual output. It may need refinement for edge cases.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "SiteOne Crawler JSON Output",
  "description": "Schema for the JSON output file generated by SiteOne Crawler.",
  "type": "object",
  "properties": {
    "crawler": {
      "description": "Metadata about the crawler execution.",
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "version": { "type": "string" },
        "executedAt": { "type": "string", "description": "Format: YYYY-MM-DD HH:MM:SS" },
        "command": { "type": "string" },
        "hostname": { "type": "string" },
        "finalUserAgent": { "type": "string" }
      },
      "required": ["name", "version", "executedAt", "command", "hostname", "finalUserAgent"]
    },
    "extraColumnsFromAnalysis": {
      "description": "Definitions for extra columns used in analyses.",
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "length": { "type": "integer" },
          "truncate": { "type": "boolean" },
          "customMethod": { "type": ["string", "null"] },
          "customPattern": { "type": ["string", "null"] },
          "customGroup": { "type": ["string", "null"] }
        },
        "required": ["name", "length", "truncate"]
      }
    },
    "options": {
      "description": "All CLI configuration values used for the crawl.",
      "type": "object",
      "additionalProperties": true
    },
    "qualityScores": {
      "description": "Overall and per-category quality scores.",
      "type": "object",
      "properties": {
        "overall": {
          "type": "object",
          "properties": {
            "score": { "type": "number" },
            "label": { "type": "string" },
            "weight": { "type": "number" },
            "deductions": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "points": { "type": "number" },
                  "reason": { "type": "string" }
                },
                "required": ["points", "reason"]
              }
            }
          },
          "required": ["score", "label", "weight", "deductions"]
        },
        "categories": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "code": { "type": "string", "enum": ["performance", "seo", "security", "accessibility", "bestPractices"] },
              "name": { "type": "string" },
              "score": { "type": "number" },
              "label": { "type": "string" },
              "weight": { "type": "number" },
              "deductions": {
                "type": "array",
                "items": {
                  "type": "object",
                  "properties": {
                    "points": { "type": "number" },
                    "reason": { "type": "string" }
                  },
                  "required": ["points", "reason"]
                }
              }
            },
            "required": ["code", "name", "score", "label", "weight", "deductions"]
          }
        }
      },
      "required": ["overall", "categories"]
    },
    "results": {
      "description": "Array of results for each visited URL.",
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "url": { "type": "string", "format": "uri" },
          "status": { "type": "string" },
          "elapsedTime": { "type": "number" },
          "size": { "type": "integer" },
          "type": { "type": "integer", "description": "Enum for content type (1:HTML, 2:JS, 3:CSS, 4:Image, 7:Document, 8:JSON, ...)" },
          "cacheTypeFlags": { "type": "integer", "description": "Bitmask for caching mechanisms" },
          "cacheLifetime": { "type": "integer", "description": "Cache lifetime in seconds, 0 if undetermined" },
          "extras": {
            "type": "array",
            "description": "Additional analysis data for this URL (typically empty)"
          }
        },
        "required": ["url", "status", "elapsedTime", "size", "type", "cacheTypeFlags", "cacheLifetime", "extras"]
      }
    },
    "stats": {
      "description": "Aggregate crawl statistics.",
      "type": "object",
      "properties": {
        "totalUrls": { "type": "integer" },
        "totalSize": { "type": "integer" },
        "totalSizeFormatted": { "type": "string" },
        "totalExecutionTime": { "type": "number" },
        "totalRequestsTimes": { "type": "number" },
        "totalRequestsTimesAvg": { "type": "number" },
        "totalRequestsTimesMin": { "type": "number" },
        "totalRequestsTimesMax": { "type": "number" },
        "countByStatus": {
          "type": "object",
          "additionalProperties": { "type": "integer" }
        }
      },
      "required": ["totalUrls", "totalSize", "totalSizeFormatted", "totalExecutionTime", "totalRequestsTimes", "totalRequestsTimesAvg", "totalRequestsTimesMin", "totalRequestsTimesMax", "countByStatus"]
    },
    "summary": {
      "description": "Summary findings that feed into quality scoring.",
      "type": "object",
      "properties": {
        "items": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "aplCode": { "type": "string" },
              "status": { "type": "string", "enum": ["CRITICAL", "WARNING", "OK", "INFO"] },
              "text": { "type": "string" }
            },
            "required": ["aplCode", "status", "text"]
          }
        }
      },
      "required": ["items"]
    },
    "tables": {
      "description": "Aggregated analysis results presented as tables.",
      "type": "object",
      "additionalProperties": {
        "type": "object",
        "properties": {
          "aplCode": { "type": "string" },
          "title": { "type": "string" },
          "columns": {
            "type": "object",
            "additionalProperties": {
              "type": "object",
              "properties": {
                "aplCode": { "type": "string" },
                "name": { "type": "string" },
                "width": { "type": "integer" },
                "formatter": { "type": ["object", "null"] },
                "renderer": { "type": ["object", "null"] },
                "truncateIfLonger": { "type": "boolean" }
              },
              "required": ["aplCode", "name", "width"]
            }
          },
          "rows": {
            "type": "array",
            "items": {
              "type": "object",
              "description": "All row values are strings. Rows may contain extra keys beyond the declared columns.",
              "additionalProperties": { "type": "string" }
            }
          },
          "position": { "type": "string", "enum": ["before-url-table", "after-url-table"] }
        },
        "required": ["aplCode", "title", "columns", "rows", "position"]
      }
    }
  },
  "required": ["crawler", "extraColumnsFromAnalysis", "options", "qualityScores", "results", "stats", "summary", "tables"]
}
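
As a lightweight complement to the schema, the top-level shape can be checked with the standard library alone. This sketch only verifies that the eight required keys exist and have the container type the schema declares; it is not a substitute for full schema validation, and the function name is illustrative.

```python
# Top-level keys and container types, as declared in the draft schema above.
EXPECTED_TYPES = {
    "crawler": dict,
    "extraColumnsFromAnalysis": list,
    "options": dict,
    "qualityScores": dict,
    "results": list,
    "stats": dict,
    "summary": dict,
    "tables": dict,
}

def check_top_level(report):
    """Return a list of human-readable problems; an empty list means the shape looks right."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in report:
            problems.append(f"missing key: {key}")
        elif not isinstance(report[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems
```

A typical use would be to load the output file with json.load and refuse to process it further when the returned list is not empty.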

5. Analysis Tables Description (tables key)

This section details the structure and columns of each table found under the tables key in the JSON output.

Important note on data types: All values in all table rows are strings. Numeric values such as counts, times, and sizes are serialized as strings (e.g., "51" not 51, "0.003" not 0.003). Empty values appear as "". This applies to every table described below. Where column descriptions say "count" or "time", the value is still a string representation of that number.

Some tables include extra row keys beyond the declared columns. These are noted in the individual table descriptions.

5.1. skipped-summary (Skipped URLs Summary)

Provides a summary of skipped URLs grouped by domain and reason.

Column Description
reason A human-readable string describing why URLs from this domain were skipped (e.g., "Not allowed host", "Blocked by robots.txt").
domain The domain name whose URLs were skipped.
count The number of unique URLs skipped for this domain and reason.

5.2. skipped (Skipped URLs)

Lists individual URLs that were skipped during the crawl.

Column Description
reason A human-readable string describing why the URL was skipped (e.g., "Not allowed host", "Blocked by robots.txt", "File extension is not allowed").
url The URL that was skipped.
sourceAttr A string describing the HTML attribute where the skipped URL was found (e.g., "<a href>", "<link href>", "<script src>").
sourceUqId The URL path of the page where the skipped URL was discovered (e.g., "/", "/docs/getting-started"). This allows linking back to the source page.

5.3. redirects (Redirected URLs)

Lists URLs that resulted in an HTTP redirect (3xx status code).

Column Description
statusCode The specific redirect status code (e.g., "301", "302").
url The original URL that redirected.
targetUrl The target URL to which the original URL redirected.
sourceUqId URL path of the page where the redirected URL was found.
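
Redirect chains can be reconstructed by following url to targetUrl across rows. The sketch below assumes targetUrl is written in the same absolute form as url, which this document does not confirm; relative targets would need resolving first. The function name and sample rows are illustrative.

```python
def redirect_chain(rows, start_url, max_hops=10):
    """Follow url -> targetUrl links from start_url; stops on a loop or after max_hops."""
    targets = {row["url"]: row["targetUrl"] for row in rows}
    chain = [start_url]
    seen = {start_url}
    current = start_url
    while current in targets and len(chain) <= max_hops:
        current = targets[current]
        chain.append(current)
        if current in seen:  # redirect loop: keep the repeated URL and stop
            break
        seen.add(current)
    return chain

# Hypothetical rows shaped like the redirects table.
sample_redirects = [
    {"statusCode": "301", "url": "https://example.com/a", "targetUrl": "https://example.com/b", "sourceUqId": "/"},
    {"statusCode": "302", "url": "https://example.com/b", "targetUrl": "https://example.com/c", "sourceUqId": "/"},
]
```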

5.4. 404 (404 URLs)

Lists URLs that resulted in a "404 Not Found" status code.

Column Description
statusCode The HTTP status code (typically "404").
url The URL that resulted in the 404 error.
sourceUqId URL path of the page where the broken URL was found.
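
For broken link reports it is often more useful to list the 404 URLs per source page. A minimal sketch that groups rows by sourceUqId; the function name and sample rows are illustrative.

```python
from collections import defaultdict

def broken_links_by_source(rows):
    """Group 404 URLs by the page (sourceUqId) on which they were found."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["sourceUqId"]].append(row["url"])
    return dict(grouped)

# Hypothetical rows shaped like the 404 table.
sample_404 = [
    {"statusCode": "404", "url": "https://example.com/old", "sourceUqId": "/"},
    {"statusCode": "404", "url": "https://example.com/gone", "sourceUqId": "/docs/getting-started"},
    {"statusCode": "404", "url": "https://example.com/missing.png", "sourceUqId": "/"},
]
by_source = broken_links_by_source(sample_404)
```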

5.5. certificate-info (SSL/TLS info)

Provides details about the SSL/TLS certificate of the crawled domain.

Column Description
info The name of the certificate attribute (e.g., "Issuer", "Subject", "Valid from", "Valid to", "Supported protocols", "RAW certificate output", "RAW protocols output").
value The value of the corresponding certificate attribute. Always a string. For multi-line values like raw certificate or protocol output, the entire content is a single string with embedded newlines.

5.6. fastest-urls (TOP fastest URLs)

Lists the URLs with the lowest request times encountered during the crawl.

Column Description
requestTime The time taken to fetch the URL in seconds (e.g., "0.003").
statusCode The HTTP status code of the URL (e.g., "200").
url The URL itself.

5.7. slowest-urls (TOP slowest URLs)

Lists the URLs with the highest request times encountered during the crawl.

Column Description
requestTime The time taken to fetch the URL in seconds (e.g., "1.234").
statusCode The HTTP status code of the URL (e.g., "200").
url The URL itself.

5.8. seo (SEO metadata)

Provides SEO-related metadata extracted from HTML pages.

Column Description
urlPathAndQuery The path and query string of the URL.
indexing A string describing the indexing status (e.g., "index, follow", "noindex, follow").
title The content of the <title> tag, or empty string if not found.
h1 The content of the first <h1> tag found, or empty string.
description The content of the meta name="description" tag, or empty string.
keywords The content of the meta name="keywords" tag, or empty string.

Extra row keys (present in each row object but not declared as columns):

  • robotsIndex (String): Whether the page allows indexing (e.g., "1" for index, "0" for noindex).
  • deniedByRobotsTxt (String): Whether the page is denied by robots.txt (e.g., "0" for allowed, "1" for denied).

5.9. open-graph (OpenGraph metadata)

Provides Open Graph and Twitter Card metadata extracted from HTML pages.

Column Description
urlPathAndQuery The path and query string of the URL.
ogTitle Content of the og:title meta tag, or empty string.
ogDescription Content of the og:description meta tag, or empty string.
ogImage Content of the og:image meta tag, or empty string.
twitterTitle Content of the twitter:title meta tag, or empty string.
twitterDescription Content of the twitter:description meta tag, or empty string.
twitterImage Content of the twitter:image meta tag, or empty string.

5.10. seo-headings (Heading structure)

Provides analysis of the heading (H1-H6) structure for each HTML page.

Column Description
headings A formatted string representation of the heading structure showing hierarchy and potential errors (e.g., "OK H1, H2, H2, H3" or "ERR H1, H3 (skipped H2)").
headingsCount Total number of headings found on the page (e.g., "5").
headingsErrorsCount Number of structural errors found in the headings (e.g., "0", "2").
urlPathAndQuery The path and query string of the URL.

Extra row key:

  • headingsHtml (String): An HTML string containing the full heading tree with markup (e.g., "<b>H1</b> Title<br><b>H2</b> Section..."). Useful for rendering a visual heading tree in reports.

5.11. headers (HTTP headers)

Summarizes the HTTP response headers encountered across all crawled URLs.

Column Description
header The name of the HTTP header.
occurrences The total number of times this header was found (e.g., "73").
uniqueValues The count of distinct values found for this header, as a string (e.g., "3").
valuesPreview A preview string showing some of the values encountered (truncated if many).
minValue The minimum value found (relevant for numerical or date headers), or empty string.
maxValue The maximum value found, or empty string.

5.12. headers-values (HTTP header values)

Lists unique values for each HTTP header and their occurrence count.

Column Description
header The name of the HTTP header.
occurrences The number of times this specific value occurred for this header (e.g., "51").
value The specific unique value of the HTTP header.

5.13. caching-per-content-type (HTTP Caching by content type)

Analyzes caching effectiveness grouped by general content type (HTML, Image, JS, CSS, etc.).

Column Description
contentType The general content type category (e.g., "HTML", "Image", "JS").
cacheType Description of the caching mechanism detected (e.g., "Cache-Control + ETag + Last-Modified", "No cache headers").
count Number of URLs matching this content type and cache type.
avgLifetime Average cache lifetime in seconds for URLs in this group, or empty string if not determinable.
minLifetime Minimum cache lifetime in seconds, or empty string.
maxLifetime Maximum cache lifetime in seconds, or empty string.

5.14. caching-per-domain (HTTP Caching by domain)

Analyzes caching effectiveness grouped by domain.

Column Description
domain The domain name.
cacheType Description of the caching mechanism detected.
count Number of URLs from this domain matching this cache type.
avgLifetime Average cache lifetime in seconds, or empty string.
minLifetime Minimum cache lifetime in seconds, or empty string.
maxLifetime Maximum cache lifetime in seconds, or empty string.

5.15. caching-per-domain-and-content-type (HTTP Caching by domain and content type)

Analyzes caching effectiveness grouped by both domain and general content type.

Column Description
domain The domain name.
contentType The general content type category.
cacheType Description of the caching mechanism detected.
count Number of URLs matching this domain, content type, and cache type.
avgLifetime Average cache lifetime in seconds, or empty string.
minLifetime Minimum cache lifetime in seconds, or empty string.
maxLifetime Maximum cache lifetime in seconds, or empty string.

5.16. non-unique-titles (TOP non-unique titles)

Lists page titles that appear on more than one page.

Column Description
count The number of pages sharing this title.
title The non-unique page title.

5.17. non-unique-descriptions (TOP non-unique descriptions)

Lists meta descriptions that appear on more than one page.

Column Description
count The number of pages sharing this description.
description The non-unique meta description content.

5.18. best-practices (Best practices)

Summarizes the results of various best practice checks performed by analyzers.

Column Description
analysisName The name of the specific best practice check (e.g., "Large inline SVGs", "Heading structure", "Brotli support").
ok Count of URLs passing this check.
notice Count of URLs with a notice-level finding.
warning Count of URLs with a warning-level finding.
critical Count of URLs with a critical-level finding.

5.19. accessibility (Accessibility)

Summarizes the results of accessibility checks.

Column Description
analysisName The name of the specific accessibility check (e.g., "Missing image alt attributes", "Missing html lang attribute", "ARIA roles and landmarks").
ok Count of elements/pages passing this check.
notice Count of notice-level findings.
warning Count of warning-level findings.
critical Count of critical-level findings.

5.20. source-domains (Source domains)

Provides statistics about the domains from which resources were loaded.

Column Description
domain The domain name.
totals A summary string showing total count, size, and time for resources from this domain (e.g., "67/30MB/6.2s").
HTML Summary string (count/size/time) for HTML resources from this domain.
Image Summary string for Image resources.
JS Summary string for JavaScript resources.
CSS Summary string for CSS resources.
Document Summary string for Document resources (e.g., robots.txt).

Extra row keys (dynamic, present when data exists):

  • Audio, Font, JSON, Other, Redirect, Video, XML (String): Summary strings for additional content types, included only when resources of that type are present.
  • totalCount (String): Total number of resources loaded from this domain.

Note: The set of content type columns is dynamic. The declared columns (HTML, Image, JS, CSS, Document) are always present, but additional content type columns appear in row data based on what resource types were actually encountered during the crawl.

5.21. content-types (Content types)

Summarizes statistics grouped by general content type.

Column Description
contentType The general content type category (e.g., "HTML", "Image").
count Total number of URLs of this content type.
totalSize Total size in bytes for this content type.
totalTime Total time spent fetching resources of this content type.
avgTime Average time spent fetching a resource of this content type.
status20x Count of URLs with a 2xx status code.
status40x Count of URLs with a 4xx status code.

Note: The status columns are dynamic. Additional columns like status42x (for HTTP 429) or status30x, status50x may appear depending on which status codes were actually encountered during the crawl. These dynamic columns will also be declared in the table's columns object.
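
Because the status columns vary between crawls, a consumer can discover them from the table's columns object instead of hard-coding names. The sketch below assumes every such column identifier starts with "status", as in the examples above; the function names are illustrative and the column definitions in the sample are left empty for brevity.

```python
def status_columns(table):
    """Return the declared column identifiers that start with "status", sorted."""
    return sorted(key for key in table["columns"] if key.startswith("status"))

def status_totals(table):
    """Sum each status column over all rows, skipping missing or empty cells."""
    totals = {}
    for col in status_columns(table):
        totals[col] = sum(
            int(row[col]) for row in table["rows"] if row.get(col, "") != ""
        )
    return totals

# Hypothetical, abbreviated table shaped like content-types.
sample_table = {
    "columns": {"contentType": {}, "count": {}, "status20x": {}, "status40x": {}, "status42x": {}},
    "rows": [
        {"contentType": "HTML", "count": "10", "status20x": "8", "status40x": "1", "status42x": "1"},
        {"contentType": "Image", "count": "5", "status20x": "5", "status40x": "0", "status42x": ""},
    ],
}
```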

5.22. content-types-raw (Content types (MIME types))

Summarizes statistics grouped by the specific MIME type reported in the Content-Type HTTP header.

Column Description
contentType The raw MIME type string (e.g., "text/html", "image/svg+xml", "text/html; charset=utf-8").
count Total number of URLs with this MIME type.
totalSize Total size in bytes.
totalTime Total time spent fetching.
avgTime Average time spent fetching.
status20x Count of URLs with a 2xx status code.
status40x Count of URLs with a 4xx status code.

Note: Like content-types, the status columns are dynamic. Additional status columns (e.g., status42x) appear when the corresponding status codes are encountered.

5.23. dns (DNS info)

Shows the DNS resolution information for the crawled domain(s).

Column Description
info A line of text representing part of the DNS resolution (e.g., the domain name, an IP address, the DNS server used). Presented as a simple text tree.

5.24. security (Security)

Summarizes findings related to security HTTP headers.

Column Description
header The name of the security header being analyzed (e.g., "Strict-Transport-Security", "X-Frame-Options", "Content-Security-Policy").
ok Count of URLs where the header was configured correctly.
notice Count of URLs with a notice-level finding.
warning Count of URLs with a warning-level finding.
critical Count of URLs with a critical-level finding.
recommendation A string containing textual recommendations for improving the configuration of this header.

Extra row key:

  • highestSeverity (String): The highest severity level found for this header across all URLs (e.g., "ok", "warning", "critical").
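
The highestSeverity key allows filtering headers without summing the count columns. The sketch below ranks the values "ok", "warning", and "critical" given above and additionally assumes "notice" can occur, mirroring the notice column; values outside this ranking are ignored. The function name and sample rows are illustrative.

```python
# "notice" is an assumed value; the documented examples are "ok", "warning", "critical".
SEVERITY_RANK = {"ok": 0, "notice": 1, "warning": 2, "critical": 3}

def headers_needing_attention(rows, minimum="warning"):
    """Return header names whose highestSeverity is at or above the given level."""
    floor = SEVERITY_RANK[minimum]
    return [
        row["header"]
        for row in rows
        # Unranked severity values map to -1 and are therefore never selected.
        if SEVERITY_RANK.get(row["highestSeverity"], -1) >= floor
    ]

# Hypothetical rows shaped like the security table (count columns omitted).
sample_security = [
    {"header": "Strict-Transport-Security", "highestSeverity": "ok"},
    {"header": "X-Frame-Options", "highestSeverity": "warning"},
    {"header": "Content-Security-Policy", "highestSeverity": "critical"},
]
```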

5.25. analysis-stats (Analysis stats)

Provides performance metrics for individual analyzer methods.

Column Description
classAndMethod The class and method name of the analyzer function.
execTime Total execution time in seconds spent in this method across all relevant URLs/data points.
execCount The number of times this method was executed.

Extra row key:

  • execTimeFormatted (String): Human-readable formatted execution time (e.g., "0.012 s", "1.234 s").

5.26. content-processors-stats (Content processor stats)

Provides performance metrics for content processor methods (HTML, CSS, JS, XML processors that run during the crawl).

Column Description
classAndMethod The class and method name of the content processor function.
execTime Total execution time in seconds spent in this method.
execCount The number of times this method was executed.

Extra row key:

  • execTimeFormatted (String): Human-readable formatted execution time.

5.27. external-urls (External URLs)

Lists external URLs discovered during the crawl along with where they were found.

Column Description
url The external URL that was discovered.
count The number of times this external URL was found across all crawled pages.
foundOn The URL of the page where this external URL was found (typically the first occurrence).

6. Note on Text Output

While this document focuses on the JSON output, SiteOne Crawler also offers a simpler Text output format (--output-text-file). The Text output provides a human-readable summary suitable for quick review in a terminal or text editor.

See the Text Output Documentation for more details on the Text format.