[Feature]: Add llms.txt detection and content-focused scoring criteria #51

@IISweetHeartII

Description

Problem Statement

AX Score currently lacks detection for several important AI discoverability signals, particularly for content-focused websites. The tool doesn't check for:

  1. llms.txt — An emerging standard (llmstxt.org) specifically designed to help LLMs discover and understand site content
  2. Rich JSON-LD schema types — Only checks for JSON-LD presence, not specific schema types that matter for AI understanding (WebSite, BlogPosting, Person, BreadcrumbList, FAQPage)
  3. AI crawler permissions — Whether robots.txt explicitly allows AI crawlers (GPTBot, ClaudeBot, PerplexityBot, anthropic-ai, Google-Extended)
  4. Content feed availability — RSS/Atom feeds for machine-readable content syndication
  5. Semantic HTML quality — Proper heading hierarchy, <article>, <main>, <nav> landmarks

These are the signals that actually determine whether AI agents can discover, understand, and cite a website's content.

Proposed Solution

1. llms.txt Detection (High Priority)

Check: GET /llms.txt
Scoring:
  - File exists and is valid Markdown: +points
  - Contains H1 heading (site name): +points
  - Contains blockquote summary: +points
  - Contains sectioned URL lists: +points
  - Has companion llms-full.txt: +bonus points
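A minimal sketch of this check in Python (the point values and the function name are illustrative placeholders, not part of any existing AX Score API):

```python
import re

# Hypothetical weights -- the real values would be decided by the AX Score maintainers.
def score_llms_txt(content: str, has_full_variant: bool = False) -> int:
    """Score an llms.txt file against the llmstxt.org structure conventions."""
    score = 0
    if re.search(r"^# .+", content, re.MULTILINE):    # H1 heading (site name)
        score += 10
    if re.search(r"^> .+", content, re.MULTILINE):    # blockquote summary
        score += 10
    # Sectioned URL lists: an H2 section followed by markdown links
    if re.search(r"^## .+", content, re.MULTILINE) and \
       re.search(r"^- \[.+?\]\(.+?\)", content, re.MULTILINE):
        score += 10
    if has_full_variant:                              # companion llms-full.txt found
        score += 5
    return score
```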

2. JSON-LD Schema Type Analysis (High Priority)

Instead of just checking if <script type="application/ld+json"> exists, analyze the types present:

| Schema Type | Page Context | Points |
| --- | --- | --- |
| WebSite | Homepage | High |
| BlogPosting / Article | Article pages | High |
| Person / Organization | Any page | Medium |
| BreadcrumbList | All pages | Medium |
| FAQPage | FAQ pages | Medium |
| Uses stable @id references | Cross-page | Bonus |
| sameAs links to social profiles | Author entity | Bonus |
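Extracting the declared types (including nodes nested under @graph) could look roughly like this Python sketch; the regex-based script extraction is a simplification of what a real HTML parser would do:

```python
import json
import re

def extract_jsonld_types(html: str) -> set[str]:
    """Collect every @type declared in JSON-LD script blocks, including @graph members."""
    types: set[str] = set()
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    for block in re.findall(pattern, html, re.DOTALL | re.IGNORECASE):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD earns no credit
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            # A node may wrap its entities in @graph; otherwise treat it as one entity.
            for item in node.get("@graph", [node]):
                if not isinstance(item, dict):
                    continue
                t = item.get("@type")
                if isinstance(t, str):
                    types.add(t)
                elif isinstance(t, list):
                    types.update(t)
    return types
```

The scorer would then award points by intersecting the returned set with the table above.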

3. AI Crawler Permissions (Medium Priority)

Parse robots.txt and check:

  • Does it explicitly allow AI crawler user agents?
  • Are there specific AI crawler rules (not just catch-all)?
  • Known AI crawlers: GPTBot, ChatGPT-User, ClaudeBot, anthropic-ai, PerplexityBot, Google-Extended, Bard, Applebot-Extended, CCBot
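One way to sketch this check in Python (covering a subset of the user agents listed above; the function name and return shape are hypothetical):

```python
AI_CRAWLERS = {"gptbot", "chatgpt-user", "claudebot", "anthropic-ai",
               "perplexitybot", "google-extended", "applebot-extended", "ccbot"}

def ai_crawler_rules(robots_txt: str) -> dict[str, bool]:
    """Map each known AI crawler to whether robots.txt names it in a User-agent line.

    A named crawler implies a deliberate policy (allow or disallow); a site that
    only uses 'User-agent: *' gets no crawler-specific credit.
    """
    named = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("user-agent:"):
            agent = line.split(":", 1)[1].strip().lower()
            named.add(agent)
    return {bot: bot in named for bot in AI_CRAWLERS}
```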

4. Content Feed Detection (Medium Priority)

Check: <link rel="alternate" type="application/rss+xml" ...>
Check: GET /rss.xml, /feed.xml, /atom.xml
Scoring: Feed exists and returns valid XML with entries
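The HTML side of this check could be sketched as follows (Python; regex-based tag matching is a stand-in for a proper HTML parser, and the fallback paths mirror the GET probes proposed above):

```python
import re

# Fallback paths to probe with GET when no feed is declared in the HTML.
COMMON_FEED_PATHS = ["/rss.xml", "/feed.xml", "/atom.xml"]

def declared_feeds(html: str) -> list[str]:
    """Return href values of <link rel="alternate"> tags that advertise an RSS/Atom feed."""
    feeds = []
    for tag in re.findall(r"<link\b[^>]*>", html, re.IGNORECASE):
        low = tag.lower()
        if 'rel="alternate"' not in low:
            continue
        if "application/rss+xml" in low or "application/atom+xml" in low:
            m = re.search(r'href="([^"]+)"', tag, re.IGNORECASE)
            if m:
                feeds.append(m.group(1))
    return feeds
```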

5. Semantic HTML Analysis (Lower Priority)

  • Proper heading hierarchy (h1 → h2 → h3, no skips)
  • Use of <article>, <main>, <nav>, <header>, <footer>
  • Content accessible without JavaScript rendering
  • Meaningful <meta name="description"> present
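The heading-hierarchy rule in particular is mechanical enough to sketch directly (Python; function name is illustrative):

```python
import re

def heading_skips(html: str) -> list[tuple[int, int]]:
    """Return (previous, current) pairs where the heading level jumps by more than one,
    e.g. an <h3> directly after an <h1>. An empty list means no skips."""
    # Opening heading tags only; "</h1>" does not match "<h(".
    levels = [int(m) for m in re.findall(r"<h([1-6])\b", html, re.IGNORECASE)]
    return [(a, b) for a, b in zip(levels, levels[1:]) if b - a > 1]
```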

Alternatives Considered

  • Only add llms.txt: Quick win but misses the bigger picture of content discoverability.
  • Rely on existing Discovery category: The current Discovery check (16% for a well-optimized blog) shows it's not comprehensive enough.

Use Case

As a content creator who has:

  • ✅ Comprehensive JSON-LD (WebSite, BlogPosting, Person, BreadcrumbList, FAQPage schemas)
  • ✅ llms.txt with curated content map
  • ✅ robots.txt allowing 10+ AI crawlers
  • ✅ RSS feed with all published posts
  • ✅ XML sitemap
  • ✅ Rich meta tags (OG, Twitter Cards, AEO tags)

I still get a Discovery score of 16% because the tool doesn't detect most of these signals. This severely undervalues well-optimized content sites and makes the score unreliable for content creators.

Additional Context

The llms.txt standard has growing adoption (~1,000+ domains) and is documented at llmstxt.org. While no major LLM provider has officially confirmed they follow llms.txt during crawling, it's a low-effort, high-signal file that clearly communicates site structure to AI systems.
