## Description

## Problem Statement
AX Score currently lacks detection for several important AI discoverability signals, particularly for content-focused websites. The tool doesn't check for:
- `llms.txt` — An emerging standard (llmstxt.org) specifically designed to help LLMs discover and understand site content
- Rich JSON-LD schema types — The tool only checks for JSON-LD presence, not the specific schema types that matter for AI understanding (`WebSite`, `BlogPosting`, `Person`, `BreadcrumbList`, `FAQPage`)
- AI crawler permissions — Whether robots.txt explicitly allows AI crawlers (GPTBot, ClaudeBot, PerplexityBot, anthropic-ai, Google-Extended)
- Content feed availability — RSS/Atom feeds for machine-readable content syndication
- Semantic HTML quality — Proper heading hierarchy and `<article>`, `<main>`, `<nav>` landmarks
These are the signals that actually determine whether AI agents can discover, understand, and cite a website's content.
## Proposed Solution

### 1. llms.txt Detection (High Priority)
Check: GET /llms.txt
Scoring:
- File exists and is valid Markdown: +points
- Contains H1 heading (site name): +points
- Contains blockquote summary: +points
- Contains sectioned URL lists: +points
- Has companion llms-full.txt: +bonus points
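The scoring above could be sketched roughly as follows. The point weights, function name, and Markdown heuristics are illustrative placeholders, not part of the proposal; the companion `llms-full.txt` bonus would need a separate fetch and is omitted here:

```python
import re

def score_llms_txt(content: str) -> int:
    """Score a fetched llms.txt body against the proposed rubric.

    Point values are placeholders; real weights would be tuned
    against the rest of the AX Score rubric.
    """
    score = 0
    lines = content.splitlines()
    if content.strip():                               # file exists, non-empty
        score += 10
    if any(line.startswith("# ") for line in lines):  # H1 with site name
        score += 10
    if any(line.startswith("> ") for line in lines):  # blockquote summary
        score += 10
    # Sectioned URL lists: an H2 heading followed by Markdown link bullets
    if re.search(r"^## .+\n(?:.*\n)*?- \[.+\]\(.+\)", content, re.M):
        score += 10
    return score
```

A page that satisfies all four structural checks would score the full 40 placeholder points.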
### 2. JSON-LD Schema Type Analysis (High Priority)
Instead of just checking whether `<script type="application/ld+json">` exists, analyze the types present:
| Schema Type | Page Context | Points |
|---|---|---|
| `WebSite` | Homepage | High |
| `BlogPosting` / `Article` | Article pages | High |
| `Person` / `Organization` | Any page | Medium |
| `BreadcrumbList` | All pages | Medium |
| `FAQPage` | FAQ pages | Medium |
| Uses stable `@id` references | Cross-page | Bonus |
| `sameAs` links to social profiles | Author entity | Bonus |
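Extracting the `@type` values could look something like this. This is a sketch, not the proposed implementation: the regex-based `<script>` extraction is a simplification (a real check would use an HTML parser), and `AI_RELEVANT_TYPES` is a hypothetical name for the table above:

```python
import json
import re

AI_RELEVANT_TYPES = {"WebSite", "BlogPosting", "Article", "Person",
                     "Organization", "BreadcrumbList", "FAQPage"}

def extract_schema_types(html: str) -> set[str]:
    """Collect @type values from all JSON-LD blocks in a page,
    handling top-level objects, arrays, and @graph containers."""
    types: set[str] = set()
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    for block in re.findall(pattern, html, re.S | re.I):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue                                   # skip malformed blocks
        nodes = data if isinstance(data, list) else data.get("@graph", [data])
        for node in nodes:
            if not isinstance(node, dict):
                continue
            t = node.get("@type")
            if isinstance(t, str):
                types.add(t)
            elif isinstance(t, list):                  # multi-typed nodes
                types.update(t)
    return types
```

The intersection of the result with `AI_RELEVANT_TYPES` would then map to the point values in the table.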
### 3. AI Crawler Permissions (Medium Priority)
Parse `robots.txt` and check:
- Does it explicitly allow AI crawler user agents?
- Are there specific AI crawler rules (not just catch-all)?
- Known AI crawlers: `GPTBot`, `ChatGPT-User`, `ClaudeBot`, `anthropic-ai`, `PerplexityBot`, `Google-Extended`, `Bard`, `Applebot-Extended`, `CCBot`
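A simplified sketch of the per-crawler check, covering a subset of the agents listed above. Note this deliberately ignores the finer points of robots.txt matching (longest-path rules, `Allow` precedence, groups with multiple `User-agent` lines), which a production check would need:

```python
AI_CRAWLERS = {"gptbot", "chatgpt-user", "claudebot", "anthropic-ai",
               "perplexitybot", "google-extended", "applebot-extended", "ccbot"}

def ai_crawler_access(robots_txt: str) -> dict[str, bool]:
    """Return, per known AI crawler, whether robots.txt grants it access.

    Simplified: a crawler is 'allowed' when its own user-agent group has
    no blanket 'Disallow: /', falling back to the '*' group otherwise.
    """
    groups: dict[str, list[str]] = {}
    current: list[str] = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()           # strip comments
        if ":" not in line:
            continue
        field, value = (p.strip() for p in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            current = groups.setdefault(value.lower(), [])
        elif field == "disallow":
            current.append(value)

    def allowed(agent: str) -> bool:
        rules = groups.get(agent, groups.get("*", []))
        return "/" not in rules                        # no blanket disallow
    return {bot: allowed(bot) for bot in AI_CRAWLERS}
```

Having an explicit per-crawler group (rather than relying on the `*` fallback) is itself a signal the rubric could reward.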
### 4. Content Feed Detection (Medium Priority)
Check: `<link rel="alternate" type="application/rss+xml" ...>`
Check: `GET /rss.xml`, `/feed.xml`, `/atom.xml`
Scoring: feed exists and returns valid XML with entries
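The page-level half of this check could be sketched as below; the regex-based attribute parsing is a simplification of what an HTML parser would do, and the fallback-path probing (the `GET` checks above) is left as a comment since it needs a network layer:

```python
import re

FEED_TYPES = ("application/rss+xml", "application/atom+xml")
# Fallback paths to probe with GET when no <link> is advertised:
COMMON_FEED_PATHS = ("/rss.xml", "/feed.xml", "/atom.xml")

def find_feed_links(html: str) -> list[str]:
    """Extract advertised feed URLs from <link rel="alternate"> tags."""
    links = []
    for tag in re.findall(r"<link\b[^>]*>", html, re.I):
        attrs = dict(re.findall(r'(\w+)="([^"]*)"', tag))
        if (attrs.get("rel", "").lower() == "alternate"
                and attrs.get("type", "").lower() in FEED_TYPES):
            links.append(attrs.get("href", ""))
    return links
```

Any URL found (or any fallback path returning valid XML with entries) would satisfy the scoring criterion.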
### 5. Semantic HTML Analysis (Lower Priority)
- Proper heading hierarchy (h1 → h2 → h3, no skips)
- Use of `<article>`, `<main>`, `<nav>`, `<header>`, `<footer>`
- Content accessible without JavaScript rendering
- Meaningful `<meta name="description">` present
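The heading-hierarchy part of this check is the most mechanical to sketch. A minimal version, assuming a plain tag scan (a real check would walk the DOM):

```python
import re

def heading_skips(html: str) -> list[tuple[int, int]]:
    """Return (from_level, to_level) pairs where the document's heading
    hierarchy skips a level, e.g. an h1 followed directly by an h3."""
    levels = [int(m) for m in re.findall(r"<h([1-6])\b", html, re.I)]
    skips = []
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:            # jumped more than one level down
            skips.append((prev, cur))
    return skips
```

An empty result means the hierarchy is well-formed by this criterion; each skip found would deduct points.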
## Alternatives Considered
- Only add llms.txt: Quick win but misses the bigger picture of content discoverability.
- Rely on existing Discovery category: The current Discovery check (16% for a well-optimized blog) shows it's not comprehensive enough.
## Use Case
As a content creator who has:
- ✅ Comprehensive JSON-LD (WebSite, BlogPosting, Person, BreadcrumbList, FAQPage schemas)
- ✅ `llms.txt` with curated content map
- ✅ robots.txt allowing 10+ AI crawlers
- ✅ RSS feed with all published posts
- ✅ XML sitemap
- ✅ Rich meta tags (OG, Twitter Cards, AEO tags)
I still get a Discovery score of 16% because the tool doesn't detect most of these signals. This severely undervalues well-optimized content sites and makes the score unreliable for content creators.
## Additional Context
The llms.txt standard has growing adoption (~1,000+ domains) and is documented at llmstxt.org. While no major LLM provider has officially confirmed they follow llms.txt during crawling, it's a low-effort, high-signal file that clearly communicates site structure to AI systems.
Relevant research: