[Feature]: Add llms.txt detection and content-focused scoring criteria #51

@IISweetHeartII

Description

Problem Statement

AX Score currently lacks detection for several important AI discoverability signals, particularly for content-focused websites. The tool doesn't check for:

  1. llms.txt — An emerging standard (llmstxt.org) specifically designed to help LLMs discover and understand site content
  2. Rich JSON-LD schema types — Only checks for JSON-LD presence, not specific schema types that matter for AI understanding (WebSite, BlogPosting, Person, BreadcrumbList, FAQPage)
  3. AI crawler permissions — Whether robots.txt explicitly allows AI crawlers (GPTBot, ClaudeBot, PerplexityBot, anthropic-ai, Google-Extended)
  4. Content feed availability — RSS/Atom feeds for machine-readable content syndication
  5. Semantic HTML quality — Proper heading hierarchy, <article>, <main>, <nav> landmarks

These are the signals that actually determine whether AI agents can discover, understand, and cite a website's content.

Proposed Solution

1. llms.txt Detection (High Priority)

Check: GET /llms.txt
Scoring:
  - File exists and is valid Markdown: +points
  - Contains H1 heading (site name): +points
  - Contains blockquote summary: +points
  - Contains sectioned URL lists: +points
  - Has companion llms-full.txt: +bonus points
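A minimal sketch of this check in Python (the point values and the function name are illustrative placeholders, not part of any existing AX Score API):

```python
import re

# Hypothetical weights -- the real values would be decided by the AX Score maintainers.
def score_llms_txt(content: str, has_full_variant: bool = False) -> int:
    """Score an llms.txt file against the llmstxt.org structure conventions."""
    score = 0
    if re.search(r"^# .+", content, re.MULTILINE):    # H1 heading (site name)
        score += 10
    if re.search(r"^> .+", content, re.MULTILINE):    # blockquote summary
        score += 10
    # Sectioned URL lists: an H2 section followed by markdown links
    if re.search(r"^## .+", content, re.MULTILINE) and \
       re.search(r"^- \[.+?\]\(.+?\)", content, re.MULTILINE):
        score += 10
    if has_full_variant:                              # companion llms-full.txt found
        score += 5
    return score
```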

2. JSON-LD Schema Type Analysis (High Priority)

Instead of just checking if <script type="application/ld+json"> exists, analyze the types present:

| Schema Type | Page Context | Points |
| --- | --- | --- |
| WebSite | Homepage | High |
| BlogPosting / Article | Article pages | High |
| Person / Organization | Any page | Medium |
| BreadcrumbList | All pages | Medium |
| FAQPage | FAQ pages | Medium |
| Uses stable @id references | Cross-page | Bonus |
| sameAs links to social profiles | Author entity | Bonus |
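Extracting the declared types (including nodes nested under @graph) could look roughly like this Python sketch; the regex-based script extraction is a simplification of what a real HTML parser would do:

```python
import json
import re

def extract_jsonld_types(html: str) -> set[str]:
    """Collect every @type declared in JSON-LD script blocks, including @graph members."""
    types: set[str] = set()
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    for block in re.findall(pattern, html, re.DOTALL | re.IGNORECASE):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD earns no credit
        nodes = data if isinstance(data, list) else [data]
        for node in nodes:
            if not isinstance(node, dict):
                continue
            # A node may wrap its entities in @graph; otherwise treat it as one entity.
            for item in node.get("@graph", [node]):
                if not isinstance(item, dict):
                    continue
                t = item.get("@type")
                if isinstance(t, str):
                    types.add(t)
                elif isinstance(t, list):
                    types.update(t)
    return types
```

The scorer would then award points by intersecting the returned set with the table above.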

3. AI Crawler Permissions (Medium Priority)

Parse robots.txt and check:

  • Does it explicitly allow AI crawler user agents?
  • Are there specific AI crawler rules (not just catch-all)?
  • Known AI crawlers: GPTBot, ChatGPT-User, ClaudeBot, anthropic-ai, PerplexityBot, Google-Extended, Bard, Applebot-Extended, CCBot
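One way to sketch this check in Python (covering a subset of the user agents listed above; the function name and return shape are hypothetical):

```python
AI_CRAWLERS = {"gptbot", "chatgpt-user", "claudebot", "anthropic-ai",
               "perplexitybot", "google-extended", "applebot-extended", "ccbot"}

def ai_crawler_rules(robots_txt: str) -> dict[str, bool]:
    """Map each known AI crawler to whether robots.txt names it in a User-agent line.

    A named crawler implies a deliberate policy (allow or disallow); a site that
    only uses 'User-agent: *' gets no crawler-specific credit.
    """
    named = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.lower().startswith("user-agent:"):
            agent = line.split(":", 1)[1].strip().lower()
            named.add(agent)
    return {bot: bot in named for bot in AI_CRAWLERS}
```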

4. Content Feed Detection (Medium Priority)

Check: <link rel="alternate" type="application/rss+xml" ...>
Check: GET /rss.xml, /feed.xml, /atom.xml
Scoring: Feed exists and returns valid XML with entries
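The HTML side of this check could be sketched as follows (Python; regex-based tag matching is a stand-in for a proper HTML parser, and the fallback paths mirror the GET probes proposed above):

```python
import re

# Fallback paths to probe with GET when no feed is declared in the HTML.
COMMON_FEED_PATHS = ["/rss.xml", "/feed.xml", "/atom.xml"]

def declared_feeds(html: str) -> list[str]:
    """Return href values of <link rel="alternate"> tags that advertise an RSS/Atom feed."""
    feeds = []
    for tag in re.findall(r"<link\b[^>]*>", html, re.IGNORECASE):
        low = tag.lower()
        if 'rel="alternate"' not in low:
            continue
        if "application/rss+xml" in low or "application/atom+xml" in low:
            m = re.search(r'href="([^"]+)"', tag, re.IGNORECASE)
            if m:
                feeds.append(m.group(1))
    return feeds
```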

5. Semantic HTML Analysis (Lower Priority)

  • Proper heading hierarchy (h1 → h2 → h3, no skips)
  • Use of <article>, <main>, <nav>, <header>, <footer>
  • Content accessible without JavaScript rendering
  • Meaningful <meta name="description"> present
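The heading-hierarchy rule in particular is mechanical enough to sketch directly (Python; function name is illustrative):

```python
import re

def heading_skips(html: str) -> list[tuple[int, int]]:
    """Return (previous, current) pairs where the heading level jumps by more than one,
    e.g. an <h3> directly after an <h1>. An empty list means no skips."""
    # Opening heading tags only; "</h1>" does not match "<h(".
    levels = [int(m) for m in re.findall(r"<h([1-6])\b", html, re.IGNORECASE)]
    return [(a, b) for a, b in zip(levels, levels[1:]) if b - a > 1]
```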

Alternatives Considered

  • Only add llms.txt: Quick win but misses the bigger picture of content discoverability.
  • Rely on existing Discovery category: The current Discovery check (16% for a well-optimized blog) shows it's not comprehensive enough.

Use Case

As a content creator who has:

  • ✅ Comprehensive JSON-LD (WebSite, BlogPosting, Person, BreadcrumbList, FAQPage schemas)
  • ✅ llms.txt with curated content map
  • ✅ robots.txt allowing 10+ AI crawlers
  • ✅ RSS feed with all published posts
  • ✅ XML sitemap
  • ✅ Rich meta tags (OG, Twitter Cards, AEO tags)

I still get a Discovery score of 16% because the tool doesn't detect most of these signals. This severely undervalues well-optimized content sites and makes the score unreliable for content creators.

Additional Context

The llms.txt standard has growing adoption (~1,000+ domains) and is documented at llmstxt.org. While no major LLM provider has officially confirmed they follow llms.txt during crawling, it's a low-effort, high-signal file that clearly communicates site structure to AI systems.
