AGENTS.md

This file provides guidance for AI assistants working with code in this repository.

Project Overview

This is a semantic parser for ProseMirror/TipTap content structures. It transforms rich text editor content into structured, semantic groups that web components can consume. The parser bridges the gap between natural content writing and component-based web development.

Development Commands

# Run all tests
npm test

# Run tests with JSON report output
npm run test-report

# Run a specific test file
npx jest tests/parser.test.js

# Run tests in watch mode
npx jest --watch

# Run a specific test by name
npx jest -t "handles simple document structure"

Architecture

Three-Stage Processing Pipeline

The parser processes content through three distinct stages, each building on the previous:

Sequence Processing (src/processors/sequence.js): Flattens the ProseMirror document tree into a linear sequence of semantic elements (headings, paragraphs, images, lists, etc.)
Groups Processing (src/processors/groups.js): Transforms the sequence into semantic groups with identified main content and items. Supports two grouping modes:
- Heading-based grouping (default)
- Divider-based grouping (when horizontal rules are present)
ByType Processing (src/processors/byType.js): Organizes elements by type with positional context, enabling type-specific queries

The main entry point (src/index.js) returns all three views:

import { parseContent } from './src/index.js';

const result = parseContent(doc);
// {
//   raw: doc,        // Original ProseMirror document
//   sequence: [...], // Flat sequence of elements
//   groups: {...},   // Semantic groups with main/items
//   byType: {...}    // Elements organized by type
// }
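
To make the first stage more concrete, here is a minimal sketch of what flattening could look like. It is not the code in src/processors/sequence.js: the helper names `textOf` and `toSequence` and the element shape `{ type, level, text }` are assumptions made for illustration, and only top-level nodes are handled.

```javascript
// Hypothetical sketch, not the repository's sequence processor.
// The element shape { type, level, text } is assumed for illustration.

// Concatenate the text of a node and all of its descendants.
function textOf(node) {
  if (node.type === 'text') return node.text ?? '';
  return (node.content ?? []).map(textOf).join('');
}

// Turn the top-level children of a ProseMirror doc into a flat array.
function toSequence(doc) {
  return (doc.content ?? []).map((node) => {
    const element = { type: node.type, text: textOf(node) };
    if (node.type === 'heading') element.level = node.attrs?.level ?? 1;
    return element;
  });
}

const sampleDoc = {
  type: 'doc',
  content: [
    { type: 'heading', attrs: { level: 1 }, content: [{ type: 'text', text: 'Welcome' }] },
    {
      type: 'paragraph',
      content: [
        { type: 'text', text: 'Hello ' },
        { type: 'text', text: 'world' },
      ],
    },
  ],
};

const sampleSequence = toSequence(sampleDoc);
console.log(sampleSequence);
```

The real sequence stage also recognizes images, lists, and other special elements; this sketch only shows the tree-to-array idea.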

Content Output Structure

The parser returns a flat content structure. The only nesting is items, which holds child groups with the same shape:

{
  title: '',       // Main heading
  pretitle: '',    // Heading before main title
  subtitle: '',    // Heading after main title
  paragraphs: [],
  links: [],       // All link-like entities (including buttons, documents)
  images: [],
  icons: [],
  videos: [],
  lists: [],
  quotes: [],
  snippets: [],    // Fenced code — [{ language, code }]
  data: {},        // Structured data (tagged data blocks, forms, cards)
  headings: [],    // Headings after subtitle, in document order
  items: [],       // Child content groups (same structure recursively)
}
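
Because items repeats the same structure recursively, consumers usually walk it with a small recursive function. The sketch below assumes only the `title` and `items` fields shown above; `collectTitles` is a hypothetical helper, not part of the parser.

```javascript
// Hypothetical helper: collect every title in a content tree with its depth.
function collectTitles(content, depth = 0, out = []) {
  if (content.title) out.push({ depth, title: content.title });
  for (const item of content.items ?? []) {
    collectTitles(item, depth + 1, out);
  }
  return out;
}

// Hand-written sample in the documented shape (most fields omitted).
const sampleContent = {
  title: 'Features',
  items: [
    { title: 'Fast', items: [] },
    { title: 'Small', items: [{ title: 'Tiny core', items: [] }] },
  ],
};

const titles = collectTitles(sampleContent);
console.log(titles);
```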

Link Roles

Links include buttons and documents, distinguished by role:

links: [
  { href: "/page", label: "Learn More", role: "link" },
  { href: "/action", label: "Get Started", role: "button", variant: "primary" },
  { href: "/file.pdf", label: "Download", role: "document", download: true },
]
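
A component that renders buttons separately from plain links can split the array by role. This is a minimal sketch: `groupLinksByRole` is a hypothetical helper, and treating a missing role as "link" is an assumption rather than documented parser behavior.

```javascript
// Hypothetical helper: bucket links by their role field.
function groupLinksByRole(links) {
  const byRole = {};
  for (const link of links) {
    const role = link.role ?? 'link'; // assumption: default role is "link"
    if (!byRole[role]) byRole[role] = [];
    byRole[role].push(link);
  }
  return byRole;
}

const sampleLinks = [
  { href: '/page', label: 'Learn More', role: 'link' },
  { href: '/action', label: 'Get Started', role: 'button', variant: 'primary' },
  { href: '/file.pdf', label: 'Download', role: 'document', download: true },
];

const linksByRole = groupLinksByRole(sampleLinks);
console.log(Object.keys(linksByRole));
```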

Structured Data

The data object holds all structured content:

data: {
  "nav-links": [...],     // From ```yaml:nav-links
  "config": {...},        // From ```yaml:config
  "stats": [...],         // From FormBlock (activeSchemaId='stats') or ```yaml:stats
  "person": [...],        // From card-group with cardType="person"
  "event": [...]          // From card-group with cardType="event"
}

FormBlock data is routed to data[activeSchemaId]. A legacy FormBlock without a schemaId still lands at data.form.
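
The routing rule can be sketched as follows. The node shape (`attrs.activeSchemaId`, `attrs.data`) and the function name `routeFormBlock` are assumptions for illustration; the actual FormBlock attributes may differ.

```javascript
// Hypothetical sketch of FormBlock routing; attribute names are assumed.
function routeFormBlock(data, formBlock) {
  // Fall back to the legacy "form" key when no schema id is present.
  const key = formBlock.attrs?.activeSchemaId || 'form';
  return { ...data, [key]: formBlock.attrs?.data ?? null };
}

const withStats = routeFormBlock({}, {
  type: 'FormBlock',
  attrs: { activeSchemaId: 'stats', data: [{ label: 'Users', value: 120 }] },
});

const withLegacyForm = routeFormBlock({}, {
  type: 'FormBlock',
  attrs: { data: { email: 'a@example.com' } },
});

console.log(Object.keys(withStats), Object.keys(withLegacyForm));
```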

Main Content Identification

The identifyMainContent() function (src/processors/groups.js:282) determines if the first group should be treated as main content:

Single group is always main content
With multiple groups, the first group must have a lower heading level than the second group (a more important heading, e.g. H1 before H2)
Divider mode affects main content identification
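
The first two rules can be sketched as a small predicate. This is not the body of identifyMainContent(): the group shape `{ level }` and the name `firstGroupIsMain` are assumptions, and divider mode is deliberately left out because its effect is not specified here.

```javascript
// Hypothetical sketch of the heading-mode rules only.
// Each group is assumed to carry the level of its first heading.
function firstGroupIsMain(groups) {
  if (groups.length === 0) return false;
  if (groups.length === 1) return true; // a single group is always main
  const [first, second] = groups;
  return first.level < second.level; // H1 (1) before H2 (2) qualifies
}

console.log(firstGroupIsMain([{ level: 1 }, { level: 2 }, { level: 2 }])); // true
console.log(firstGroupIsMain([{ level: 2 }, { level: 2 }])); // false
```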

Special Element Detection

The sequence processor identifies several special element types by inspecting paragraph content:

Links: Paragraphs containing only a single link mark
Images: Paragraphs with single image (role: 'image' or 'banner')
Icons: Paragraphs with single image (role: 'icon')
Buttons: Editor button nodes → mapped to links with role: "button"
Videos: Paragraphs with single image (role: 'video')
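
The paragraph-based rules above can be sketched as a classifier over ProseMirror JSON. The function name `classifyParagraph`, the location of the role on `attrs.role`, and the default role of "image" are assumptions. Button nodes are not covered, since they are separate editor nodes rather than paragraphs.

```javascript
// Hypothetical sketch: classify a paragraph by its single child.
function classifyParagraph(paragraph) {
  const children = paragraph.content ?? [];
  if (children.length !== 1) return 'paragraph';

  const [child] = children;
  if (child.type === 'image') {
    const role = child.attrs?.role ?? 'image'; // assumption: default role
    if (role === 'icon') return 'icon';
    if (role === 'video') return 'video';
    return 'image'; // covers the 'image' and 'banner' roles
  }

  const hasLinkMark =
    child.type === 'text' && (child.marks ?? []).some((mark) => mark.type === 'link');
  return hasLinkMark ? 'link' : 'paragraph';
}

const iconParagraph = {
  type: 'paragraph',
  content: [{ type: 'image', attrs: { src: '/star.svg', role: 'icon' } }],
};
const linkParagraph = {
  type: 'paragraph',
  content: [
    { type: 'text', text: 'Docs', marks: [{ type: 'link', attrs: { href: '/docs' } }] },
  ],
};

console.log(classifyParagraph(iconParagraph), classifyParagraph(linkParagraph));
```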

Editor Node Mappings

Editor-specific nodes are mapped to standard entities:

button node → links[] with role: "button" and variant attribute
FormBlock → data[activeSchemaId] (fallback: data.form when no schemaId)
card-group → data[cardType] arrays (e.g., data.person, data.event)
document-group → links[] with role: "document" and download: true
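
As a rough illustration of two of these mappings, the sketch below routes a button node into links and a card-group node into data. Every attribute name on the input nodes (`href`, `label`, `variant`, `cardType`, and the per-card `attrs`) is an assumption, as is the helper name `applyEditorNode`; check the editor schema before relying on any of them.

```javascript
// Hypothetical sketch; node attribute names are assumed, not confirmed.
function applyEditorNode(content, node) {
  if (node.type === 'button') {
    content.links.push({
      href: node.attrs.href,
      label: node.attrs.label,
      role: 'button',
      variant: node.attrs.variant,
    });
  } else if (node.type === 'card-group') {
    const key = node.attrs.cardType; // e.g. "person" or "event"
    const cards = (node.content ?? []).map((card) => ({ ...card.attrs }));
    content.data[key] = [...(content.data[key] ?? []), ...cards];
  }
  return content;
}

const mapped = { links: [], data: {} };
applyEditorNode(mapped, {
  type: 'button',
  attrs: { href: '/action', label: 'Get Started', variant: 'primary' },
});
applyEditorNode(mapped, {
  type: 'card-group',
  attrs: { cardType: 'person' },
  content: [{ attrs: { name: 'Ada' } }, { attrs: { name: 'Grace' } }],
});

console.log(mapped);
```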

Tagged Data Blocks

Data blocks with tags route parsed data to the data object:

```yaml:nav-links
- label: Home
  href: /
- label: About
  href: /about
```

```yaml:config
title: My Site
theme: dark
```

JSON is also supported (`json:tag-name`).

Results in:

```js
content.data['nav-links'] = [
  { label: "Home", href: "/" },
  { label: "About", href: "/about" },
]
content.data['config'] = { title: "My Site", theme: "dark" }
```

Parsing rules:

Tagged blocks with json language: parsed as JSON
Tagged blocks with yaml/yml language: parsed as YAML
Untagged blocks: not parsed (stay as raw text in sequence for display)
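
These rules can be sketched as a small dispatcher on the fence info string. The helper names `parseInfoString` and `parseTaggedBlock` are hypothetical. JavaScript has no built-in YAML parser, so the sketch takes one as an argument (for example the `load` function from the js-yaml package); which YAML library the parser actually uses is not stated here.

```javascript
// Hypothetical sketch of the tagged-block parsing rules.

// Split "yaml:nav-links" into { language: "yaml", tag: "nav-links" }.
function parseInfoString(info) {
  const idx = info.indexOf(':');
  if (idx === -1) return { language: info.trim(), tag: null };
  return {
    language: info.slice(0, idx).trim(),
    tag: info.slice(idx + 1).trim() || null,
  };
}

// Returns { tag, value } for tagged json/yaml blocks, or null when the
// block should stay as raw text. parseYaml is supplied by the caller.
function parseTaggedBlock(info, code, parseYaml) {
  const { language, tag } = parseInfoString(info);
  if (!tag) return null; // untagged blocks are not parsed
  if (language === 'json') return { tag, value: JSON.parse(code) };
  if (language === 'yaml' || language === 'yml') {
    if (typeof parseYaml !== 'function') {
      throw new Error('A YAML parser function must be supplied');
    }
    return { tag, value: parseYaml(code) };
  }
  return null;
}

const parsedConfig = parseTaggedBlock('json:config', '{"title":"My Site","theme":"dark"}');
console.log(parsedConfig);
```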

List Processing

Lists maintain hierarchy through nested structure. The processListItems() function in sequence.js handles nested lists, while processListContent() in groups.js applies full group content processing to each list item, allowing lists to contain rich content (images, paragraphs, nested lists, etc.).
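
The nested structure can be pictured with a short recursive sketch over TipTap-style list JSON (`bulletList`, `orderedList`, `listItem`). This is not processListItems() or processListContent(): `listToTree` is a hypothetical helper, and the output shape `{ text, items }` is an assumption that ignores the rich content real list items can hold.

```javascript
// Hypothetical sketch: turn nested list nodes into a { text, items } tree.
const LIST_TYPES = new Set(['bulletList', 'orderedList']);

// Concatenate the text of a node and its descendants.
function plainText(node) {
  if (node.type === 'text') return node.text ?? '';
  return (node.content ?? []).map(plainText).join('');
}

function listToTree(listNode) {
  return (listNode.content ?? []).map((item) => {
    const children = item.content ?? [];
    const text = children
      .filter((child) => !LIST_TYPES.has(child.type))
      .map(plainText)
      .join(' ');
    // Recurse into any list nested inside this item.
    const items = children
      .filter((child) => LIST_TYPES.has(child.type))
      .flatMap((child) => listToTree(child));
    return { text, items };
  });
}

const paragraph = (text) => ({ type: 'paragraph', content: [{ type: 'text', text }] });
const sampleList = {
  type: 'bulletList',
  content: [
    {
      type: 'listItem',
      content: [
        paragraph('Fruit'),
        { type: 'bulletList', content: [{ type: 'listItem', content: [paragraph('Apple')] }] },
      ],
    },
    { type: 'listItem', content: [paragraph('Bread')] },
  ],
};

const listTree = listToTree(sampleList);
console.log(JSON.stringify(listTree));
```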

Content Writing Conventions

Key patterns:

Pretitle Pattern: Any heading followed by a more important heading (e.g., H3→H1, H2→H1, H6→H5, etc.)
Banner Pattern: Image (with banner role or followed by heading) at start of first group
Divider Mode: Presence of any horizontalRule switches entire document to divider-based grouping
Heading Groups: Consecutive headings are consumed together only when each is exactly one level deeper (H1→H2 yes, H1→H3 no — skipped levels start a new group)
Main Content: First group is main if it is the only group OR has a lower heading level (a more important heading, e.g. H1 before H2) than the second group
Body Headings: Headings after the title and subtitle slots are collected in headings in document order
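
Three of these patterns reduce to simple checks on heading levels and node types. The sketch below is illustrative only: the function names are hypothetical, and the assumption that dividers appear in the sequence with type 'horizontalRule' is not confirmed by this file.

```javascript
// Hypothetical sketches of the pretitle, divider, and heading-group rules.

// Pretitle: a heading followed by a more important (lower-numbered) heading.
function isPretitle(level, nextLevel) {
  return nextLevel < level;
}

// Divider mode: any horizontalRule switches the whole document.
function usesDividerMode(sequence) {
  return sequence.some((element) => element.type === 'horizontalRule');
}

// Heading groups: count how many consecutive headings are consumed together,
// given their levels. Each must be exactly one level deeper than the last.
function headingRunLength(levels) {
  if (levels.length === 0) return 0;
  let count = 1;
  while (count < levels.length && levels[count] === levels[count - 1] + 1) {
    count += 1;
  }
  return count;
}

console.log(isPretitle(3, 1)); // true: H3 before H1
console.log(headingRunLength([1, 2, 3])); // 3: H1, H2, H3 consumed together
console.log(headingRunLength([1, 3])); // 1: the skipped level starts a new group
```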

Testing Structure

Tests are organized by processor:

tests/parser.test.js - Integration tests
tests/processors/sequence.test.js - Sequence processing
tests/processors/groups.test.js - Groups processing
tests/processors/byType.test.js - ByType processing
tests/utils/role.test.js - Role utilities
tests/fixtures/ - Shared test documents

Important Implementation Notes

The parser never modifies the original ProseMirror document
Text content can include inline HTML for formatting (bold → <strong>, italic → <em>, links → <a>)
Context information in byType includes position, previous/next elements, and nearest heading
Group splitting logic differs significantly between heading mode and divider mode
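
To illustrate the context information attached in byType, here is a minimal sketch that groups a sequence by type while recording position, neighbours, and the nearest preceding heading. The field names under `context` and the choice to use the closest earlier heading are assumptions; `buildByType` is a hypothetical helper, not src/processors/byType.js. The input array is read but never mutated, in keeping with the first note above.

```javascript
// Hypothetical sketch of grouping by type with positional context.
function buildByType(sequence) {
  const byType = {};
  let lastHeading = null;

  sequence.forEach((element, index) => {
    const entry = {
      ...element, // shallow copy so the input elements stay untouched
      context: {
        position: index,
        previous: sequence[index - 1] ?? null,
        next: sequence[index + 1] ?? null,
        nearestHeading: lastHeading, // assumption: closest earlier heading
      },
    };
    if (!byType[element.type]) byType[element.type] = [];
    byType[element.type].push(entry);
    if (element.type === 'heading') lastHeading = element;
  });

  return byType;
}

const contextSequence = [
  { type: 'heading', level: 1, text: 'Intro' },
  { type: 'paragraph', text: 'First' },
  { type: 'image', src: '/a.png' },
  { type: 'paragraph', text: 'Second' },
];

const sampleByType = buildByType(contextSequence);
console.log(Object.keys(sampleByType));
```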
