This file provides guidance for AI assistants working with code in this repository.
This is a semantic parser for ProseMirror/TipTap content structures. It transforms rich text editor content into structured, semantic groups that web components can consume. The parser bridges the gap between natural content writing and component-based web development.
```bash
# Run all tests
npm test
# Run tests with JSON report output
npm run test-report
# Run a specific test file
npx jest tests/parser.test.js
# Run tests in watch mode
npx jest --watch
# Run a specific test by name
npx jest -t "handles simple document structure"
```
The parser processes content through three distinct stages, each building on the previous:
- Sequence Processing (`src/processors/sequence.js`): Flattens the ProseMirror document tree into a linear sequence of semantic elements (headings, paragraphs, images, lists, etc.)
- Groups Processing (`src/processors/groups.js`): Transforms the sequence into semantic groups with identified main content and items. Supports two grouping modes:
  - Heading-based grouping (default)
  - Divider-based grouping (when horizontal rules are present)
- ByType Processing (`src/processors/byType.js`): Organizes elements by type with positional context, enabling type-specific queries
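As a rough illustration of the first stage, the sketch below flattens a tiny ProseMirror-style tree into a linear sequence. The `flattenDoc` and `textOf` helpers, the `divider` element type, and the output element shapes are assumptions made for this example; the real logic lives in `src/processors/sequence.js` and handles many more node types.

```javascript
// Hypothetical sketch, not the real sequence processor.

// Concatenate the text of a node and all of its descendants.
function textOf(node) {
  if (node.type === 'text') return node.text || '';
  return (node.content || []).map(textOf).join('');
}

// Turn top-level document nodes into a flat list of simple elements.
function flattenDoc(doc) {
  const sequence = [];
  for (const node of doc.content || []) {
    if (node.type === 'heading') {
      sequence.push({ type: 'heading', level: node.attrs.level, text: textOf(node) });
    } else if (node.type === 'paragraph') {
      sequence.push({ type: 'paragraph', text: textOf(node) });
    } else if (node.type === 'horizontalRule') {
      sequence.push({ type: 'divider' }); // assumed element name
    }
  }
  return sequence;
}

const sampleDoc = {
  type: 'doc',
  content: [
    { type: 'heading', attrs: { level: 1 }, content: [{ type: 'text', text: 'Title' }] },
    {
      type: 'paragraph',
      content: [
        { type: 'text', text: 'Hello ' },
        { type: 'text', text: 'world' },
      ],
    },
  ],
};

const sampleSequence = flattenDoc(sampleDoc);
```

The input document is only read, never mutated, which matches the parser's stated guarantee about the original ProseMirror document.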
The main entry point (src/index.js) returns all three views:
```js
import { parseContent } from './src/index.js';
const result = parseContent(doc);
// {
//   raw: doc,            // Original ProseMirror document
//   sequence: [...],     // Flat sequence of elements
//   groups: {...},       // Semantic groups with main/items
//   byType: {...}        // Elements organized by type
// }
```
The parser returns a flat content structure:
```js
{
  title: '',       // Main heading
  pretitle: '',    // Heading before main title
  subtitle: '',    // Heading after main title
  paragraphs: [],
  links: [],       // All link-like entities (including buttons, documents)
  images: [],
  icons: [],
  videos: [],
  lists: [],
  quotes: [],
  snippets: [],    // Fenced code: [{ language, code }]
  data: {},        // Structured data (tagged data blocks, forms, cards)
  headings: [],    // Headings after subtitle, in document order
  items: [],       // Child content groups (same structure recursively)
}
```
Links include buttons and documents, distinguished by role:
```js
links: [
  { href: "/page", label: "Learn More", role: "link" },
  { href: "/action", label: "Get Started", role: "button", variant: "primary" },
  { href: "/file.pdf", label: "Download", role: "document", download: true },
]
```
The data object holds all structured content:
```js
data: {
  "nav-links": [...],  // From ```yaml:nav-links
  "config": {...},     // From ```yaml:config
  "stats": [...],      // From FormBlock (activeSchemaId='stats') or ```yaml:stats
  "person": [...],     // From card-group with cardType="person"
  "event": [...]       // From card-group with cardType="event"
}
```
FormBlock data is routed to `data[activeSchemaId]`. A legacy FormBlock without a schemaId still lands at `data.form`.
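A minimal sketch of how a component might consume a content group shaped like the structure described above. The sample values and the helpers `linksByRole`, `collectTitles`, and `formData` are hypothetical and not part of the parser's API.

```javascript
// Sample content group; all values are made up for illustration.
const content = {
  title: 'Welcome',
  links: [
    { href: '/page', label: 'Learn More', role: 'link' },
    { href: '/action', label: 'Get Started', role: 'button', variant: 'primary' },
    { href: '/file.pdf', label: 'Download', role: 'document', download: true },
  ],
  data: {
    stats: [{ label: 'Users', value: 10 }],
    form: { email: 'a@example.com' },
  },
  items: [
    { title: 'Feature A', links: [], data: {}, items: [] },
    { title: 'Feature B', links: [], data: {}, items: [] },
  ],
};

// Hypothetical helper: pick the links that carry a given role.
function linksByRole(group, role) {
  return (group.links || []).filter((link) => link.role === role);
}

// Hypothetical helper: collect titles depth-first, relying on the
// fact that items repeat the same structure recursively.
function collectTitles(group) {
  const own = group.title ? [group.title] : [];
  const nested = (group.items || []).flatMap(collectTitles);
  return [...own, ...nested];
}

// Hypothetical helper: read FormBlock data by schema id, using the
// legacy data.form slot when no schema id is given.
function formData(group, schemaId) {
  const data = group.data || {};
  return schemaId ? data[schemaId] : data.form;
}

const buttons = linksByRole(content, 'button');
const titles = collectTitles(content);
```

Filtering on `role` keeps buttons, documents, and plain links in one array while still letting a component treat them differently.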
The identifyMainContent() function (src/processors/groups.js:282) determines if the first group should be treated as main content:
- Single group is always main content
- First group must have lower heading level than second group
- Divider mode affects main content identification
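The first two rules can be sketched as below. This is not the actual `identifyMainContent()` implementation: the `level` property on each group is a hypothetical field, "lower heading level" is read here as a smaller level number (H1 before H2), and the divider-mode behaviour is deliberately left out because its details are not described here.

```javascript
// Hypothetical sketch of the heading-mode rules only.
// Each group is assumed to expose the level of its first heading as `level`.
function isFirstGroupMain(groups) {
  if (groups.length === 0) return false; // nothing to classify
  if (groups.length === 1) return true;  // a single group is always main content
  // Assumption: "lower level" means a smaller number, e.g. H1 (1) vs H2 (2).
  return groups[0].level < groups[1].level;
}

const h1ThenH2 = isFirstGroupMain([{ level: 1 }, { level: 2 }, { level: 2 }]);
const twoH2 = isFirstGroupMain([{ level: 2 }, { level: 2 }]);
```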
The sequence processor identifies several special element types by inspecting paragraph content:
- Links: Paragraphs containing only a single link mark
- Images: Paragraphs with single image (role: 'image' or 'banner')
- Icons: Paragraphs with single image (role: 'icon')
- Buttons: Editor `button` nodes → mapped to links with `role: "button"`
- Videos: Paragraphs with single image (role: 'video')
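The paragraph checks in the list above could look roughly like this. The `classifyParagraph` function, the returned labels, and the default role of `'image'` are assumptions for the sake of the example; the real detection in `sequence.js` may differ.

```javascript
// Hypothetical sketch: classify a paragraph node by inspecting its only child.
function classifyParagraph(paragraph) {
  const children = paragraph.content || [];
  if (children.length !== 1) return 'paragraph';

  const [child] = children;
  if (child.type === 'image') {
    // Assumption: a missing role is treated like a plain image.
    const role = (child.attrs && child.attrs.role) || 'image';
    if (role === 'icon') return 'icon';
    if (role === 'video') return 'video';
    return 'image'; // covers role 'image' and 'banner'
  }

  const marks = child.marks || [];
  if (child.type === 'text' && marks.some((mark) => mark.type === 'link')) {
    return 'link';
  }
  return 'paragraph';
}

const iconKind = classifyParagraph({
  type: 'paragraph',
  content: [{ type: 'image', attrs: { src: '/star.svg', role: 'icon' } }],
});
const linkKind = classifyParagraph({
  type: 'paragraph',
  content: [{ type: 'text', text: 'Docs', marks: [{ type: 'link', attrs: { href: '/docs' } }] }],
});
```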
Editor-specific nodes are mapped to standard entities:
- `button` node → `links[]` with `role: "button"` and `variant` attribute
- `FormBlock` → `data[activeSchemaId]` (fallback: `data.form` when no schemaId)
- `card-group` → `data[cardType]` arrays (e.g., `data.person`, `data.event`)
- `document-group` → `links[]` with `role: "document"` and `download: true`
Data blocks with tags route parsed data to the data object:
```yaml:nav-links
- label: Home
  href: /
- label: About
  href: /about
```
```yaml:config
title: My Site
theme: dark
```
JSON is also supported (`json:tag-name`) if you prefer.
Results in:
```js
content.data['nav-links'] = [{ label: "Home", href: "/" }, { label: "About", href: "/about" }]
content.data['config'] = { title: "My Site", theme: "dark" }
```
Parsing rules:
- Tagged blocks with `json` language: parsed as JSON
- Tagged blocks with `yaml`/`yml` language: parsed as YAML
- Untagged blocks: not parsed (stay as raw text in sequence for display)
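The parsing rules above can be sketched as follows. `parseInfoString` and `routeDataBlock` are hypothetical names, and the YAML parser is passed in as a function because this sketch does not assume any particular YAML library; only the built-in `JSON.parse` is used directly.

```javascript
// Split a fence info string such as "yaml:nav-links" into language and tag.
function parseInfoString(info) {
  const [language, ...rest] = info.trim().split(':');
  const tag = rest.join(':');
  return { language, tag: tag || null };
}

// Hypothetical sketch: store a tagged block's parsed content under data[tag].
// Returns true when the block was routed, false when it should stay raw text.
function routeDataBlock(data, info, source, parseYaml) {
  const { language, tag } = parseInfoString(info);
  if (!tag) return false; // untagged blocks are not parsed

  if (language === 'json') {
    data[tag] = JSON.parse(source);
    return true;
  }
  if (language === 'yaml' || language === 'yml') {
    if (typeof parseYaml !== 'function') {
      throw new Error('A YAML parser function must be supplied');
    }
    data[tag] = parseYaml(source);
    return true;
  }
  return false; // tagged, but not a data language
}

const routed = {};
routeDataBlock(routed, 'json:nav-links', '[{"label":"Home","href":"/"}]');
const untaggedWasRouted = routeDataBlock(routed, 'js', 'console.log(1)');
```

Injecting the YAML parser keeps the sketch runnable without extra dependencies; the project itself presumably wires in a concrete parser.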
Lists maintain hierarchy through nested structure. The processListItems() function in sequence.js handles nested lists, while processListContent() in groups.js applies full group content processing to each list item, allowing lists to contain rich content (images, paragraphs, nested lists, etc.).
Key patterns:
- Pretitle Pattern: Any heading followed by a more important heading (e.g., H3→H1, H2→H1, H6→H5, etc.)
- Banner Pattern: Image (with banner role or followed by heading) at start of first group
- Divider Mode: Presence of any `horizontalRule` switches entire document to divider-based grouping
- Heading Groups: Consecutive headings are consumed together only when each is exactly one level deeper (H1→H2 yes, H1→H3 no; skipped levels start a new group)
- Main Content: First group is main if it's the only group OR has lower heading level than second group
- Body Headings: Headings after the title and subtitle slots are collected in `body.headings` in document order
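Two of the patterns above, expressed as small predicates. The function names and the representation of headings as `{ level }` objects or plain level numbers are assumptions for illustration, not the code in `groups.js`.

```javascript
// Pretitle pattern: a heading directly followed by a more important heading,
// i.e. one with a smaller level number (H3 then H1, H6 then H5, ...).
function isPretitle(current, next) {
  return Boolean(next) && next.level < current.level;
}

// Heading groups: starting at `start`, keep consuming headings while each one
// is exactly one level deeper than the previous. A skipped level ends the run.
function consumeHeadingRun(levels, start = 0) {
  let end = start + 1;
  while (end < levels.length && levels[end] === levels[end - 1] + 1) {
    end += 1;
  }
  return levels.slice(start, end);
}

const runWithStep = consumeHeadingRun([1, 2, 3, 3]);
const runWithSkip = consumeHeadingRun([1, 3]);
```

Here `runWithStep` stops before the repeated H3 and `runWithSkip` contains only the H1, since H1→H3 skips a level.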
Tests are organized by processor:
- `tests/parser.test.js` - Integration tests
- `tests/processors/sequence.test.js` - Sequence processing
- `tests/processors/groups.test.js` - Groups processing
- `tests/processors/byType.test.js` - ByType processing
- `tests/utils/role.test.js` - Role utilities
- `tests/fixtures/` - Shared test documents
- The parser never modifies the original ProseMirror document
- Text content can include inline HTML for formatting (bold → `<strong>`, italic → `<em>`, links → `<a>`)
- Context information in byType includes position, previous/next elements, and nearest heading
- Group splitting logic differs significantly between heading mode and divider mode
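To make the inline-HTML note concrete, here is a minimal sketch of turning a marked text node into HTML. The mark names `bold`, `italic`, and `link` follow common TipTap naming but are an assumption here, as are `renderTextNode` and the escaping strategy; the parser's real output may differ in detail.

```javascript
// Assumed mapping from mark type to HTML tag.
const MARK_TAGS = { bold: 'strong', italic: 'em' };

// Escape characters that would otherwise be read as markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical sketch: wrap a text node's content in one tag per mark,
// applying marks in array order (the first mark ends up innermost).
function renderTextNode(node) {
  let html = escapeHtml(node.text || '');
  for (const mark of node.marks || []) {
    if (mark.type === 'link') {
      const href = escapeHtml(String((mark.attrs && mark.attrs.href) || ''));
      html = `<a href="${href}">${html}</a>`;
    } else if (MARK_TAGS[mark.type]) {
      const tag = MARK_TAGS[mark.type];
      html = `<${tag}>${html}</${tag}>`;
    }
  }
  return html;
}

const boldHtml = renderTextNode({ type: 'text', text: 'Hi', marks: [{ type: 'bold' }] });
const linkHtml = renderTextNode({
  type: 'text',
  text: 'Docs',
  marks: [{ type: 'link', attrs: { href: '/docs' } }],
});
```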