AGENTS.md

This file provides guidance for AI assistants working with code in this repository.

Project Overview

This is a semantic parser for ProseMirror/TipTap content structures. It transforms rich text editor content into structured, semantic groups that web components can consume. The parser bridges the gap between natural content writing and component-based web development.

Development Commands

# Run all tests
npm test

# Run tests with JSON report output
npm run test-report

# Run a specific test file
npx jest tests/parser.test.js

# Run tests in watch mode
npx jest --watch

# Run a specific test by name
npx jest -t "handles simple document structure"

Architecture

Three-Stage Processing Pipeline

The parser processes content through three distinct stages, each building on the previous:

  1. Sequence Processing (src/processors/sequence.js): Flattens the ProseMirror document tree into a linear sequence of semantic elements (headings, paragraphs, images, lists, etc.)

  2. Groups Processing (src/processors/groups.js): Transforms the sequence into semantic groups with identified main content and items. Supports two grouping modes:

    • Heading-based grouping (default)
    • Divider-based grouping (when horizontal rules are present)
  3. ByType Processing (src/processors/byType.js): Organizes elements by type with positional context, enabling type-specific queries
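As a rough illustration of the first stage, the sketch below flattens a tiny ProseMirror-style JSON document into a linear sequence and then picks a grouping mode. The names `textOf` and `flattenDoc`, the element shapes, and the `divider` type are assumptions made for this example; the real logic lives in src/processors/sequence.js and may differ.

```javascript
// Hypothetical, simplified flattener (not the project's implementation).
function textOf(node) {
  // Concatenate the text of all descendant text nodes.
  if (node.type === 'text') return node.text || '';
  return (node.content || []).map(textOf).join('');
}

function flattenDoc(doc) {
  const sequence = [];
  for (const node of doc.content || []) {
    if (node.type === 'heading') {
      sequence.push({ type: 'heading', level: node.attrs.level, text: textOf(node) });
    } else if (node.type === 'paragraph') {
      sequence.push({ type: 'paragraph', text: textOf(node) });
    } else if (node.type === 'horizontalRule') {
      // Assumed element shape for dividers.
      sequence.push({ type: 'divider' });
    }
  }
  return sequence;
}

const doc = {
  type: 'doc',
  content: [
    { type: 'heading', attrs: { level: 1 }, content: [{ type: 'text', text: 'Welcome' }] },
    { type: 'paragraph', content: [{ type: 'text', text: 'Hello ' }, { type: 'text', text: 'world' }] },
  ],
};

const sequence = flattenDoc(doc);

// Any divider in the sequence would switch grouping to divider mode.
const mode = sequence.some((el) => el.type === 'divider') ? 'divider' : 'heading';
```

Keeping the flattening step separate from grouping means the grouping stage only has to reason about a flat array, not a tree.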

The main entry point (src/index.js) returns all three views:

import { parseContent } from './src/index.js';

const result = parseContent(doc);
// {
//   raw: doc,        // Original ProseMirror document
//   sequence: [...], // Flat sequence of elements
//   groups: {...},   // Semantic groups with main/items
//   byType: {...}    // Elements organized by type
// }

Content Output Structure

The parser returns a flat content structure:

{
  title: '',       // Main heading
  pretitle: '',    // Heading before main title
  subtitle: '',    // Heading after main title
  paragraphs: [],
  links: [],       // All link-like entities (including buttons, documents)
  images: [],
  icons: [],
  videos: [],
  lists: [],
  quotes: [],
  snippets: [],    // Fenced code — [{ language, code }]
  data: {},        // Structured data (tagged data blocks, forms, cards)
  headings: [],    // Headings after subtitle, in document order
  items: [],       // Child content groups (same structure recursively)
}
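Because `items` repeats the same structure recursively, consumers can walk it with a small recursive helper. The sketch below collects every title with its nesting depth; `collectTitles` and the sample content are hypothetical and only illustrate the shape described above.

```javascript
// Hypothetical helper: walk a content object and its nested items.
function collectTitles(content, depth = 0, out = []) {
  if (content.title) out.push({ depth, title: content.title });
  for (const item of content.items || []) {
    collectTitles(item, depth + 1, out);
  }
  return out;
}

// Sample data shaped like the structure above (values are made up).
const content = {
  title: 'Features',
  paragraphs: [],
  items: [
    { title: 'Fast', items: [] },
    { title: 'Small', items: [{ title: 'Tiny core', items: [] }] },
  ],
};

const titles = collectTitles(content);
```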

Link Roles

Links include buttons and documents, distinguished by role:

links: [
  { href: "/page", label: "Learn More", role: "link" },
  { href: "/action", label: "Get Started", role: "button", variant: "primary" },
  { href: "/file.pdf", label: "Download", role: "document", download: true },
]
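A component that wants only buttons or only documents can bucket the array by `role`. This is a minimal sketch; `groupLinksByRole` is a hypothetical helper, and treating a missing role as `"link"` is an assumption.

```javascript
// Hypothetical helper: bucket links by their role.
function groupLinksByRole(links) {
  const byRole = {};
  for (const link of links) {
    const role = link.role || 'link'; // assumption: default role
    if (!byRole[role]) byRole[role] = [];
    byRole[role].push(link);
  }
  return byRole;
}

const links = [
  { href: '/page', label: 'Learn More', role: 'link' },
  { href: '/action', label: 'Get Started', role: 'button', variant: 'primary' },
  { href: '/file.pdf', label: 'Download', role: 'document', download: true },
];

const byRole = groupLinksByRole(links);
const primaryButtons = (byRole.button || []).filter((l) => l.variant === 'primary');
```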

Structured Data

The data object holds all structured content:

data: {
  "nav-links": [...],     // From ```yaml:nav-links
  "config": {...},        // From ```yaml:config
  "stats": [...],         // From FormBlock (activeSchemaId='stats') or ```yaml:stats
  "person": [...],        // From card-group with cardType="person"
  "event": [...]          // From card-group with cardType="event"
}

FormBlock data is routed to data[activeSchemaId]. A legacy FormBlock without a schemaId still lands at data.form.
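The routing rule can be pictured as a key lookup with a fallback. The sketch below is an assumption-heavy illustration: `resolveFormKey` and `routeFormBlock` are hypothetical names, and how the real parser reads the node's attributes or merges repeated keys is not specified here.

```javascript
// Hypothetical: pick the data key for a FormBlock.
function resolveFormKey(attrs = {}) {
  // Fall back to "form" when no schema id is present (legacy blocks).
  return attrs.activeSchemaId || 'form';
}

// Hypothetical: return a new data object with the form value routed in.
function routeFormBlock(data, attrs, value) {
  return { ...data, [resolveFormKey(attrs)]: value };
}

const withStats = routeFormBlock({}, { activeSchemaId: 'stats' }, [{ label: 'Users', value: 10 }]);
const withLegacy = routeFormBlock({}, {}, { email: 'a@example.com' });
```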

Main Content Identification

The identifyMainContent() function (src/processors/groups.js:282) determines if the first group should be treated as main content:

  • Single group is always main content
  • First group must have a lower heading level number than the second group (a more important heading, e.g. H1 before H2)
  • Divider mode affects main content identification
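The first two rules can be sketched as a small predicate. This is not the actual identifyMainContent() code: the group shape (an array of sequence elements), the helper names, and treating a heading-less group as level `Infinity` are assumptions, and divider-mode behavior is deliberately left out.

```javascript
// Hypothetical: level of the first heading in a group, or Infinity if none.
function firstHeadingLevel(group) {
  const heading = group.find((el) => el.type === 'heading');
  return heading ? heading.level : Infinity;
}

// Hypothetical predicate covering only the heading-mode rules listed above.
function isFirstGroupMain(groups) {
  if (groups.length === 0) return false;
  if (groups.length === 1) return true; // a single group is always main
  return firstHeadingLevel(groups[0]) < firstHeadingLevel(groups[1]);
}

const h = (level) => ({ type: 'heading', level });
const mainThenItem = isFirstGroupMain([[h(1)], [h(2)]]);
const twoPeers = isFirstGroupMain([[h(2)], [h(2)]]);
const single = isFirstGroupMain([[h(3)]]);
```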

Special Element Detection

The sequence processor identifies several special element types by inspecting paragraph content:

  • Links: Paragraphs containing only a single link mark
  • Images: Paragraphs with a single image (role: 'image' or 'banner')
  • Icons: Paragraphs with a single image (role: 'icon')
  • Buttons: Editor button nodes → mapped to links with role: "button"
  • Videos: Paragraphs with a single image (role: 'video')
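For the link case, detection can be pictured as a check on the paragraph's children. The sketch below assumes the usual ProseMirror JSON shape, where a link is a mark on a text node; `isLinkOnlyParagraph` is a hypothetical name and the real checks in sequence.js may be stricter.

```javascript
// Hypothetical check: a paragraph whose only child is a text node carrying a link mark.
function isLinkOnlyParagraph(node) {
  if (node.type !== 'paragraph') return false;
  const children = node.content || [];
  if (children.length !== 1) return false;
  const child = children[0];
  return child.type === 'text' && (child.marks || []).some((m) => m.type === 'link');
}

const linkParagraph = {
  type: 'paragraph',
  content: [
    { type: 'text', text: 'Learn More', marks: [{ type: 'link', attrs: { href: '/page' } }] },
  ],
};

const mixedParagraph = {
  type: 'paragraph',
  content: [
    { type: 'text', text: 'See ' },
    { type: 'text', text: 'this page', marks: [{ type: 'link', attrs: { href: '/page' } }] },
  ],
};
```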

Editor Node Mappings

Editor-specific nodes are mapped to standard entities:

  • button node → links[] with role: "button" and variant attribute
  • FormBlock → data[activeSchemaId] (fallback: data.form when no schemaId)
  • card-group → data[cardType] arrays (e.g., data.person, data.event)
  • document-group → links[] with role: "document" and download: true

Tagged Data Blocks

Data blocks with tags route parsed data to the data object:

```yaml:nav-links
- label: Home
  href: /
- label: About
  href: /about
```

```yaml:config
title: My Site
theme: dark
```

JSON is also supported (`json:tag-name`) if you prefer.

Results in:

```js
content.data['nav-links'] = [
  { label: "Home", href: "/" },
  { label: "About", href: "/about" }
]
content.data['config'] = { title: "My Site", theme: "dark" }
```

Parsing rules:

  • Tagged blocks with json language: parsed as JSON
  • Tagged blocks with yaml/yml language: parsed as YAML
  • Untagged blocks: not parsed (stay as raw text in sequence for display)
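These rules amount to splitting the fence info string into a language and a tag, then dispatching to a parser. The sketch below is illustrative only: `parseInfoString`, `parseTaggedBlock`, and the `parsers` registry are hypothetical, and only JSON is wired up so the example has no dependencies. A YAML parser would be registered under `yaml` in the same way.

```javascript
// Hypothetical: split "yaml:nav-links" into { language: "yaml", tag: "nav-links" }.
function parseInfoString(info) {
  const [language, ...rest] = (info || '').trim().split(':');
  const tag = rest.join(':') || null;
  return { language: language.toLowerCase(), tag };
}

// Hypothetical: parse a tagged block, or return null to leave it as raw text.
function parseTaggedBlock(info, source, parsers) {
  const { language, tag } = parseInfoString(info);
  if (!tag) return null; // untagged blocks are not parsed
  const parse = parsers[language === 'yml' ? 'yaml' : language];
  if (!parse) return null; // no parser registered for this language
  return { tag, value: parse(source) };
}

const parsers = { json: (text) => JSON.parse(text) };

const tagged = parseTaggedBlock('json:config', '{"title":"My Site","theme":"dark"}', parsers);
const untagged = parseTaggedBlock('json', '{"ignored":true}', parsers);
```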

List Processing

Lists maintain hierarchy through nested structure. The processListItems() function in sequence.js handles nested lists, while processListContent() in groups.js applies full group content processing to each list item, allowing lists to contain rich content (images, paragraphs, nested lists, etc.).
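The nesting can be pictured with a small recursive walk. This sketch is not the project's processListItems(): `sketchListItems` is a hypothetical name, the TipTap-style node types (`bulletList`, `orderedList`, `listItem`) are assumed, and each item is reduced to plain text rather than full group content.

```javascript
// Concatenate the text of all descendant text nodes.
function textOf(node) {
  if (node.type === 'text') return node.text || '';
  return (node.content || []).map(textOf).join('');
}

const LIST_TYPES = ['bulletList', 'orderedList']; // assumed node type names

// Hypothetical: turn a list node into [{ text, items }] with nested lists preserved.
function sketchListItems(listNode) {
  return (listNode.content || []).map((item) => {
    const children = item.content || [];
    const nested = children.filter((c) => LIST_TYPES.includes(c.type));
    const own = children.filter((c) => !LIST_TYPES.includes(c.type));
    return {
      text: own.map(textOf).join(' '),
      items: nested.flatMap(sketchListItems),
    };
  });
}

const para = (text) => ({ type: 'paragraph', content: [{ type: 'text', text }] });
const list = {
  type: 'bulletList',
  content: [
    {
      type: 'listItem',
      content: [
        para('Fruit'),
        { type: 'bulletList', content: [{ type: 'listItem', content: [para('Apple')] }] },
      ],
    },
  ],
};

const items = sketchListItems(list);
```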

Content Writing Conventions

Key patterns:

  • Pretitle Pattern: Any heading followed by a more important heading (e.g., H3→H1, H2→H1, H6→H5)
  • Banner Pattern: Image (with banner role or followed by heading) at start of first group
  • Divider Mode: Presence of any horizontalRule switches entire document to divider-based grouping
  • Heading Groups: Consecutive headings are consumed together only when each is exactly one level deeper (H1→H2 yes, H1→H3 no — skipped levels start a new group)
  • Main Content: First group is main if it's the only group OR has lower heading level than second group
  • Body Headings: Headings after the title and subtitle slots are collected in headings in document order
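The Heading Groups and Pretitle rules are easy to misread, so here is a small sketch of both as functions over heading levels. `countGroupedHeadings` and `isPretitle` are hypothetical names, and how the two rules interact in the real grouping code is not modeled.

```javascript
// Hypothetical: how many consecutive headings, starting at the first,
// are consumed together (each must be exactly one level deeper than the last).
function countGroupedHeadings(levels) {
  if (levels.length === 0) return 0;
  let count = 1;
  while (count < levels.length && levels[count] === levels[count - 1] + 1) {
    count++;
  }
  return count;
}

// Hypothetical: a heading is a pretitle when the next heading is more important.
function isPretitle(level, nextLevel) {
  return nextLevel < level;
}

const consecutive = countGroupedHeadings([1, 2, 3]); // H1→H2→H3
const skipped = countGroupedHeadings([1, 3]);        // H1→H3 skips a level
const pretitle = isPretitle(3, 1);                   // H3 followed by H1
```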

Testing Structure

Tests are organized by processor:

  • tests/parser.test.js - Integration tests
  • tests/processors/sequence.test.js - Sequence processing
  • tests/processors/groups.test.js - Groups processing
  • tests/processors/byType.test.js - ByType processing
  • tests/utils/role.test.js - Role utilities
  • tests/fixtures/ - Shared test documents

Important Implementation Notes

  • The parser never modifies the original ProseMirror document
  • Text content can include inline HTML for formatting (bold → <strong>, italic → <em>, links → <a>)
  • Context information in byType includes position, previous/next elements, and nearest heading
  • Group splitting logic differs significantly between heading mode and divider mode
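To make the byType context note concrete, the sketch below builds a type index over a flat sequence and attaches position, neighbors, and the nearest preceding heading to each entry. `buildByType` and the `context` field names are assumptions for illustration; the actual shape produced by src/processors/byType.js may differ. The input sequence is left untouched, in line with the no-mutation note above.

```javascript
// Hypothetical: index a flat sequence by element type, with positional context.
function buildByType(sequence) {
  const byType = {};
  let nearestHeading = null;
  sequence.forEach((element, position) => {
    const entry = {
      ...element, // copy, so the input element is never modified
      context: {
        position,
        previous: sequence[position - 1] || null,
        next: sequence[position + 1] || null,
        nearestHeading, // nearest heading that precedes this element
      },
    };
    if (!byType[element.type]) byType[element.type] = [];
    byType[element.type].push(entry);
    if (element.type === 'heading') nearestHeading = element;
  });
  return byType;
}

const sequence = [
  { type: 'heading', level: 1, text: 'Intro' },
  { type: 'paragraph', text: 'A' },
  { type: 'image', src: '/a.png' },
  { type: 'paragraph', text: 'B' },
];

const byType = buildByType(sequence);
```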