TagSoup

TagSoup is the fastest pure JS SAX/DOM XML/HTML parser and serializer.

  • Extremely low memory consumption.
  • Tolerant of malformed tag nesting, missing end tags, etc.
  • Recognizes CDATA sections, processing instructions, and DOCTYPE declarations.
  • Supports both strict XML and forgiving HTML parsing modes.
  • 20 kB gzipped, including dependencies.
  • Check out TagSoup dependencies: Speedy Entities and Flyweight DOM.

npm install --save-prod tag-soup

DOM parsing

TagSoup exports a preconfigured HTMLDOMParser that parses HTML markup into a DOM node. This parser never throws during parsing and forgives malformed markup:

import { HTMLDOMParser, toHTML } from 'tag-soup';

const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment

toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'

HTMLDOMParser decodes both HTML entities and numeric character references with decodeHTML.

XMLDOMParser parses XML markup into a DOM node. It throws a ParserError if the markup doesn't satisfy the XML spec:

import { XMLDOMParser, toXML } from 'tag-soup';

XMLDOMParser.parseFragment('<p>hello</br>');
// ❌ ParserError: Unexpected end tag.

const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');
// ⮕ DocumentFragment

toXML(fragment);
// ⮕ '<p>hello<br/></p>'

XMLDOMParser decodes both XML entities and numeric character references with decodeXML.

TagSoup uses Flyweight DOM nodes, which provide many standard DOM manipulation features:

import { HTMLDOMParser } from 'tag-soup';

const document = HTMLDOMParser.parseDocument('<!DOCTYPE html><html>hello</html>');

document.doctype.name;
// ⮕ 'html'

document.textContent;
// ⮕ 'hello'

For example, you can use TreeWalker to traverse DOM nodes:

import { XMLDOMParser } from 'tag-soup';
import { TreeWalker, NodeFilter } from 'flyweight-dom';

const fragment = XMLDOMParser.parseFragment('<p>hello<br/></p>');

const treeWalker = new TreeWalker(fragment, NodeFilter.SHOW_TEXT);

treeWalker.nextNode();
// ⮕ Text { 'hello' }

Create a custom DOM parser using createDOMParser:

import { createDOMParser } from 'tag-soup';

const myParser = createDOMParser({
  voidTags: ['br'],
});

myParser.parseFragment('<p><br></p>');
// ⮕ DocumentFragment

SAX parsing

TagSoup exports a preconfigured HTMLSAXParser that parses HTML markup and calls handler methods when a token is read. This parser never throws during parsing and forgives malformed markup:

import { HTMLSAXParser } from 'tag-soup';

HTMLSAXParser.parseFragment('<p>hello<p>cool</br>', {
  onStartTagOpening(tagName) {
    // Called with 'p', 'p', and 'br'
  },
  onText(text) {
    // Called with 'hello' and 'cool'
  },
});

XMLSAXParser parses XML markup and calls handler methods when a token is read. It throws a ParserError if the markup doesn't satisfy the XML spec:

import { XMLSAXParser } from 'tag-soup';

XMLSAXParser.parseFragment('<p>hello</br>', {});
// ❌ ParserError: Unexpected end tag.

XMLSAXParser.parseFragment('<p>hello<br/></p>', {
  onEndTag(tagName) {
    // Called with 'br' and 'p'
  },
});

Create a custom SAX parser using createSAXParser:

import { createSAXParser } from 'tag-soup';

const myParser = createSAXParser({
  voidTags: ['br'],
});

myParser.parseFragment('<p><br></p>', {
  onStartTagOpening(tagName) {
    // Called with 'p' and 'br'
  },
});

SAX handler callbacks

The SAXHandler defines the following optional callbacks. Implement only the ones you need.

Callback Description
onStartTagOpening A start tag name is read.
onAttribute An attribute and its decoded value were read.
onStartTagClosing A start tag is closed with >.
onStartTagSelfClosing A start tag is self-closed with />.
onStartTag A start tag and its attributes were read.
onEndTag An end tag matching an open start tag is read.
onText Decoded text content is read.
onComment A comment is read.
onDoctype A DOCTYPE declaration is read.
onCDATASection A CDATA section is read.
onProcessingInstruction A processing instruction is read.

Example using several callbacks at once:

import { HTMLSAXParser } from 'tag-soup';

HTMLSAXParser.parseFragment('<!-- greeting --><p class="x">hello</p>', {
  onComment(data) {
    // Called with ' greeting '
  },
  onStartTagOpening(tagName) {
    // Called with 'p'
  },
  onAttribute(name, value) {
    // Called with 'class', 'x'
  },
  onStartTagClosing() {
    // Called after all attributes of 'p' are read
  },
  onStartTag(tagName, attributes, isSelfClosing) {
    // Called after onStartTagClosing
  },
  onText(text) {
    // Called with 'hello'
  },
  onEndTag(tagName) {
    // Called with 'p'
  },
});
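
Because a handler is a plain object, it can carry its own state between callbacks. The sketch below collects the href of every link. createLinkCollector is a hypothetical helper, not part of TagSoup, and the callbacks are replayed by hand in the order the table above describes, which only approximates what a parser would do:

```javascript
// Hypothetical helper (not part of TagSoup): returns a SAX-style handler
// together with the array it fills with href values of <a> tags.
function createLinkCollector() {
  const links = [];
  let currentTagName = null;

  const handler = {
    onStartTagOpening(tagName) {
      // Remember which tag the following attributes belong to.
      currentTagName = tagName;
    },
    onAttribute(name, value) {
      if (currentTagName === 'a' && name === 'href') {
        links.push(value);
      }
    },
    onStartTagClosing() {
      currentTagName = null;
    },
    onStartTagSelfClosing() {
      currentTagName = null;
    },
  };

  return { handler, links };
}

const { handler, links } = createLinkCollector();

// With the parser, the handler would be passed as the second argument:
// HTMLSAXParser.parseFragment('<a href="/docs">Docs</a>', handler);

// Replaying the callbacks by hand for '<a href="/docs">Docs</a>':
handler.onStartTagOpening('a');
handler.onAttribute('href', '/docs');
handler.onStartTagClosing();

console.log(links);
```

Keeping the state in a closure means one collector can be created per document, so results from separate parses don't mix.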

Tokenization

TagSoup exports a preconfigured HTMLTokenizer that parses HTML markup and invokes a callback when a token is read. This tokenizer never throws during tokenization and forgives malformed markup:

import { HTMLTokenizer } from 'tag-soup';

HTMLTokenizer.tokenizeFragment('<p>hello<p>cool</br>', (token, startIndex, endIndex) => {
  // Handle token
});

XMLTokenizer parses XML markup and invokes a callback when a token is read. It throws a ParserError if the markup doesn't satisfy the XML spec:

import { XMLTokenizer } from 'tag-soup';

XMLTokenizer.tokenizeFragment('<p>hello</br>', (token, startIndex, endIndex) => {});
// ❌ ParserError: Unexpected end tag.

XMLTokenizer.tokenizeFragment('<p>hello<br/></p>', (token, startIndex, endIndex) => {
  // Handle token
});

Create a custom tokenizer using createTokenizer:

import { createTokenizer } from 'tag-soup';

const myTokenizer = createTokenizer({
  voidTags: ['br'],
});

myTokenizer.tokenizeFragment('<p><br></p>', (token, startIndex, endIndex) => {
  // Handle token
});

The Token passed to the callback is one of the following string literals. startIndex and endIndex are the character positions of the token's value in the input.

Token Description
"TEXT" Text content between tags.
"START_TAG_NAME" The name portion of an opening tag, e.g. p in <p>.
"START_TAG_CLOSING" The > that closes an opening tag.
"START_TAG_SELF_CLOSING" The /> that self-closes a tag.
"END_TAG_NAME" The name portion of a closing tag, e.g. p in </p>.
"ATTRIBUTE_NAME" An attribute name.
"ATTRIBUTE_VALUE" A decoded attribute value.
"COMMENT" Comment content, excluding <!-- and -->.
"PROCESSING_INSTRUCTION_TARGET" The target of a processing instruction, e.g. xml in <?xml ...?>.
"PROCESSING_INSTRUCTION_DATA" The data portion of a processing instruction.
"CDATA_SECTION" Content of a CDATA section, excluding <![CDATA[ and ]]>.
"DOCTYPE_NAME" The name in a DOCTYPE declaration, e.g. html in <!DOCTYPE html>.
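
The callback receives only the token type and its positions, so the value has to be sliced out of the input. This is a minimal sketch, assuming endIndex is exclusive (the description above doesn't say). The token triples are written by hand for illustration and are not actual tokenizer output:

```javascript
const input = '<p>hello</p>';
const values = [];

// Same shape as the callback passed to tokenizeFragment.
function onToken(token, startIndex, endIndex) {
  // Assumption: endIndex is exclusive, as with String.prototype.slice.
  // The slice is raw source text; entity decoding would be a separate step.
  values.push([token, input.slice(startIndex, endIndex)]);
}

// With the tokenizer, the callback would be passed as the second argument:
// HTMLTokenizer.tokenizeFragment(input, onToken);

// Hand-written tokens for the input above (illustrative positions):
onToken('START_TAG_NAME', 1, 2);
onToken('START_TAG_CLOSING', 2, 3);
onToken('TEXT', 3, 8);
onToken('END_TAG_NAME', 10, 11);

console.log(values);
```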

Serialization

TagSoup exports two preconfigured serializers: toHTML and toXML.

import { HTMLDOMParser, toHTML } from 'tag-soup';

const fragment = HTMLDOMParser.parseFragment('<p>hello<p>cool</br>');
// ⮕ DocumentFragment

toHTML(fragment);
// ⮕ '<p>hello</p><p>cool<br></p>'

Create a custom serializer using createSerializer:

import { HTMLDOMParser, createSerializer } from 'tag-soup';

const mySerializer = createSerializer({
  voidTags: ['br'],
});

const fragment = HTMLDOMParser.parseFragment('<p>hello</br>');
// ⮕ DocumentFragment

mySerializer(fragment);
// ⮕ '<p>hello<br></p>'

SerializerOptions accepts the following properties:

Option Description
voidTags Tags that have no content and no closing tag (e.g. br, img).
encodeText Callback to encode text content and attribute values.
areSelfClosingTagsSupported If true, void tags are serialized as <br/> instead of <br>.
areTagNamesCaseInsensitive If true, tag name comparisons are case-insensitive.

Serialize XML with entity encoding:

import { XMLDOMParser, createSerializer } from 'tag-soup';
import { encodeXML } from 'speedy-entities';

const toXMLEncoded = createSerializer({
  areSelfClosingTagsSupported: true,
  encodeText: encodeXML,
});

const fragment = XMLDOMParser.parseFragment('<note><text>AT&amp;T</text></note>');

toXMLEncoded(fragment);
// ⮕ '<note><text>AT&amp;T</text></note>'
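
Any function can serve as encodeText. The sketch below is a hand-rolled encoder, assuming the callback takes a string and returns the encoded string, which matches how encodeXML is passed above. encodeMinimal is a hypothetical name, not part of TagSoup or Speedy Entities:

```javascript
// Hypothetical encoder (not part of TagSoup or Speedy Entities).
// Assumed signature: takes a string, returns the encoded string.
function encodeMinimal(text) {
  return text
    .replace(/&/g, '&amp;') // Must run first so later entities aren't re-encoded.
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// With the serializer, it would be passed as an option:
// const mySerializer = createSerializer({ encodeText: encodeMinimal });

console.log(encodeMinimal('AT&T says "1 < 2"'));
// ⮕ 'AT&amp;T says &quot;1 &lt; 2&quot;'
```

Escaping the ampersand first matters: doing it last would turn an already produced &lt; into &amp;lt;.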

Parser options

createDOMParser, createSAXParser, and createTokenizer accept a ParserOptions object.

Option Description
voidTags Tags that have no content and no end tag (e.g. br, img). See HTML5 Void Elements.
rawTextTags Tags whose content is treated as raw text (e.g. script, style). See HTML5 Raw Text Elements.
decodeText Callback to decode text content and attribute values (e.g. decodeHTML from speedy-entities).
implicitlyClosedTags Map from a tag to the list of open tags it implicitly closes. For example { h1: ['p'] } means an opening <h1> closes any currently open <p>.
implicitlyOpenedTags Tags for which a synthetic start tag is inserted when an unbalanced end tag is encountered (e.g. ['p', 'br'] so </p> becomes <p></p>).
areTagNamesCaseInsensitive If true, tag name comparisons ignore ASCII case.
areCDATASectionsRecognized If true, CDATA sections (<![CDATA[...]]>) are recognized.
areProcessingInstructionRecognized If true, processing instructions (<?target data?>) are recognized.
areSelfClosingTagsRecognized If true, self-closing tags (<br/>) are recognized; otherwise treated as start tags.
isStrict If true, tag names and attributes are validated against XML constraints.
areUnbalancedEndTagsIgnored If true, end tags without a matching start tag are silently dropped instead of throwing.
areUnbalancedStartTagsImplicitlyClosed If true, unclosed start tags are forcefully closed at the end of their parent.

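To make the tag-balancing options more concrete, here is a toy open-tag stack showing how implicitlyClosedTags, ignored unbalanced end tags, and implicitly closed start tags could interact. It is an illustration under simplifying assumptions, not TagSoup's implementation. The balance function and the event objects are hypothetical:

```javascript
// Illustrative only: a toy open-tag stack, not TagSoup's actual algorithm.
// `events` is a hypothetical pre-tokenized input.
function balance(events, options) {
  const { implicitlyClosedTags = {}, voidTags = [] } = options;
  const stack = [];
  let out = '';

  // Closes open tags until only `index` of them remain on the stack.
  const closeDownTo = index => {
    while (stack.length > index) {
      out += '</' + stack.pop() + '>';
    }
  };

  for (const event of events) {
    if (event.type === 'text') {
      out += event.data;
    } else if (event.type === 'start') {
      const closable = implicitlyClosedTags[event.name] || [];
      // Find the nearest open tag that this start tag implicitly closes.
      for (let i = stack.length - 1; i >= 0; i--) {
        if (closable.includes(stack[i])) {
          closeDownTo(i);
          break;
        }
      }
      out += '<' + event.name + '>';
      if (!voidTags.includes(event.name)) {
        stack.push(event.name);
      }
    } else {
      const index = stack.lastIndexOf(event.name);
      // An end tag with no matching start tag is dropped,
      // similar in spirit to areUnbalancedEndTagsIgnored.
      if (index !== -1) {
        closeDownTo(index);
      }
    }
  }

  // Remaining open tags are closed at the end of the input,
  // similar in spirit to areUnbalancedStartTagsImplicitlyClosed.
  closeDownTo(0);
  return out;
}

const balanced = balance(
  [
    { type: 'start', name: 'p' },
    { type: 'text', data: 'hello' },
    { type: 'start', name: 'h1' },
    { type: 'text', data: 'title' },
    { type: 'end', name: 'em' },
  ],
  { implicitlyClosedTags: { h1: ['p'] }, voidTags: ['br'] }
);

console.log(balanced);
// ⮕ '<p>hello</p><h1>title</h1>'
```

The search walks the whole stack, so tags opened after the implicitly closed one are closed along with it. That is a deliberate simplification.
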
A parser that mimics browser HTML behavior:

import { createDOMParser } from 'tag-soup';
import { decodeHTML } from 'speedy-entities';

const myParser = createDOMParser({
  voidTags: [
    'area',
    'base',
    'br',
    'col',
    'embed',
    'hr',
    'img',
    'input',
    'link',
    'meta',
    'param',
    'source',
    'track',
    'wbr',
  ],
  rawTextTags: ['script', 'style'],
  decodeText: decodeHTML,
  areTagNamesCaseInsensitive: true,
  areUnbalancedEndTagsIgnored: true,
  areUnbalancedStartTagsImplicitlyClosed: true,
  implicitlyClosedTags: {
    h1: ['p'],
    h2: ['p'],
    li: ['li'],
    dt: ['dd', 'dt'],
    dd: ['dd', 'dt'],
  },
});

Performance

Execution performance is measured in operations per second (± 5%); higher is better. Memory consumption (RAM) is measured in bytes; lower is better.

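| Library | Library size | DOM parsing, ops/sec | DOM parsing, RAM | SAX parsing, ops/sec | SAX parsing, RAM |
| --- | --- | --- | --- | --- | --- |
| tag-soup@3.2.1 | 21 kB | 35 Hz | 22 MB | 54 Hz | 22 kB |
| htmlparser2@12.0.0 | 34 kB | 15 Hz | 35 MB | 24 Hz | 6 MB |
| parse5@8.0.0 | 45 kB | 7 Hz | 105 MB | 11 Hz | 10 MB |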

Performance was measured while parsing a 3.64 MB HTML file.

Tests were conducted using TooFast on an Apple M1 with Node.js v25.6.0.

To reproduce the performance test suite results, clone this repo and run:

npm ci
npm run build
npm run perf

Limitations

TagSoup doesn't resolve some quirky element structures that malformed HTML may cause.

Assume the following markup:

<p><strong>okay
<p>nope

The browser's DOMParser would transform this markup to:

<p><strong>okay</strong></p>
<p><strong>nope</strong></p>

TagSoup doesn't insert the second strong tag:

<p><strong>okay</strong></p>
<p>nope</p>
