Home

Welcome to the pugihtml wiki!

WARNING!

IMPORTANT: pugihtml is a work-in-progress and it's not yet fully compliant with the HTML specifications!

If you're looking for a fully compliant HTML parser, then you're looking in the wrong place!

I do intend to make it fully compliant! If anybody is interested in helping, then please let me know.

WARNING!

pugihtml is a C++ HTML processing library based on pugixml. It consists of a DOM-like interface with rich traversal/modification capabilities, an extremely fast HTML parser which constructs the DOM tree from an HTML file/buffer, and an XPath 1.0 implementation for complex data-driven tree queries. Full Unicode support is also available, with Unicode interface variants and conversions between different Unicode encodings (which happen automatically during parsing/saving).

At this stage pugihtml is still essentially an XML parser, and it does not claim compliance with the HTML specifications. Compliance will be implemented as soon as possible, but for now I've made some modifications to pugixml which allow it to parse HTML:

Automatically close all opened tags that don't have a matching closing tag.
Normalize HTML tag capitalization: all tag names are forced to upper case so that matching opening and closing tags are recognized correctly.
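
The two modifications listed above can be pictured with a small standalone sketch. This is not pugihtml's actual code and does not use its API: `TagToken`, `normalize_tag` and `balance_tags` are hypothetical names made up for illustration, and the exact point at which a missing closing tag gets inserted (here: when an enclosing tag closes, or at end of input) is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token type: a tag name plus whether it is a closing tag.
struct TagToken {
    std::string name;
    bool closing;
};

// Force a tag name to upper case so that "div" and "DIV" compare equal.
std::string normalize_tag(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return name;
}

// Return the tag sequence with normalized names and with closing tags
// inserted for every opened tag that was never explicitly closed.
std::vector<TagToken> balance_tags(const std::vector<TagToken>& input) {
    std::vector<std::string> open;  // stack of currently open tag names
    std::vector<TagToken> out;

    for (const TagToken& t : input) {
        const std::string name = normalize_tag(t.name);
        if (!t.closing) {
            open.push_back(name);
            out.push_back({name, false});
            continue;
        }
        // Look for the nearest open tag with the same name.
        auto match = std::find(open.rbegin(), open.rend(), name);
        if (match == open.rend()) {
            continue;  // stray closing tag: simply dropped in this sketch
        }
        // Close everything opened after the match, then the match itself.
        while (open.back() != name) {
            out.push_back({open.back(), true});
            open.pop_back();
        }
        out.push_back({name, true});
        open.pop_back();
    }

    // End of input: close whatever is still open, innermost first.
    while (!open.empty()) {
        out.push_back({open.back(), true});
        open.pop_back();
    }
    return out;
}
```

For example, feeding the tag sequence `<div> <p> </DIV> <span>` through `balance_tags` yields `<DIV> <P> </P> </DIV> <SPAN> </SPAN>`: the names are upper-cased, the unclosed `P` is closed when its parent closes, and the unclosed `SPAN` is closed at the end of input.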

Some things had to be "broken" in order to get the XML parser to start parsing HTML, so please take a look at the list of issues for more information on what's not working in the current version.
