Feature: Add support for stripping XML comments and declarations #41

Open
GunwooYun wants to merge 1 commit into ooxi:master from GunwooYun:master

@GunwooYun

Description

This PR implements a pre-processing step in xml_parse_document to strip XML comments (<!-- ... -->) and XML declarations (<? ... ?>) from the input buffer before parsing.

Prior to this change, the parser would likely fail on comments or declarations, since it expects a restricted subset of XML. With this change, the parser can handle more typical XML files that include comments or a declaration header.

Implementation Details

  • Modified xml_parse_document in src/xml.c.
  • Implemented an in-place buffer modification loop that shifts valid bytes to overwrite comments and declarations.
  • Updated document_length to reflect the new, shorter content length.
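
The loop described above can be sketched roughly as follows. This is not the code from the PR: the helper names `starts_with` and `strip_comments_and_declarations` are hypothetical, and the choice to leave an unterminated marker untouched is an assumption made for this illustration. A read index scans the buffer, a write index trails behind it, and every byte outside a `<!-- ... -->` or `<? ... ?>` span is copied down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if the `left` bytes starting at p begin with the literal `lit`. */
static int starts_with(const uint8_t *p, size_t left, const char *lit) {
    size_t n = strlen(lit);
    return left >= n && memcmp(p, lit, n) == 0;
}

/*
 * Hypothetical helper, not the PR's actual code.
 * Removes <!-- ... --> and <? ... ?> spans from buf in place and returns
 * the new, shorter length. Naive by design: markers are matched without
 * regard to context (for example inside attribute values).
 */
static size_t strip_comments_and_declarations(uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t r = 0; /* read index  */
    size_t w = 0; /* write index */

    while (r < len) {
        const char *close = NULL;
        size_t open_len = 0;

        if (starts_with(buf + r, len - r, "<!--")) {
            close = "-->";
            open_len = 4;
        } else if (starts_with(buf + r, len - r, "<?")) {
            close = "?>";
            open_len = 2;
        }

        if (close != NULL) {
            size_t i = r + open_len;
            while (i < len && !starts_with(buf + i, len - i, close)) {
                i++;
            }
            if (i >= len) {
                /* Assumption: an unterminated marker is left as-is, so the
                 * remaining bytes are kept and the parser sees them. */
                memmove(buf + w, buf + r, len - r);
                w += len - r;
                break;
            }
            r = i + strlen(close); /* skip the whole span */
            continue;
        }

        buf[w++] = buf[r++]; /* ordinary byte: shift it down */
    }
    return w;
}
```

A caller would run this on the input buffer first and then continue parsing with the returned length in place of the original one.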

Testing

I have created a test suite in app/, including a custom test runner (app/test_runner.c) and several test cases:

  • Basic Comments:

    <!-- comment -->
    <root>content</root>
  • Multiline Comments:

    <!-- 
     multi 
     line 
     -->
    <root>content</root>
  • XML Declarations:

    <?xml version="1.0"?>
    <root>content</root>
  • Comments in Nested Elements:

    <root>
        <level1>
            <!-- level 1 comment -->
            <level2>...</level2>
        </level1>
        <!-- sibling comment -->
    </root>
  • Consecutive Comments:

    <!-- 1 --><!-- 2 -->
    <root>...</root>

All of the above tests pass.

Known Limitations

The current implementation is naive: it searches for the <!-- and --> markers without regard to the surrounding context.

  • Tricky Attributes: If a comment marker appears inside an attribute value (e.g., <node attr="<!--">), it will be incorrectly identified as a comment start, corrupting the XML.
  • Unclosed Comments: An unclosed comment might be interpreted as a node named !-- or cause malformed output, because the pre-processing step does not validate comment syntax.

I have added a warning comment in the code highlighting these limitations. For the intended lightweight use case of this library, this trade-off was chosen for simplicity and performance (O(n) single pass).
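
For comparison, the attribute-value limitation could be avoided with a small amount of state. The sketch below is not part of this PR, and `strip_quote_aware` and `starts_with` are hypothetical names. It tracks whether the scanner is inside a tag and inside a quoted attribute value, and only recognizes markers outside of tags. It stays a single pass, but it still does not handle CDATA sections or validate comment syntax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if the `left` bytes starting at p begin with the literal `lit`. */
static int starts_with(const uint8_t *p, size_t left, const char *lit) {
    size_t n = strlen(lit);
    return left >= n && memcmp(p, lit, n) == 0;
}

/*
 * Hypothetical quote-aware variant, not the PR's code.
 * Strips <!-- ... --> and <? ... ?> in place, but copies everything inside
 * a tag (including quoted attribute values) verbatim. Returns the new length.
 */
static size_t strip_quote_aware(uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t r = 0, w = 0;
    int in_tag = 0;    /* between '<' of an element tag and its '>' */
    uint8_t quote = 0; /* active quote character, or 0 if none */

    while (r < len) {
        uint8_t c = buf[r];

        if (quote != 0) {
            if (c == quote) quote = 0;          /* attribute value ends */
        } else if (in_tag) {
            if (c == '"' || c == '\'') quote = c; /* attribute value starts */
            else if (c == '>') in_tag = 0;        /* tag ends */
        } else if (c == '<') {
            const char *close = NULL;
            size_t open_len = 0;
            if (starts_with(buf + r, len - r, "<!--")) {
                close = "-->";
                open_len = 4;
            } else if (starts_with(buf + r, len - r, "<?")) {
                close = "?>";
                open_len = 2;
            }
            if (close != NULL) {
                size_t i = r + open_len;
                while (i < len && !starts_with(buf + i, len - i, close)) i++;
                if (i >= len) {
                    /* Unterminated marker: keep the rest unchanged. */
                    memmove(buf + w, buf + r, len - r);
                    w += len - r;
                    break;
                }
                r = i + strlen(close); /* skip the whole span */
                continue;
            }
            in_tag = 1; /* an ordinary element tag begins */
        }

        buf[w++] = buf[r++];
    }
    return w;
}
```

The cost is two extra state variables and a few branches per byte, which is the kind of complexity the naive approach deliberately avoids.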
