DITA Topic to Keymap Converter

A Python command-line tool that extracts standard references from DITA XML topic files and generates DITA key definition maps.

Overview

This tool parses DITA topic files containing structured lists of standards and documentation references, then automatically generates a DITA keymap file with key definitions that can be reused across your documentation project.

Features

  • Multiple XML Structure Support: Handles various DITA XML structures for standard references
  • Intelligent Parsing: Uses a chain-of-responsibility pattern with specialized handlers for different XML patterns
  • Automatic ID Generation: Creates standardized IDs when not explicitly provided
  • Debugging Tools: Extract and inspect individual elements for troubleshooting
  • Configurable Logging: Multiple verbosity levels for detailed operation insight

Installation

# clone the repository:
git clone git@github.com:elizaluszczyk/dita-topic-to-keymap.git
cd dita-topic-to-keymap

# install the package in editable mode, then the development dependencies:
pip install -e .
pip install -r ./requirements/dev.txt
pre-commit install

Usage

Parse DITA Topic File

Convert a DITA topic file to a keymap:

ditatk parse input.xml

With custom output file:

ditatk parse input.xml -o standards-keymap.xml

With verbose logging:

# INFO level
ditatk parse input.xml -v

# DEBUG level (most detailed)
ditatk parse input.xml -vv

Extract Individual Elements

Extract and display a specific list item element (useful for debugging):

ditatk extract input.xml 5

This displays the 5th <li> element from the input file.
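As a rough illustration of what such an extraction involves, the sketch below selects the Nth <li> in document order using Python's standard xml.etree.ElementTree. The function name extract_list_item and the 1-based indexing are assumptions made for this example; the tool's own implementation may differ.

```python
import xml.etree.ElementTree as ET


def extract_list_item(xml_text: str, position: int) -> str:
    """Return the serialized <li> at a 1-based position (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    items = list(root.iter("li"))  # all <li> elements in document order
    if not 1 <= position <= len(items):
        raise IndexError(f"no <li> element at position {position}")
    item = items[position - 1]
    item.tail = None  # drop any whitespace that follows the element
    return ET.tostring(item, encoding="unicode")


sample = (
    "<topic><body><ul>"
    "<li>IEEE 802.11</li>"
    "<li><keyword>ISO 27001</keyword></li>"
    "</ul></body></topic>"
)
print(extract_list_item(sample, 2))
```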

Supported XML Structures

The tool handles eight different XML patterns commonly found in DITA topics:

1. Keyword with ID

<li>
    <keyword id="iso-9001">ISO 9001</keyword>
</li>

Handler: KeywordWithIdHandler
Result: (iso-9001, "ISO 9001")

2. Keyword without ID (with list item ID)

<li id="std-iso-14001">
    <keyword>ISO 14001</keyword>
</li>

Handler: KeywordWithoutIdHandler
Result: (std-iso-14001, "ISO 14001")

3. Keyword without ID (auto-generated)

<li>
    <keyword>ISO 27001</keyword>
</li>

Handler: KeywordWithoutIdHandler
Result: (std_iso-27001, "ISO 27001") - ID auto-generated from keyword text

4. List Item with Text Only (no keyword, no ID)

<li>IEEE 802.11</li>

Handler: ListItemWithoutKeywordHandler
Result: (std_ieee-802-11, "IEEE 802.11") - ID auto-generated from text
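The two auto-generated IDs shown in patterns 3 and 4 are consistent with a simple slug rule: lowercase the text, replace each run of non-alphanumeric characters with a hyphen, and add a std_ prefix. The sketch below reproduces those two results under that assumption; generate_id is a hypothetical name, and the tool's actual rule may handle other inputs differently.

```python
import re


def generate_id(text: str, prefix: str = "std_") -> str:
    """Build an ID from free text (assumed rule, inferred from the examples)."""
    # Lowercase, collapse every run of non-alphanumerics into one hyphen,
    # then trim hyphens left at either end.
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return f"{prefix}{slug}"


print(generate_id("ISO 27001"))    # std_iso-27001
print(generate_id("IEEE 802.11"))  # std_ieee-802-11
```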

5. List Item with ID and Citation (no keyword)

<li id="nist-sp-800-53">
    <cite>NIST Special Publication 800-53</cite>
</li>

Handler: ListItemWithoutKeywordHandler
Result: (nist-sp-800-53, "NIST Special Publication 800-53")

6. List Item with Citation Containing Text

<li id="fips-140-2">
    <cite>Federal Information Processing Standard 140-2</cite>
    <keyword keyref="nist-fips"/>
</li>

Handler: ListItemWithCiteHandler
Result: (fips-140-2, "Federal Information Processing Standard 140-2") - Uses cite text as description

7. Citation with Keyword Reference and Tail Text

<li id="rfc-7540">
    <cite><keyword keyref="ietf-rfc"/>Hypertext Transfer Protocol Version 2 (HTTP/2)</cite>
</li>

Handler: ListItemWithCiteHandler
Result: (rfc-7540, "Hypertext Transfer Protocol Version 2 (HTTP/2)") - Uses only the text following the keyword reference
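In ElementTree-style XML APIs, text that follows a child element is stored on that child as its tail rather than on the parent. The sketch below shows how the description in pattern 7 could be read that way with Python's standard library; it illustrates the XML mechanics only and is not taken from ListItemWithCiteHandler itself.

```python
import xml.etree.ElementTree as ET

li = ET.fromstring(
    '<li id="rfc-7540">'
    '<cite><keyword keyref="ietf-rfc"/>'
    "Hypertext Transfer Protocol Version 2 (HTTP/2)</cite>"
    "</li>"
)

cite = li.find("cite")
keyword = cite.find("keyword")

# The text after <keyword .../> lives in keyword.tail, not in cite.text.
description = (keyword.tail or "").strip()
result = (li.get("id"), description)
print(result)
```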

8. Keyword Nested in Citation

<li>
    <cite>
        <keyword id="gdpr">General Data Protection Regulation</keyword>
    </cite>
</li>

Handler: KeywordNestedInCiteHandler
Result: (gdpr, "General Data Protection Regulation")

Handling Unparseable Elements

Elements Without Descriptions

The tool may encounter <li> elements that no handler can process. This happens when an element lacks the content needed to generate a keymap entry (for example, no keyword text, no citation text, or no content at all). Each such element is reported in a WARNING message that includes the element's number.

Example scenario:

$ ditatk parse data/r_standards.xml
[2025-10-08 18:41:01,991] dita_topic_to_keymap.cli [WARNING] No handler was able to parse element 286
[2025-10-08 18:41:01,991] dita_topic_to_keymap.cli [WARNING] No handler was able to parse element 287
[2025-10-08 18:41:01,991] dita_topic_to_keymap.cli [WARNING] No handler was able to parse element 288

Output Format

Generated keymap files are DITA maps in which each extracted reference becomes a <keydef> element:

<?xml version="1.0" ?>
<!DOCTYPE map PUBLIC '-//OASIS//DTD DITA Map//EN' 'map.dtd'>
<!-- Generated automatically on 2025-10-08 16:15:34 -->
<map>
   <title>Standards and Documentation Key Definitions</title>
   <keydef keys="iso-9001">
      <topicmeta>
         <keywords>
            <keyword>ISO 9001</keyword>
         </keywords>
      </topicmeta>
   </keydef>
   <!-- Additional keydef elements... -->
</map>
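A map with this shape can be assembled from (key, description) pairs using the standard library alone. The sketch below is one possible way to do it, not the tool's actual writer: build_keymap is a hypothetical name, the DOCTYPE declaration and the timestamp comment are omitted for brevity, and ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_keymap(entries, title="Standards and Documentation Key Definitions"):
    """Serialize (key, description) pairs as a DITA map (hypothetical helper)."""
    root = ET.Element("map")
    ET.SubElement(root, "title").text = title
    for key, description in entries:
        # One <keydef> per entry, with the description nested as a keyword.
        keydef = ET.SubElement(root, "keydef", keys=key)
        topicmeta = ET.SubElement(keydef, "topicmeta")
        keywords = ET.SubElement(topicmeta, "keywords")
        ET.SubElement(keywords, "keyword").text = description
    ET.indent(root, space="   ")  # pretty-print; Python 3.9+
    return ET.tostring(root, encoding="unicode")


keymap_xml = build_keymap([("iso-9001", "ISO 9001")])
print(keymap_xml)
```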

Handler Chain

The tool uses specialized handlers in priority order:

  1. KeywordWithIdHandler - Processes <keyword> with explicit id
  2. KeywordWithoutIdHandler - Processes <keyword> without id
  3. ListItemWithoutKeywordHandler - Processes <li> without <keyword>
  4. ListItemWithCiteHandler - Processes <li> with <cite> elements
  5. KeywordNestedInCiteHandler - Processes <keyword> nested in <cite>
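The general idea of a chain of responsibility can be sketched as follows: each handler either returns an (id, description) pair or passes the element to its successor, and a warning is logged when the chain is exhausted. The two classes below are simplified stand-ins that borrow names from the list above; their matching rules, the Handler base class, parse_items, and the 1-based element numbering are all assumptions made for illustration.

```python
import logging
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

logger = logging.getLogger("handler_chain_sketch")
Entry = Tuple[str, str]


class Handler:
    """Base link in the chain (hypothetical); delegates when it cannot parse."""

    def __init__(self, successor: Optional["Handler"] = None):
        self.successor = successor

    def try_parse(self, li: ET.Element) -> Optional[Entry]:
        raise NotImplementedError

    def handle(self, li: ET.Element) -> Optional[Entry]:
        result = self.try_parse(li)
        if result is not None:
            return result
        if self.successor is not None:
            return self.successor.handle(li)
        return None  # end of chain: nobody could parse the element


class KeywordWithIdHandler(Handler):
    def try_parse(self, li):
        keyword = li.find("keyword")
        if keyword is not None and keyword.get("id") and keyword.text:
            return keyword.get("id"), keyword.text.strip()
        return None


class KeywordWithoutIdHandler(Handler):
    def try_parse(self, li):
        keyword = li.find("keyword")
        if keyword is not None and keyword.text and li.get("id"):
            return li.get("id"), keyword.text.strip()
        return None


chain = KeywordWithIdHandler(KeywordWithoutIdHandler())


def parse_items(xml_text: str):
    entries = []
    for number, li in enumerate(ET.fromstring(xml_text).iter("li"), start=1):
        entry = chain.handle(li)
        if entry is None:
            logger.warning("No handler was able to parse element %d", number)
        else:
            entries.append(entry)
    return entries


sample = (
    "<ul>"
    '<li><keyword id="iso-9001">ISO 9001</keyword></li>'
    '<li id="std-iso-14001"><keyword>ISO 14001</keyword></li>'
    "<li/>"
    "</ul>"
)
entries = parse_items(sample)
print(entries)
```

Keeping each XML pattern in its own class means a new pattern can be supported by adding one link to the chain, without touching the existing handlers.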

Logging Levels

  • No flag: WARNING (default)
  • -v: INFO
  • -vv: DEBUG
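A repeated -v flag is commonly implemented with argparse's count action. The sketch below maps the count to the levels listed above; the parser name, the verbosity destination, and the level_for helper are hypothetical and may not match how ditatk wires its own CLI.

```python
import argparse
import logging


def level_for(verbosity: int) -> int:
    """Map the number of -v flags to a logging level (assumed mapping)."""
    if verbosity <= 0:
        return logging.WARNING
    if verbosity == 1:
        return logging.INFO
    return logging.DEBUG  # -vv and anything beyond


parser = argparse.ArgumentParser(prog="ditatk-sketch")
# action="count" increments the value each time -v appears (-vv gives 2).
parser.add_argument("-v", action="count", default=0, dest="verbosity")

args = parser.parse_args(["-vv"])
logging.basicConfig(level=level_for(args.verbosity))
print(logging.getLevelName(level_for(args.verbosity)))
```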

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Eliza Łuszczyk
