
Span Vs ESpan

Manhal Daaboul edited this page May 10, 2021 · 1 revision

We have the following match dict:

{
    "money-spaces-1": {
        "patterns": [
            [
                {
                    "TEXT": {
                        "REGEX": "(USD|EUR|JPY|GBP|AUD|CAD|CNY|CHF|HKD|NZD)([0-9]{1,})"
                    }
                }
            ]
        ],
        "description": "For currencies, use the currency code-amount format with spaces in between",
        "category": "DATE_TIME_MONEY_NUMBERS",
        "subcategory": "MONEY_C_CODE_SPACES",
        "suggestions": [
            [
                {
                    "PATTERN_REF": 0,
                    "REGEX": "\\1 \\2"
                }
            ]
        ],
        "suggestions_separator": "",
        "test": {
            "positive": [
                "Please pay USD10,000.",
                "Please pay USD10000."
            ],
            "negative": [
                "Please pay USD 10,000."
            ]
        }
    },
    "money-sym-spaces-2": {
        "patterns": [
            [
                {
                    "TEXT": {
                        "REGEX": "(USD)([0-9,\\.]{0,}[0-9]{1,})"
                    }
                }
            ]
        ],
        "description": "For currencies, use the currency symbol-amount format with spaces in between",
        "category": "DATE_TIME_MONEY_NUMBERS",
        "subcategory": "MONEY_C_SYM_AMT_SPACES",
        "suggestions": [
            [
                {
                    "PATTERN_REF": 0,
                    "REGEX": "$ \\2"
                }
            ]
        ],
        "suggestions_separator": " ",
        "test": {
            "positive": [
                "We bought it for USD10.",
                "We bought it for USD10,000."
            ],
            "negative": [
                "We bought it for $ 10,000."
            ]
        }
    }
}

We sent this segment: "We bought it for USD30,000.00"

This produces two matches with the same start, end, and text ("USD30,000.00") but different subcategories. We therefore need to be able to report two separate issues that share a start and end but differ in subcategory, suggestions, and so on.
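The double match can be checked with Python's `re` module alone. This is a minimal sketch that assumes the matcher applies each `REGEX` to the token text as a search, and that "USD30,000.00" is a single token (consistent with the start and end values shown further down). The variable names are illustrative only.

```python
import re

# The two TEXT/REGEX patterns from the match dict, written as Python raw strings.
code_spaces = re.compile(r"(USD|EUR|JPY|GBP|AUD|CAD|CNY|CHF|HKD|NZD)([0-9]{1,})")
sym_spaces = re.compile(r"(USD)([0-9,\.]{0,}[0-9]{1,})")

token_text = "USD30,000.00"

m1 = code_spaces.search(token_text)
m2 = sym_spaces.search(token_text)

# Both rules fire on the same token, so both report the same token span.
print(m1.group(0))  # USD30
print(m2.group(0))  # USD30,000.00
```

The first pattern only consumes "USD30", but because the match is reported per token, both rules still point at the same token range.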

spaCy treats these matches as one span: because they share the same start and end, they also share the same extension storage instead of each getting its own object. So when we set the span extensions, both spans get the same (last) value:

Span1:
start = 4
end = 5
subcategory = MONEY_C_SYM_AMT_SPACES

Span2:
start = 4
end = 5
subcategory = MONEY_C_SYM_AMT_SPACES

Span1 should instead have the subcategory MONEY_C_CODE_SPACES.
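The overwrite can be reproduced without spaCy, under the assumption that extension values live in a shared store keyed by the attribute name and the span boundaries rather than on each span object. `SharedStore` and `SpanView` are hypothetical names used only for this sketch; they are not spaCy classes.

```python
class SharedStore:
    """Hypothetical per-document store for extension values."""

    def __init__(self):
        self.values = {}


class SpanView:
    """Hypothetical span that keeps no extension data of its own."""

    def __init__(self, store, start, end):
        self._store = store
        self.start = start
        self.end = end

    def set(self, name, value):
        # The key depends only on the name and boundaries, not on the object.
        self._store.values[(name, self.start, self.end)] = value

    def get(self, name):
        return self._store.values[(name, self.start, self.end)]


store = SharedStore()
span1 = SpanView(store, 4, 5)
span2 = SpanView(store, 4, 5)

span1.set("subcategory", "MONEY_C_CODE_SPACES")
span2.set("subcategory", "MONEY_C_SYM_AMT_SPACES")  # overwrites span1's value

print(span1.get("subcategory"))  # MONEY_C_SYM_AMT_SPACES
```

Although `span1` and `span2` are distinct objects here, they read and write the same key, so the last write wins.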

To solve this problem, we extended spaCy's Span and created ESpan, which stores every extension as an attribute of the class, so each match carries its own values.
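The idea can be illustrated with a plain Python class. This is only a sketch of the principle, not the actual ESpan: the real class extends spaCy's Span, whereas `ESpanSketch` and its field names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ESpanSketch:
    """Hypothetical stand-in for ESpan: values live on the object itself."""

    start: int
    end: int
    text: str
    subcategory: str = ""
    suggestions: List[str] = field(default_factory=list)


span1 = ESpanSketch(4, 5, "USD30,000.00", subcategory="MONEY_C_CODE_SPACES")
span2 = ESpanSketch(4, 5, "USD30,000.00", subcategory="MONEY_C_SYM_AMT_SPACES")

# Same boundaries and text, but each object keeps its own subcategory.
print(span1.subcategory)  # MONEY_C_CODE_SPACES
print(span2.subcategory)  # MONEY_C_SYM_AMT_SPACES
```

Because nothing is looked up by start and end, two matches over the same token range no longer interfere with each other.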
