Span Vs ESpan
Manhal Daaboul edited this page May 10, 2021 · 1 revision
We have the following match dict:
{
  "money-spaces-1": {
    "patterns": [
      [
        {
          "TEXT": {
            "REGEX": "(USD|EUR|JPY|GBP|AUD|CAD|CNY|CHF|HKD|NZD)([0-9]{1,})"
          }
        }
      ]
    ],
    "description": "For currencies, use the currency code-amount format with spaces in between",
    "category": "DATE_TIME_MONEY_NUMBERS",
    "subcategory": "MONEY_C_CODE_SPACES",
    "suggestions": [
      [
        {
          "PATTERN_REF": 0,
          "REGEX": "\\1 \\2"
        }
      ]
    ],
    "suggestions_separator": "",
    "test": {
      "positive": [
        "Please pay USD10,000.",
        "Please pay USD10000."
      ],
      "negative": [
        "Please pay USD 10,000."
      ]
    }
  },
  "money-sym-spaces-2": {
    "patterns": [
      [
        {
          "TEXT": {
            "REGEX": "(USD)([0-9,\\.]{0,}[0-9]{1,})"
          }
        }
      ]
    ],
    "description": "For currencies, use the currency symbol-amount format with spaces in between",
    "category": "DATE_TIME_MONEY_NUMBERS",
    "subcategory": "MONEY_C_SYM_AMT_SPACES",
    "suggestions": [
      [
        {
          "PATTERN_REF": 0,
          "REGEX": "$ \\2"
        }
      ]
    ],
    "suggestions_separator": " ",
    "test": {
      "positive": [
        "We bought it for USD10.",
        "We bought it for USD10,000."
      ],
      "negative": [
        "We bought it for $ 10,000."
      ]
    }
  }
}
We sent this segment: "We bought it for USD30,000.00"
This produces two matches with the same start, end, and text ("USD30,000.00") but different subcategories, so we need to be able to present two distinct issues that share a start and end yet differ in subcategory, suggestions, etc.
spaCy treats these matches as one span: because the start and end are identical, the extension values are shared between them instead of being stored on a separate object per match. So when we set the span extensions, both spans get the same (last) value:
Span1:
start = 4
end = 5
subcategory = MONEY_C_SYM_AMT_SPACES
Span2:
start = 4
end = 5
subcategory = MONEY_C_SYM_AMT_SPACES
Span1, however, should have subcategory MONEY_C_CODE_SPACES.
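The behaviour described above can be reproduced with a short script. This is a minimal sketch, not the project's actual code: it assumes spaCy v3 with a blank English pipeline, registers a single `subcategory` extension, and uses a hypothetical `SUBCATEGORIES` lookup in place of the full match dict. Only the two `patterns` entries are taken from the match dict above.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Blank pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

# Register the custom extension (force=True so the script can be re-run).
Span.set_extension("subcategory", default=None, force=True)

matcher = Matcher(nlp.vocab)
matcher.add(
    "money-spaces-1",
    [[{"TEXT": {"REGEX": r"(USD|EUR|JPY|GBP|AUD|CAD|CNY|CHF|HKD|NZD)([0-9]{1,})"}}]],
)
matcher.add(
    "money-sym-spaces-2",
    [[{"TEXT": {"REGEX": r"(USD)([0-9,\.]{0,}[0-9]{1,})"}}]],
)

# Hypothetical lookup standing in for the "subcategory" fields of the match dict.
SUBCATEGORIES = {
    "money-spaces-1": "MONEY_C_CODE_SPACES",
    "money-sym-spaces-2": "MONEY_C_SYM_AMT_SPACES",
}

doc = nlp("We bought it for USD30,000.00")

spans = []
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    # Each match tries to record its own subcategory on the span.
    span._.subcategory = SUBCATEGORIES[nlp.vocab.strings[match_id]]
    spans.append(span)

# Both spans cover the same token, and both report the value written last.
for span in spans:
    print(span.start, span.end, span.text, span._.subcategory)
```

Both printed lines should show the same subcategory, whichever rule was processed last. The likely reason, as far as we understand spaCy's extension mechanism, is that `span._` values are stored on the `Doc` and looked up by the span's position, so two spans over the same tokens read and write the same slot.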
To solve this problem, we extended the spaCy Span class and created ESpan, which holds every extension as a class attribute.
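A rough sketch of the idea follows. It is an illustration under assumptions, not the project's ESpan implementation: the attribute names `subcategory` and `suggestions` are borrowed from the match dict, and the sketch assumes that spaCy's `Span` can be subclassed from Python and constructed as `ESpan(doc, start, end)`.

```python
import spacy
from spacy.tokens import Span


class ESpan(Span):
    """Hypothetical Span subclass that keeps issue data on the object itself."""

    # Class attributes act as defaults; assigning on an instance shadows them,
    # so each ESpan carries its own values even when start and end coincide.
    subcategory = None
    suggestions = None


nlp = spacy.blank("en")
doc = nlp("We bought it for USD30,000.00")

# Two issues over the same token, each with its own subcategory.
code_issue = ESpan(doc, 4, 5)
code_issue.subcategory = "MONEY_C_CODE_SPACES"

symbol_issue = ESpan(doc, 4, 5)
symbol_issue.subcategory = "MONEY_C_SYM_AMT_SPACES"

print(code_issue.text, code_issue.subcategory)
print(symbol_issue.text, symbol_issue.subcategory)
```

Because the values live on each Python object rather than in storage shared by position, the two issues stay independent while still behaving like ordinary spans (`text`, `start`, `end`, and so on).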