Strange behavior of tokenize(.., only_ci=True) #10

@lumpidu

The following snippet gives inconsistent results:

from reynir_correct import tokenize

texts = ["Skúta", "300 ára gömul írsk skúta fundin við Suður-Noreg"]
for text in texts:
    # only_ci=True is the option under investigation
    for tok in tokenize(text, only_ci=True):
        if tok.txt:
            print(f"{tok.txt:12} {tok.error_code:8} {tok.error_description}")

Output:

Skúta                 
300                   
ára                   
gömul                 
írsk                  
skúta        U001     Óþekkt orð: 'skúta'
fundin                
við                   
Suður-Noreg

The correctly spelled word skúta is flagged as unknown (U001) when it appears inside the sentence, but not when it is passed as a standalone word ("Skúta"). Calling tokenize() without any options works as expected.

It is also not clear from the documentation what exactly the only_ci option does.
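
To make the difference between the two call styles easier to reproduce, here is a minimal sketch that collects the flagged tokens for the same text with and without only_ci. The helper names flagged_tokens and compare_options are hypothetical and not part of reynir_correct; the sketch only assumes what the snippet above already uses, namely that tokenize() accepts only_ci and that tokens expose txt and error_code.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def flagged_tokens(
    tokenize_fn: Callable[..., Iterable], text: str, **options
) -> List[Tuple[str, str]]:
    """Return (token text, error code) for every token that carries an error code.

    tokenize_fn is expected to behave like reynir_correct.tokenize
    (assumption: tokens expose .txt and .error_code, as in the snippet above).
    """
    flagged = []
    for tok in tokenize_fn(text, **options):
        # getattr with a default is a defensive assumption, in case some
        # token types do not define error_code at all
        code = getattr(tok, "error_code", "")
        if tok.txt and code:
            flagged.append((tok.txt, code))
    return flagged


def compare_options(
    tokenize_fn: Callable[..., Iterable], text: str
) -> Dict[str, List[Tuple[str, str]]]:
    """Run the same text through the default call and through only_ci=True."""
    return {
        "default": flagged_tokens(tokenize_fn, text),
        "only_ci": flagged_tokens(tokenize_fn, text, only_ci=True),
    }


# Intended usage, assuming reynir_correct is installed:
#   from reynir_correct import tokenize
#   print(compare_options(tokenize, "300 ára gömul írsk skúta fundin við Suður-Noreg"))
```

If the behavior described above holds, the "only_ci" entry would list skúta with U001 while the "default" entry would not, which narrows the problem down to that one option.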
