Possible mismatch between tokenisations #27

Currently, in our sentence-to-circuit pipeline, we feed a tokenised version of the input `sentence` to Bobcat to generate the CCG parse, and hence the expr. The tokenisation is done by the `SpacyTokeniser` in `lambeq`.

However, the original, untokenised `sentence` is later fed into a spaCy model to generate coreference chains.

If the spaCy model produces a different tokenisation from the lambeq `SpacyTokeniser`, things will break (e.g. different boxes, mismatched word indices).
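To make the failure mode concrete, here is a minimal sketch of a check that reports where two tokenisations diverge. The helper `first_mismatch` is hypothetical (it is not part of `lambeq` or spaCy), and the two token lists are hand-written for illustration rather than real output from either tokeniser.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_mismatch(tokens_a: Sequence[str],
                   tokens_b: Sequence[str]) -> Optional[int]:
    """Return the first index at which the token lists differ, or None if they are identical."""
    # zip_longest pads the shorter list with None, so a length difference
    # is also reported as a mismatch.
    for i, (a, b) in enumerate(zip_longest(tokens_a, tokens_b)):
        if a != b:
            return i
    return None


# Hand-written examples, NOT real tokeniser output.
parser_tokens = ["Alice", "did", "n't", "see", "Bob"]  # what the CCG parser might see
coref_tokens = ["Alice", "didn't", "see", "Bob"]       # what the coref model might see

idx = first_mismatch(parser_tokens, coref_tokens)
if idx is not None:
    # From this index onwards, a word index from one side points at a
    # different word on the other side (index 3 is "see" vs "Bob" here).
    print(f"tokenisations diverge at index {idx}")
```

A check like this could be run as an assertion in the pipeline, so that a mismatch fails loudly instead of silently producing wrong word indices.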

We should find some way of ensuring the same tokenisation is used in both places.
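If forcing a single tokenisation turns out to be awkward, a fallback might be to translate word indices between the two tokenisations via character offsets. The sketch below is only an illustration of that idea: `token_spans` and `align` are hypothetical helpers, the token lists are hand-written, and it assumes every token appears verbatim in the sentence (which may not hold if a tokeniser normalises text).

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def token_spans(sentence: str, tokens: Sequence[str]) -> List[Span]:
    """Locate each token in the sentence, scanning left to right."""
    spans = []
    cursor = 0
    for tok in tokens:
        start = sentence.find(tok, cursor)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {cursor}")
        end = start + len(tok)
        spans.append((start, end))
        cursor = end
    return spans


def align(spans_from: Sequence[Span], spans_to: Sequence[Span]) -> List[List[int]]:
    """For each span in spans_from, list the indices of overlapping spans in spans_to."""
    return [
        [j for j, (s2, e2) in enumerate(spans_to) if s1 < e2 and s2 < e1]
        for (s1, e1) in spans_from
    ]


sentence = "Alice didn't see Bob"
# Hand-written examples, NOT real tokeniser output.
parser_tokens = ["Alice", "did", "n't", "see", "Bob"]
coref_tokens = ["Alice", "didn't", "see", "Bob"]

# Map each coref token index to the parser token indices it covers.
coref_to_parser = align(token_spans(sentence, coref_tokens),
                        token_spans(sentence, parser_tokens))
print(coref_to_parser)
```

The simpler route is probably still to tokenise once and reuse those tokens for both the parse and the coreference step; spaCy's `Doc` can be constructed from a pre-tokenised word list (`Doc(nlp.vocab, words=tokens)`), though whether that fits our coref setup is untested.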

<img width="671" alt="image" src="https://github.com/CQCL/text_to_discocirc/assets/93722876/dff379ed-c7f8-4b55-a318-2fb7b72d59ec">


