Currently, in our sentence-to-circuit pipeline, we feed a tokenised version of the input sentence to Bobcat to generate the CCG parse, and hence the expr (the tokenisation is done by the SpacyTokeniser in lambeq).
However, the original (untokenised) sentence is later fed into a spaCy model to generate coreference chains.
If the spaCy model tokenises the sentence differently from the lambeq SpacyTokeniser, the two stages disagree and things break (e.g. different boxes, mismatched word indices).
We should find some way of ensuring the same tokenisation is used in both places.
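
As a stopgap until both stages share one tokenisation, a cheap guard could at least make the failure loud instead of silently producing mismatched indices. The sketch below is only an illustration: `first_token_mismatch` and `assert_same_tokenisation` are hypothetical helper names, not existing lambeq or spaCy functions. It assumes the first list is what the SpacyTokeniser produced for Bobcat and the second is the token texts read off the spaCy doc used for coreference.

```python
from typing import Optional, Sequence, Tuple

# (index, parser token or None, coref token or None)
Mismatch = Tuple[int, Optional[str], Optional[str]]


def first_token_mismatch(
    parser_tokens: Sequence[str], coref_tokens: Sequence[str]
) -> Optional[Mismatch]:
    """Return the first position where the two tokenisations disagree.

    Returns None when both sequences are identical. A token is reported
    as None when one sequence is shorter than the other.
    """
    for i in range(max(len(parser_tokens), len(coref_tokens))):
        a = parser_tokens[i] if i < len(parser_tokens) else None
        b = coref_tokens[i] if i < len(coref_tokens) else None
        if a != b:
            return (i, a, b)
    return None


def assert_same_tokenisation(
    parser_tokens: Sequence[str], coref_tokens: Sequence[str]
) -> None:
    """Raise ValueError if the parser and coref stages saw different tokens."""
    mismatch = first_token_mismatch(parser_tokens, coref_tokens)
    if mismatch is not None:
        i, a, b = mismatch
        raise ValueError(
            f"Tokenisation mismatch at index {i}: parser saw {a!r}, "
            f"coref model saw {b!r}"
        )


# Hypothetical usage: the two lists stand in for the output of the
# lambeq tokeniser and the token texts of the spaCy coref doc.
assert_same_tokenisation(["Alice", "sleeps", "."], ["Alice", "sleeps", "."])
```

A check like this only detects the problem. Removing it would mean giving the coreference model the same tokens that Bobcat saw, for instance by building the spaCy doc from the pre-tokenised words rather than from the raw string, if the coreference component supports that.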