Possible mismatch between tokenisations #27

@JosephNathaniel

Currently, in our sentence-to-circuit pipeline, we feed a tokenised version of the input sentence to Bobcat to generate the CCG parse, and hence the expr. The tokenisation is done by the SpacyTokeniser in lambeq.

However, the original sentence is later fed into a spaCy model to generate coreference chains.

If the spaCy model tokenises the sentence differently from the lambeq SpacyTokeniser, things will break (e.g. different boxes, mismatched word indices).
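To make the failure mode concrete, here is a minimal sketch of a check that could run before the parse and the coreference chains are combined. The token lists are hand-written stand-ins, not real output from either tokeniser, and `first_mismatch` is a hypothetical helper name.

```python
from itertools import zip_longest


def first_mismatch(tokens_a, tokens_b):
    """Return the first index at which the two token lists differ, or None."""
    # zip_longest pads the shorter list with None, so a length difference
    # is reported as a mismatch at the first missing position.
    for i, (a, b) in enumerate(zip_longest(tokens_a, tokens_b)):
        if a != b:
            return i
    return None


# Illustrative token lists only (assumed, not produced by lambeq or spaCy).
parser_tokens = ["Alice", "ca", "n't", "swim", "."]
coref_tokens = ["Alice", "can't", "swim", "."]

print(first_mismatch(parser_tokens, coref_tokens))  # 1
```

From index 1 onwards, every word index in one list points at a different word in the other, which is where the mismatched indices would come from.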

We should find a way to ensure the same tokenisation is used in both places.
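If the two tokenisers cannot be made to agree, one possible fallback is to translate word indices between the two tokenisations through character offsets. The sketch below assumes every token appears verbatim in the original sentence (no normalisation). `char_spans` and `align` are hypothetical helper names, and the token lists are hand-written examples rather than real tokeniser output.

```python
def char_spans(sentence, tokens):
    """Locate each token in the sentence as a (start, end) character span."""
    spans, pos = [], 0
    for tok in tokens:
        start = sentence.find(tok, pos)
        if start < 0:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    return spans


def align(sentence, src_tokens, dst_tokens):
    """Map each source token index to the destination indices it overlaps."""
    src = char_spans(sentence, src_tokens)
    dst = char_spans(sentence, dst_tokens)
    return [
        [j for j, (ds, de) in enumerate(dst) if ds < se and ss < de]
        for ss, se in src
    ]


# Illustrative token lists only (assumed, not produced by lambeq or spaCy).
sentence = "Alice can't swim."
coref_tokens = ["Alice", "can't", "swim", "."]
parser_tokens = ["Alice", "ca", "n't", "swim", "."]

mapping = align(sentence, coref_tokens, parser_tokens)
print(mapping)  # [[0], [1, 2], [3], [4]]
```

With a mapping like this, a coreference mention at index 1 in one tokenisation could be looked up as indices 1 and 2 in the other. Tokenising once and reusing the same tokens in both steps would avoid the need for any mapping.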
