It would be good to have standard train/dev/test splits (and folds)
Because the corpora, especially the Sherlock Holmes stories have widely varying sentence lengths, we should split on number of words, not sentences. Ideally, splits should happen only on the first sentence in a paragraph, or a header.
It would be good to have standard train/dev/test splits (and folds)
Because the corpora, especially the Sherlock Holmes stories have widely varying sentence lengths, we should split on number of words, not sentences. Ideally, splits should happen only on the first sentence in a paragraph, or a header.