sentence_splitting

This currently works only for English-language files.

Install:

git clone git@github.com:RedHenLab/sentence_splitting.git
cd sentence_splitting
pip install -r requirements.txt

Usage:

python3 sentence_splitting.py -a /path/to/nonbreaking_prefixes/ [-c captioning_specials.tsv] inputfile.txt | perl filter_metainfo_from_cclines.pl path/to/dictionaries | perl join_lines.pl > outputfile.xml
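When the pipeline has to be started from a script rather than typed by hand, the same three-stage chain can be sketched in Python. This is a minimal sketch, not part of the repository: the helper names build_commands and run_pipeline are hypothetical, and it assumes that python3, perl and the three scripts are reachable from the working directory exactly as in the command above.

```python
import subprocess


def build_commands(input_file, prefixes_dir, dictionaries_dir, captioning_file=None):
    """Return the three argv lists of the usage command (hypothetical helper)."""
    splitter = ["python3", "sentence_splitting.py", "-a", prefixes_dir]
    if captioning_file is not None:
        # -c is optional, as in the usage line
        splitter += ["-c", captioning_file]
    splitter.append(input_file)
    return [
        splitter,
        ["perl", "filter_metainfo_from_cclines.pl", dictionaries_dir],
        ["perl", "join_lines.pl"],
    ]


def run_pipeline(commands, output_file):
    """Connect the commands with pipes; the last stdout goes to output_file."""
    with open(output_file, "wb") as out:
        procs = []
        prev_stdout = None
        for index, cmd in enumerate(commands):
            is_last = index == len(commands) - 1
            proc = subprocess.Popen(
                cmd,
                stdin=prev_stdout,
                stdout=out if is_last else subprocess.PIPE,
            )
            if prev_stdout is not None:
                # Close our copy so the upstream process sees a closed pipe
                # if the downstream one exits early.
                prev_stdout.close()
            prev_stdout = proc.stdout
            procs.append(proc)
        # Exit codes, in pipeline order
        return [proc.wait() for proc in procs]


commands = build_commands(
    "inputfile.txt", "/path/to/nonbreaking_prefixes/", "path/to/dictionaries"
)
# run_pipeline(commands, "outputfile.xml")  # uncomment inside a checkout of the repository
```

A non-zero entry in the returned list points to the stage that failed, which the shell one-liner does not report unless pipefail is set.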

The output is a well-formed XML file that contains exactly one sentence per line. XML tags relevant to the sentence are not guaranteed to be on the same line as the sentence.

To check that the file is well-formed, test it with

xmllint --noout outputfile.xml

If this command terminates without printing an error message, the file is well-formed XML.

The optional parameter -c captioning_specials.tsv should name a file that lists (non-spoken) captioning information, one caption per line. A caption that spans multiple lines has its lines separated by tabs (\t). For example

Captioning funded by CBS\tand FORD.\tWe go further, so you can.
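To make the layout of the -c file concrete, here is a minimal Python sketch that reads such a file into one list of caption lines per row. It is an illustration only and not how sentence_splitting.py itself parses the file. The function name read_captioning_specials is hypothetical, and the sketch assumes the file contains real tab characters, with \t in this README being just the way they are displayed.

```python
import io


def read_captioning_specials(stream):
    """Return one list of caption lines per non-empty row (hypothetical helper)."""
    captions = []
    for raw in stream:
        row = raw.rstrip("\r\n")
        if not row:
            continue  # skip blank rows
        # Assumption: the lines of one caption are separated by tab characters.
        captions.append(row.split("\t"))
    return captions


# The example row from this README, written with real tab characters
sample = io.StringIO(
    "Captioning funded by CBS\tand FORD.\tWe go further, so you can.\n"
)
captions = read_captioning_specials(sample)
```

For the sample row, captions holds a single entry with three caption lines. To read the file from disk, pass an open file object instead of the StringIO sample.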

The output can then be processed with Stanford CoreNLP using the following commands (for version 3.7.0).

Dependency Parser:

java -XX:+UseNUMA -Xmx3g -cp "/path/to/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger -parse.model edu/stanford/nlp/models/srparser/englishSR.beam.ser.gz -annotators tokenize,cleanxml,ssplit,pos,truecase,lemma,ner,depparse -parse.maxlen 100 -ssplit.eolonly true -truecase.overwriteText true -outputFormat json -file outputfile.xml

Full pipeline using the shift-reduce parser with beam search (less robust!):

java -XX:+UseNUMA -Xmx5g -XX:MaxMetaspaceSize=1g -Xss2048k -cp "/path/to/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger -parse.model edu/stanford/nlp/models/srparser/englishSR.beam.ser.gz -annotators tokenize,cleanxml,ssplit,pos,truecase,lemma,ner,parse,dcoref,relation,natlog,quote,sentiment -parse.maxlen 100 -ssplit.eolonly true -coref.algorithm neural -truecase.overwriteText true -outputFormat json -file outputfile.xml

Given the long setup time, it may make sense to use -filelist instead of -file to process multiple files at once.
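A file list can be generated with a few lines of Python. This is a minimal sketch under the assumption that -filelist expects a plain text file with one input path per line. The helper name write_filelist and the directory layout are hypothetical, and the demo writes placeholder files into a temporary directory only to show the result.

```python
import tempfile
from pathlib import Path


def write_filelist(xml_dir, filelist_path, pattern="*.xml"):
    """Write the absolute paths of matching files, one per line (hypothetical helper)."""
    paths = sorted(str(p.resolve()) for p in Path(xml_dir).glob(pattern))
    Path(filelist_path).write_text(
        "".join(f"{p}\n" for p in paths), encoding="utf-8"
    )
    return paths


# Demo with placeholder files; real use would point at the directory of output files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.xml", "a.xml", "notes.txt"):
        (Path(tmp) / name).write_text("<doc/>\n", encoding="utf-8")
    listed = write_filelist(tmp, Path(tmp) / "filelist.txt")
    names = [Path(p).name for p in listed]
    line_count = len((Path(tmp) / "filelist.txt").read_text(encoding="utf-8").splitlines())
```

The resulting file is then passed as -filelist filelist.txt in place of -file outputfile.xml in either of the commands above.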
