Loading Texts
This guide is also available as a vignette in the R console: vignette("Loading-Texts").
To use the analysis functions in this package, you need NLP-pre-processed dramatic texts. To facilitate this, we provide downloadable packages of pre-processed files in the correct format. Currently, only the TextGrid repository is available as a collection, but others are planned. The current state of each data package is documented in this repository.
First, you need to choose a directory in which the data files are to
be stored. By default, a directory called QuaDramA will be created in
your home directory (~). On Windows, the home directory is the
Documents folder.
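To see where this default would end up on your machine, a minimal sketch in base R is shown below. It assumes the default location is simply QuaDramA inside whatever R resolves ~ to; how the package itself resolves the path is not shown in this guide.

```r
# Resolve the home directory the way base R does; on Windows this is
# typically the user's Documents folder.
home <- path.expand("~")

# Assumed default location, based on the description above.
default_dir <- file.path(home, "QuaDramA")

print(default_dir)
# TRUE once the directory has been created, FALSE before that.
print(dir.exists(default_dir))
```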
If you want to specify a different folder, you can do so with the
setup() function:
setup(dataDirectory = "YOUR_FOLDER")
The setup function is responsible for starting the Java Virtual Machine
in the background, which we need in order to load the pre-processed
texts. It also sets the option qd.datadir to the given directory, so
that other functions know where the data is stored. Whenever the R
session is restarted, this function needs to be called again.
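Because the option is lost when the session restarts, it can be handy to check whether it is currently set. The following is a minimal sketch using only base R; current_data_dir() is a hypothetical helper and not part of the package, and the option is set by hand here so that the example runs without the package installed.

```r
# Hypothetical helper (not part of DramaAnalysis): returns the configured
# data directory, or stops with a hint if the option has not been set.
current_data_dir <- function() {
  dir <- getOption("qd.datadir")
  if (is.null(dir)) {
    stop("Option 'qd.datadir' is not set; call setup() first.")
  }
  dir
}

# Simulate what setup() is described as doing, so the sketch runs
# without the package. The path is a placeholder.
old <- options(qd.datadir = file.path(tempdir(), "QuaDramA"))
print(current_data_dir())
options(old)  # restore the previous value
```

In a real session you would call setup() instead of setting the option manually.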
To install a data package, you simply run the function installData().
The download size is approximately 700MB, so it might take a while.
Luckily, this only needs to be done once.
installData()
In addition, we provide a categorization of these texts into genres. It can be downloaded in a similar way.
installCollectionData()
Each dramatic text in one of our data packages has an id, and we load
texts using this id. The id itself is just a sequence of alphanumeric
symbols. The function loadAllInstalledIds() gives you a list of all
installed ids. Because a bare list of ids is not particularly helpful,
we keep an overview in the corpus repository.
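If you want to narrow down a list of ids, plain string handling is enough. The sketch below assumes that the part before the colon (as in "tg:rksp.0", used later in this guide) identifies the collection; that reading is an assumption, and all ids other than "tg:rksp.0" are invented for illustration. The format returned by loadAllInstalledIds() may differ.

```r
# Hypothetical ids for illustration; only "tg:rksp.0" appears in this guide.
ids <- c("tg:rksp.0", "tg:example.1", "other:example.2")

# Split each id at the first colon into a prefix and a local part.
prefix   <- sub(":.*$", "", ids)
local_id <- sub("^[^:]*:", "", ids)

# Keep only the ids whose prefix is "tg".
tg_ids <- ids[prefix == "tg"]
print(tg_ids)
```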
Most functions expect text that is loaded by using the function
loadSegmentedText().
library(DramaAnalysis)
setup()
text <- loadSegmentedText("tg:rksp.0")
colnames(text)
## NULL
The resulting data.frame contains 18 columns, with each word on a separate row. The column names are displayed by colnames(text) above. Columns 1, 2, and 16 describe the entire dramatic text (corpus, id, and length in tokens); columns 3-8 describe the segment(s) a word appears in; columns 9-12 describe utterances (begin and end character position of the utterance, and the speaker name and numeric id); columns 13-15 contain the surface form, part of speech, and lemma of each token; and columns 17 and 18 hold information about mentioned characters.
summary(text)
## Length Class Mode
## 1 character character
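Since each row is one token, simple counts already go a long way. The following sketch works on a toy data.frame rather than a loaded text, so it runs without the package or the data. The column names speaker and surface are invented for this example and may not match the names the package uses.

```r
# Toy stand-in for a loaded text: one row per token. The column names are
# invented for this sketch and may differ from what the package returns.
toy <- data.frame(
  speaker = c("A", "A", "B", "A", "B", "B", "B"),
  surface = c("Good", "morning", "Hello", "!", "How", "are", "you"),
  stringsAsFactors = FALSE
)

# Count tokens per speaker, most talkative first.
tokens_per_speaker <- sort(table(toy$speaker), decreasing = TRUE)
print(tokens_per_speaker)
```

With a real text, the speaker name should be in column 11 according to the column description above, so the same count could be taken on that column instead.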