
Loading Texts

Nils Reiter edited this page May 25, 2018 · 12 revisions

2.1.0

This guide is also available as a vignette in the R console: vignette("Loading-Texts").


To use the analysis functions in this package, you need dramatic texts that have been pre-processed with NLP tools. To facilitate this, we provide downloadable packages of pre-processed files in the required format. Currently, the TextGrid repository is the only collection we provide, but others are planned. The current state of each data package is documented in this repository.

Package Setup

First, choose a directory in which the data files will be stored. By default, a directory called QuaDramA is created in your home directory (~). On Windows, R resolves ~ to the Documents folder.
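To see where that default location would end up on your machine, the following minimal sketch uses only base R and does not call the package. It assumes the default directory is simply QuaDramA directly below whatever R resolves ~ to, as described above; the variable names are illustrative.

```r
# Base R only; nothing here calls DramaAnalysis.
home <- path.expand("~")                    # the directory R resolves "~" to
default_dir <- file.path(home, "QuaDramA")  # assumed default location, per the text above
print(default_dir)
print(dir.exists(default_dir))              # whether the directory already exists
```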

If you want to use a different folder, you can specify it with the setup() function:

setup(dataDirectory = "YOUR_FOLDER")

The setup() function starts the Java Virtual Machine in the background, which is needed to load the pre-processed texts. It also sets the option qd.datadir to the given directory, so that other functions know where the data is stored. The function needs to be called again whenever the R session is restarted.
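Because the data directory is stored as an R option, you can check with base R whether it has been set in the current session. This is a minimal sketch that relies only on the option name qd.datadir mentioned above; it does not call any package function, and the variable name datadir is illustrative.

```r
# Base R only: inspect the option that setup() is described as setting.
datadir <- getOption("qd.datadir")   # NULL if the option has not been set

if (is.null(datadir)) {
  message("qd.datadir is not set; call setup() first.")
} else {
  message("Data directory: ", datadir,
          " (exists: ", dir.exists(datadir), ")")
}
```

Running this after a restart is a quick way to notice that setup() has not yet been called in the new session.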

Installation of Data

To install a data package, you simply run the function installData(). The download size is approximately 700MB, so it might take a while. Luckily, this only needs to be done once.

installData()

In addition, we provide a categorization of these texts into genres. It can be downloaded in a similar way.

installCollectionData()

Loading of Dramatic Texts

Each dramatic text in one of our data packages has an id, and we load texts using this id. The id itself is just a sequence of alphanumeric symbols. The function loadAllInstalledIds() gives you a list of all installed ids. Since this list is not particularly helpful on its own, we keep an overview in the corpus repository.

Most functions expect a text that has been loaded with the function loadSegmentedText().

library(DramaAnalysis)
setup()
text <- loadSegmentedText("tg:rksp.0")
colnames(text)
## NULL

The resulting data.frame contains 18 columns, with each word on a separate row. The call to colnames(text) lists the column names. Columns 1, 2, and 16 describe the entire dramatic text (corpus, id, and length in tokens). Columns 3-8 describe the segment(s) a word appears in. Columns 9-12 describe the utterance: its begin and end character positions, and the speaker's name and numeric id. Columns 13-15 contain the surface form, part of speech, and lemma of each token, and columns 17 and 18 hold information about mentioned characters.
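Before passing a loaded text to other functions, it can be useful to confirm that the object really has the shape described above. The following is a minimal sketch in base R; looks_like_segmented_text is a hypothetical helper and not part of the package, and the only thing it checks is the data.frame class and the column count of 18 stated above.

```r
# Hypothetical helper (not part of DramaAnalysis): TRUE if x is a data.frame
# with the expected number of columns (18, per the description above).
looks_like_segmented_text <- function(x, expected_cols = 18L) {
  is.data.frame(x) && ncol(x) == expected_cols
}

# A bare character string is not a loaded text.
print(looks_like_segmented_text("tg:rksp.0"))   # FALSE

# A placeholder data.frame with 18 columns passes the shape check.
dummy <- as.data.frame(matrix(NA, nrow = 2, ncol = 18))
print(looks_like_segmented_text(dummy))         # TRUE
```

If colnames(text) returns NULL, or summary(text) reports a single character value, the object is not a data.frame and this check returns FALSE.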

summary(text)
##    Length     Class      Mode 
##         1 character character
