| current version | 0.7.3 |
|---|---|
This documentation describes the tools for processing transcripts of
spoken data in TEI. Some of the tools can also be applied to TEI
documents which are at least `<w>`-annotated.
In principle, target documents are those conforming to the ISO standard ISO 24624:2016(E) ‘Language resource management -- Transcription of spoken language’.
This is the library containing the tools. A companion suite of web services called TEILicht (source) makes the same functionality available as RESTful web services, i.e. you can upload documents and download them in processed form. The web services are also available via WebLicht, but because they process TEI-encoded files, they break with WebLicht’s convention of processing TCF files.
All functions are also accessible from the command line. After building, try the following in the root directory:

    java -cp 'target/dependency/*' -jar target/teispeechtools-VERSION.jar

or, in the directory containing the jar:

    java -cp 'dependency/*' -jar teispeechtools-VERSION.jar

and follow the help. Together with this description, that should be enough to get started. If not, contact Bernhard Fisseni.
A simple wrapper script is available.
The names of the CLI commands correspond to those of the TEILicht web services.
The names of the CLI options correspond to those of the parameters
used in the Java library. However, some abbreviations are possible,
and CamelCase was converted to kebab-case.
To see all options, execute the wrapper script once.
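The CamelCase to kebab-case mapping mentioned above can be sketched as follows. This is only an illustration of the naming rule, not the library's own code; the class and method names are hypothetical, and whether a given option (for example `--use-phones` for `usePhones`) exists in exactly this form should be checked against the help output.

```java
import java.util.Locale;

class KebabCaseSketch {
    /** Converts a camelCase identifier such as "usePhones" to "use-phones". */
    static String toKebabCase(String camel) {
        // Put a hyphen between a lower-case letter or digit and the
        // upper-case letter that follows it, then lower-case everything.
        return camel.replaceAll("([a-z0-9])([A-Z])", "$1-$2")
                    .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Hypothetical: derive a CLI option name from a library parameter name.
        System.out.println("--" + toKebabCase("usePhones"));
    }
}
```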

- Input: a plain text file containing a transcription in the Simple EXMARaLD...A format. This format allows you to encode utterances and the overlap between them, incidents that occur independently of or simultaneously with an utterance, and a commentary (e.g. a translation) on utterances or incidents.
- Output: a transcription file conforming to the TEI specification, split into utterances: `<annotationBlock>` and `<u>`, `<incident>` elements, together with `<spanGrp>`s containing the commentary. A `<timeline>` is derived from the text, and all annotation is situated with respect to the `<timeline>`.
- Parameters:
  - the `language` of the utterance. If the header of the document specifies a language, the header value will take precedence.

- Input: a TEI-conformant XML document containing `<u>` elements which contain plain text formatted according to a transcription convention (generic, cGAT minimal, cGAT basic) and potentially `<anchor>` elements referring to the `<timeline>`. This can be the output of the previously described plain text conversion.
- Output: a TEI-conformant XML document in which the `<u>` elements have been segmented into words (`<w>`) and the transcription conventions have been resolved to XML markup such as `<pause>`, `<gap>`, etc.
- Parameters:
  - the `language` of the document (if there is language information in the document, it will be preferred),
  - the transcription convention which the text content of the `<u>` follows. Currently `generic`, (cGAT) `minimal` and (cGAT) `basic` are supported.
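To give an idea of what segmentation produces, here is a deliberately tiny sketch. It is not the library's implementation, which works from regular-expression pattern files for each convention. The sketch assumes whitespace-separated tokens, assumes that `(.)` marks a micro pause, and assumes that it is rendered as `<pause type="micro"/>`; all names in the code are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentationSketch {
    /** Turns the plain text of one utterance into a list of XML fragments. */
    static List<String> segment(String utteranceText) {
        List<String> out = new ArrayList<>();
        for (String token : utteranceText.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            if (token.equals("(.)")) {
                // Assumption: "(.)" is a micro pause in the convention used.
                out.add("<pause type=\"micro\"/>");
            } else {
                out.add("<w>" + escape(token) + "</w>");
            }
        }
        return out;
    }

    /** Minimal XML escaping for element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        segment("ja (.) genau").forEach(System.out::println);
    }
}
```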

- Input: a TEI-conformant XML document containing `<u>` elements which either contain only plain text and `<anchor>` elements or have been analysed into `<w>` (other contents possible). In the first case, the whole text content (excluding) will be processed; in the latter case, only the contents of `<w>` elements will be processed.
- Output: a TEI-conformant XML document where the `<u>` have been annotated with `@xml:lang` attributes wherever the algorithm reached a decision. Cases of doubt are reported in XML comments. If languages are equally probable, the document language is preferred.
- Parameters:
  - the `language` of the document (if there is language information in the document, it will be preferred),
  - the `expected languages` of the document, which constrain the search space for language detection; the full search space contains 103 languages. The more precisely you know which languages are expected, the better detection will work. The web service variant also accepts parameters `exspected1`, `exspected2`, … `exspected5`, each of which contains a single language code.
  - the `minimal count` of words required before language detection is attempted (default: 5, which is already pretty low),
  - whether to `force` language detection, even if a language tag has already been assigned to `<u>`.
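The interplay of the parameters above can be sketched as a small decision function. This is an illustration under assumptions, not the library's code: the per-language scores are taken as given (restricted to the expected languages by the caller), and all names are hypothetical. An empty result means that the utterance is left untouched.

```java
import java.util.Map;
import java.util.Optional;

class LanguageDecisionSketch {
    static Optional<String> decide(Map<String, Double> scores,
                                   String documentLanguage,
                                   int wordCount, int minimalCount,
                                   String existingLanguage, boolean force) {
        // An existing language tag is only overwritten when forced.
        if (existingLanguage != null && !force) {
            return Optional.empty();
        }
        // Too few words: detection is not even tried.
        if (wordCount < minimalCount) {
            return Optional.empty();
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            double s = e.getValue();
            // On a tie, the document language wins.
            boolean better = s > bestScore
                    || (s == bestScore && e.getKey().equals(documentLanguage));
            if (better) {
                best = e.getKey();
                bestScore = s;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        System.out.println(decide(Map.of("de", 0.5, "en", 0.5), "en", 8, 5, null, false));
    }
}
```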

- Input: a TEI-conformant XML document containing `<u>` elements which have been analysed into `<w>` (other contents possible).
- Output: a TEI-conformant XML document where the `<w>` have been annotated with a `@norm` attribute containing the normalized form. Currently, normalization is only applied to text in German.
- Parameters:
  - the `language` of the document (if there is language information in the document, it will be preferred),
  - whether to `force` normalization, even if `<w>`s already have `@norm` attributes.
This service is based on the algorithm in OrthoNormal (German
description only). It uses the DictionaryNormalizer, which applies
normalization based on dictionaries from the FOLK and the DeReKo
corpora. It searches for `<w>` elements and applies normalization to
their text content.
Normalization:
- Step 1: The most frequent normalization for a word form in the FOLK corpus is applied.
- Step 2: If nothing is found in Step 1, the list of words that occur capitalized-only in the Deutsches Referenzkorpus (DeReKo) is consulted and a normalization is chosen.
- Step 3: Out-of-dictionary words are left as is.
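The three steps can be sketched as a simple lookup cascade. This is a minimal illustration, not the DictionaryNormalizer itself: the dictionaries are passed in as plain collections, and the way Step 2 picks a normalization (capitalizing the first letter and checking the capitalized-only list) is an assumption made for the example.

```java
import java.util.Map;
import java.util.Set;

class NormalizationSketch {
    /**
     * @param folk            word form mapped to its most frequent FOLK normalization
     * @param capitalizedOnly words that occur only capitalized in DeReKo
     */
    static String normalize(String form, Map<String, String> folk,
                            Set<String> capitalizedOnly) {
        // Step 1: most frequent normalization in the FOLK dictionary.
        String fromFolk = folk.get(form);
        if (fromFolk != null) {
            return fromFolk;
        }
        // Step 2 (assumed mechanics): try the capitalized variant.
        if (!form.isEmpty()) {
            String capitalized = Character.toUpperCase(form.charAt(0)) + form.substring(1);
            if (capitalizedOnly.contains(capitalized)) {
                return capitalized;
            }
        }
        // Step 3: out-of-dictionary words are left as is.
        return form;
    }

    public static void main(String[] args) {
        Map<String, String> folk = Map.of("nich", "nicht");
        Set<String> caps = Set.of("Haus");
        System.out.println(normalize("nich", folk, caps));
        System.out.println(normalize("haus", folk, caps));
        System.out.println(normalize("xyz", folk, caps));
    }
}
```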

- Input: a TEI-conformant XML document containing `<u>` elements which have been analysed into `<w>` (other contents possible).
- Output: a TEI-conformant XML document where the `<w>` have been pos-tagged (`@pos` attribute) and lemmatized (`@lemma` attribute) with the TreeTagger.
- Parameters:
  - the `language` of the document (if there is language information in the document, it will be preferred),
  - whether to `force` tagging, even if a pos tag has already been assigned to `<w>`.

- Input: a TEI-conformant XML document
  - containing `<u>` elements which have been analysed into `<w>` (other contents possible) and
  - a timeline that conforms to the following constraints:
    - if no `--time` parameter is given, the duration must be specified in the document as follows: the `<tei:timeline>` contains a leading `<tei:when>` that has an `@interval` of 0 to itself; there is a last `<tei:when>` element that refers to this element (`@since`), and its `@interval` will be the duration.
    - if no `--offset` parameter is given, the `@interval` between the first and second `<tei:when>` will be considered the offset. Offsets are always positive.
- Output: a TEI-conformant XML document where
  - the `<w>` have been assigned a proportion of the utterance duration corresponding to the number of letters or IPA signs in the `<w>`,
  - duration information for `<u>` will be in the `<tei:timeline>`.
- Parameters:
  - the `language` of the document (if there is language information in the document, it will be preferred),
  - whether to `force` tagging, even if a pos tag has already been assigned to `<w>`,
  - whether to `transcribe` using BAS Web Services. See the BAS documentation ("runG2P") for the supported locales (non-ISO-639 codes like `nze` are not supported here). The service will do some adjustment to be able to transcribe (e.g., it accepts `ltz` and not just the full `ltz-LU` for Luxembourgish). Transcription is only used if phones are used for pseudoalignment, see next option.
  - whether to `usePhones` for pseudoalignment. If transcription using BAS' web service is possible and `usePhones` is true, the transcription will be used to guess the proportion of utterance duration to assign to the `<w>`. If no transcription is possible, or transcription is disabled, the number of letters will be used to pseudo-align, and no transcription will be added.
  - the `time` duration of the utterance. Setting the time to -1 (default) means that it will be derived from the document as described above.
  - the `offset` of the utterance, i.e. the time of the first timeline event. Setting the offset to -1 (default) means that it will be derived from the document as described above.
  - `every`, a number of items after which to insert anchors.
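The core idea of pseudo-alignment, distributing the utterance duration over the words in proportion to their length, can be sketched as follows. This is an illustration only, assuming letter counts (not IPA signs) as the length measure and a single utterance; the names are hypothetical and the library's handling of the timeline and of anchors is not shown.

```java
class PseudoAlignSketch {
    /**
     * Returns the estimated end time of each word.
     *
     * @param offset   time of the first timeline event
     * @param duration duration of the utterance
     */
    static double[] wordEndTimes(String[] words, double offset, double duration) {
        int total = 0;
        for (String w : words) {
            total += w.codePointCount(0, w.length());
        }
        double[] ends = new double[words.length];
        int seen = 0;
        for (int i = 0; i < words.length; i++) {
            seen += words[i].codePointCount(0, words[i].length());
            // Each word ends after its cumulative share of the duration.
            ends[i] = (total == 0) ? offset : offset + duration * seen / total;
        }
        return ends;
    }

    public static void main(String[] args) {
        double[] ends = wordEndTimes(new String[] {"ja", "genau"}, 1.0, 7.0);
        for (double end : ends) {
            System.out.println(end);
        }
    }
}
```

With the two words "ja" (2 letters) and "genau" (5 letters), an offset of 1.0 and a duration of 7.0, the first word ends at 3.0 and the second at 8.0.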
For some operations, it must be possible to address all structural
elements with an `@xml:id` attribute. `identify` adds `@xml:id`
attributes to all TEI elements that do not have one, and `unidentify`
removes such attributes whose form suggests they have been added by
`identify`.
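A minimal sketch of this pair of operations, using the standard Java DOM API, could look like this. It is not the library's implementation: the `gen_` prefix used to recognize generated IDs is hypothetical, as are the class and method names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class IdentifySketch {
    // Hypothetical marker for generated IDs; the real ID format may differ.
    static final String PREFIX = "gen_";
    static final String XML_NS = XMLConstants.XML_NS_URI;

    /** Adds xml:id to every element lacking one; returns the number added. */
    static int identify(Document doc) {
        NodeList all = doc.getElementsByTagName("*");
        Set<String> used = new HashSet<>();
        for (int i = 0; i < all.getLength(); i++) {
            used.add(((Element) all.item(i)).getAttributeNS(XML_NS, "id"));
        }
        int added = 0;
        int counter = 0;
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            if (el.hasAttributeNS(XML_NS, "id")) {
                continue;
            }
            String id;
            do {
                id = PREFIX + (++counter); // skip IDs already in use
            } while (!used.add(id));
            el.setAttributeNS(XML_NS, "xml:id", id);
            added++;
        }
        return added;
    }

    /** Removes xml:id attributes that look generated; returns the number removed. */
    static int unidentify(Document doc) {
        NodeList all = doc.getElementsByTagName("*");
        int removed = 0;
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            if (el.getAttributeNS(XML_NS, "id").startsWith(PREFIX)) {
                el.removeAttributeNS(XML_NS, "id");
                removed++;
            }
        }
        return removed;
    }

    /** Parses an XML string with namespace awareness switched on. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<TEI><u xml:id=\"u1\"><w>ja</w></u></TEI>");
        System.out.println("added: " + identify(doc));
        System.out.println("removed: " + unidentify(doc));
    }
}
```

Recognizing generated IDs by a fixed prefix keeps `unidentify` from touching IDs that were already present in the document, such as `u1` in the example.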
Besides the dependencies available via Maven, the library needs some
utility functions. These can be installed locally with `mvn install`.
    mvn install dependency:copy-dependencies

installs the package locally with Maven and copies the dependencies to
target/dependency. You can then use this package as a library, or use
the CLI (described above). If you only want to use the CLI, you can run:

    mvn package dependency:copy-dependencies

This will pull in the sources and make a runnable JAR.
To speed up loading time, one can compile the dictionary, which combines the FOLK and DeReKo dictionaries described above. The combined dictionary is contained in downloads. From the root directory of the project, execute:

    java -cp 'target/teispeechtools-VERSION.jar:target/dependency/*' \
      de.ids.mannheim.clarin.teispeech.tools.DictMaker

To check the files with regular expressions for transcription conventions, run:

    java -cp 'target/dependency/*:target/teispeechtools-VERSION.jar' \
      de.ids.mannheim.clarin.teispeech.tools.PatternReaderRunner -h

to get help. E.g.,

    java -cp 'target/dependency/*:target/teispeechtools-VERSION.jar' \
      de.ids.mannheim.clarin.teispeech.tools.PatternReaderRunner \
      -i src/main/xml/Patterns.xml --language universal --level 2

(only levels 2 and 3 are available)