diff --git a/P5/Source/Guidelines/en/AI-AnalyticMechanisms.xml b/P5/Source/Guidelines/en/AI-AnalyticMechanisms.xml index 65ea77daed..508c1bcbd8 100644 --- a/P5/Source/Guidelines/en/AI-AnalyticMechanisms.xml +++ b/P5/Source/Guidelines/en/AI-AnalyticMechanisms.xml @@ -1,6 +1,8 @@ - + +
Simple Analytic Mechanisms

This chapter describes a module for associating simple analyses and @@ -23,7 +25,7 @@ text segments according to the familiar linguistic categories of generic seg element described in section .

Section introduces an additional global attribute which allows passages of text to be associated with -specialized elements representing their interpretation. +specialized elements representing their interpretation. These interpretative elements (span and interp) are described in detail in section . They allow the encoder to specify an analysis as a series of names and @@ -42,7 +44,7 @@ used to associate simple linguistic analysis with text segments.

category elements which may be used to represent the segmentation of a text into the traditional linguistic categories of sentence, clause, phrase, -word, morpheme, +word, morpheme, characters, and punctuation marks.

Words and Above @@ -50,7 +52,7 @@ a text into the traditional linguistic categories of constitutes a word or a sentence, these remain generally useful concepts. In this section we discuss elements provided for marking up linguistic -units down to the word level, however defined. +units down to the word level, however defined.

@@ -65,26 +67,26 @@ They also share attributes from att.typed: that text is permitted within a document, when the module defined by this chapter is included in a schema.

-

The w and pc elements belong to the att.linguistic class, which supplies -attributes that may be used for lightweight linguistic annotation (see section below): +

The w and pc elements belong to the att.linguistic class, which supplies +attributes that may be used for lightweight linguistic annotation (see section below):

-

Additionally, these elements also have access to the att.lexicographic.normalized class, -which supplies the attributes norm and orig: the former for handling -normalization/regularization at the word level, the latter providing the original form if the element -content is modernized or regularized. Note that these attributes are a local (word-level) alternative -to the robust mechanism that uses the choice, orig, and reg elements, +

Additionally, these elements also have access to the att.lexicographic.normalized class, +which supplies the attributes norm and orig: the former for handling +normalization/regularization at the word level, the latter providing the original form if the element +content is modernized or regularized. Note that these attributes are a local (word-level) alternative +to the robust mechanism that uses the choice, orig, and reg elements, discussed in section and in chapter . The choice-based mechanism is the default descriptive device, while the norm and orig attributes are used to -handle a subset of normalizations in linguistic contexts where a single sequence of tokens is a priority, for example -in historical corpora subject to linguistic analysis. It needs to be stressed that the simplified attribute-based +handle a subset of normalizations in linguistic contexts where a single sequence of tokens is a priority, for example +in historical corpora subject to linguistic analysis. It needs to be stressed that the simplified attribute-based mechanism is not meant to be used for editorial interventions. -The att.lexicographic.normalized class is also used in dictionary -entries, as discussed in chapter .

+The att.lexicographic.normalized class is also used in dictionary +entries, as discussed in chapter .

The s element may be used simply to segment a text end-to-end into a series of non-overlapping segments, referred to here and elsewhere as s-units, or sentences. -

+

Nineteen fifty-four, when I was eighteen years old, is held to be a crucial turning point in the history of the Afro-American — for the U.S.A. as a whole — the @@ -143,7 +145,7 @@ phrases, with no need to include segmentation at a higher level as well.

For verse texts, the overlapping of metrical and syntactic structure requires that special care be given to representing both using an element hierarchy. One simple approach is to split the syntactic phrases -into fragments when they cross verse boundaries, reuniting them +into fragments when they cross verse boundaries, reuniting them with the part attribute:

Tweedledum and Tweedledee @@ -160,7 +162,7 @@ Another approach is to use the next and prev attributes defined in the additional module for linking (chapter ): For Tweedledum said Tweedledee - + Had spoiled his nice new rattle. Other methods are also possible; for discussion, see chapter .

@@ -171,8 +173,8 @@ category. The function attribute on the cl and about the function of the category. Legal values for these two attributes are not defined by these Guidelines, but should be documented in the segmentation element of the -encodingDesc element within the document's header. -A general approach to the encoding of linguistic categories for +encodingDesc element within the document's header. +A general approach to the encoding of linguistic categories for parts of a text is discussed in section below.

Using traditional terminology, these attributes provide a convenient @@ -192,16 +194,16 @@ Such detailed encodings as the following may require careful formatting if they are to be easily readable however.

- + Nineteen fifty-four, when I was eighteen years old , - + is held - - + + to be a crucial turning point in @@ -218,16 +220,16 @@ formatting if they are to be easily readable however. the year - + segregation - + was outlawed by the U.S. Supreme Court . - + It - + was also a crucial year for me @@ -287,11 +289,11 @@ describes a phrase which consists of something mentioned.

The w element carries additional attributes -which may be of use in many indexing or analytic applications. The +which may be of use in many indexing or analytic applications. The lemma attribute may be used to specify the lemma, that is the head- or uninflected form of an inflected verb or noun, for example: - + timeo Danaos et @@ -308,7 +310,7 @@ pointer mechanisms. Assuming that a standardized lexicon for Latin is available at the location http://lexicon.org/latin.xml, we might for example revise the above example as: - + timeo Danaos @@ -325,11 +327,11 @@ such as morphemes, characters, or punctuation.

The m element is used to mark up morphologically identified segmentation below the word level. Analogous to the lemma attribute for w, there is a -baseForm attribute for the m element, +baseForm attribute for the m element, which may be used to indicate the base form of an inflected morpheme; where appropriate, m elements may also be organized hierarchically: - + com fort @@ -344,7 +346,7 @@ provide a means for those cases where it is considered helpful to distinguish lexical from sub-lexical tokens, to complement the more general mechanism already provided by the seg element, using which the above example could alternatively be marked up as follows: - + com fort @@ -443,12 +445,12 @@ named para:

A similar encoding can be used for hyphenation: - + A fire-proof vest is recom- -mended. +mended. - - Refer to for a discussion of the motivations for + + Refer to for a discussion of the motivations for explicitly recording the presence of hyphens.

@@ -457,7 +459,7 @@ together to give a fairly detailed low-level grammatical analysis of text. For example, consider the following segmentation of the English S-unit I didn't do it. I - + did n't @@ -472,8 +474,8 @@ component of the word didn't has something in common with the word do. A further advantage of segmenting the text down to this level is that it becomes relatively simple to associate each such segment with a more detailed formal analysis, for example by -providing a baseform, or morphological analysis at whichever level is appropriate. -This matter is taken up in detail in section . +providing a baseform, or morphological analysis at whichever level is appropriate. +This matter is taken up in detail in section .

@@ -489,6 +491,7 @@ This matter is taken up in detail in section . +
@@ -500,9 +503,9 @@ The ana attribute may be specified for any element. Its effect is to associate the element with one or more others representing an analysis or interpretation of it. Its target should be one of the elements described in the section below, -or some other interpretative element such as note, on which +or some other interpretative element such as note, on which see section or fs, -on which see chapter . If a hierarchical form of classification +on which see chapter . If a hierarchical form of classification is desired then it may point to category element at a suitable level in a taxonomy see .

@@ -535,7 +538,7 @@ grouping elements spanGrp and interpGrp. span and interp elements may be used to indicate that the annotations are of specific types, for example thematic or structural. The annotation itself is supplied as the content of the -span or interp element. +span or interp element. In the case of the span element, the span of text being annotated is indicated by values of the from, to or target attributes, used in combination as @@ -549,7 +552,7 @@ aggregating the contents of the (possibly non-contiguous) elements pointed to by is an error to supply only the to attribute; to supply more than one pointer value for either to or from attributes; or to supply either of these in conjunction with the -target attribute. +target attribute. In the case of interp (see below), the span is indicated by a pointer from a link element or some similar mechanism. The resp attribute indicates the annotator responsible for this annotation. @@ -570,7 +573,7 @@ attribute: phrasal verb "make up" This second approach might be cumbersome if the number of components -to be combined is very large. It is however essential if the +to be combined is very large. It is however essential if the components do not follow each other, as in this example: Didyoumakeitup @@ -617,7 +620,7 @@ convenient to group them within a spanGrp or interpGrp element

-

Spans may also be used to represent structural divisions within +

Spans may also be used to represent structural divisions within a narrative, particularly when these do not coincide with the structure implied by the element structure. Consider the following narrative: @@ -807,8 +810,8 @@ as borderline cases both the formal structural properties of the text information about its context (the circumstances of its production, its genre or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere -in this chapter and in chapters , , -, , , , +in this chapter and in chapters , , +, , , , and . The contextual properties of a TEI text are fully documented in the TEI header, which is discussed in chapter , and in section . @@ -845,25 +848,25 @@ into the quarry and never surfaced.

Our discussion focuses on the way that this sentence might be analysed using the CLAWS system developed at the University of Lancaster but exactly the same principles may be applied to a wide variety of other systems.For the word-class tagging method used by CLAWS see -; +; For an overview of the system see . The example sentence was processed using an online version of the CLAWS tagger at Output from the system consists of a segmented and tokenized version of the text, in which word class codes have been associated with each token. CLAWS offers outputs in a variety of non-XML and XML formats: for example, the simplest format for the sample sentence would be: -

This may be easily transformed into an equivalent TEI XML representation: -The -victim's -friends told -police that -Kruger drove into -the quarry -and never -surfaced +The +victim's +friends told +police that +Kruger drove into +the quarry +and never +surfaced Although the names used for the attribute values here may have some significance for the human reader (AT0 for @@ -887,7 +890,7 @@ in the header: Preposition Verb past tense - + If the codes are considered to be compositional (for example that NN1 and NN2 @@ -903,8 +906,8 @@ analysis into phrase and clause elements can be superimposed on the word and morpheme tagging in the preceding illustration. For example, CLAWS provides the following constituent analysis of the sample sentence (the word class codes have been deleted): -

Treating the labels on the brackets as phrase or clause interpretations, this analysis of the structure of the example sentence @@ -913,37 +916,37 @@ can be combined with the word class analysis and represented as follows phrase, has been replaced by V1, and V+, representing the second part, has been replaced by V2). - - + + The victim 's friends - + told police - + that Krueger - - + + drove - + into - + the quarry and - + never surfaced @@ -974,15 +977,15 @@ In this case, each linguistic segment to be annotated must be supplied with its xml:id attribute: -The -victim -'s friends -told police -that Kruger -drove into -the quarry -and never -surfaced +The +victim +'s friends +told police +that Kruger +drove into +the quarry +and never +surfaced Each segment-interpretation pair may now be represented by means of a link element inside an appropriate linkGrp element: @@ -1028,7 +1031,7 @@ containers for grammatical information, and a standardized way of filling them i attributes that belong to the att.linguistic class:

- +

The essence of lightweight linguistic annotation is that the basic grammatical information is encapsulated at the word level, together with the orthographic shape of the word. This has clear advantages for automatic processing but, on the other hand, this form of data @@ -1039,8 +1042,8 @@ part-of-speech or morphosyntactic descriptions, single values have to fit into t attributes). Another important principle that this kind of annotation is sensitive to is the need for (near) homomorphism between the assumed tokenization (division of the text stream into minimal units) and the division into minimal syntactic units (word forms, -in the terminology of ISO Morpho-Syntactic Framework, ISO 24611All -definitions contained within ISO standards can be accessed at the ISO Online Browsing Platform. For ISO MAF, see +in the terminology of ISO Morpho-Syntactic Framework, ISO 24611All +definitions contained within ISO standards can be accessed at the ISO Online Browsing Platform. For ISO MAF, see .), because it is the former that results from the process of tokenization, but the latter that can be lemmatized and meaningfully described by means of grammatical features. Where tokens are only minimally mismatched with @@ -1050,7 +1053,7 @@ disjoint tokens). Beyond that, more robust TEI mechanisms, based on standoff pri feature structures, should replace lightweight annotation.

-

The basic grammatical information encoded by means of +

The basic grammatical information encoded by means of att.linguistic is sufficient for the purpose of enhancing queries or improving the analysis of search results by, for example, making it possible to distinguish between the noun cut and the identically spelled verb @@ -1065,7 +1068,7 @@ base form (also called bare infinitive), so the value of lemma the verbal forms write, writes, wrote, written, and writing is typically write.

- +

Together with the span-delimiting elements mentioned in this section, such as s, cl, or phr, lightweight grammatical annotation may be used to build basic syntactic constituency structures, where hierarchical information is expressed through @@ -1075,21 +1078,21 @@ syntactic structures, one needs to turn to more robust mechanisms offered by the may involve graph description (see chapter ) or standoff techniques (see section ), and where grammatical labels may need to be annotated by means of feature structures (see chapter ).

- -

Some of the above-mentioned robust methods will also prove handy in cases where more than one tagset -(label inventory) is used to label the words, or where automatic morphological analysis yields multiple -possibilities (for example, the form cutting is morphologically ambiguous between -verbal, adjectival, and nominal) and needs to be followed by (often also automatic) disambiguation in -morphosyntactic contexts, with varying probabilities that may also need to be recorded together with their + +

Some of the above-mentioned robust methods will also prove handy in cases where more than one tagset +(label inventory) is used to label the words, or where automatic morphological analysis yields multiple +possibilities (for example, the form cutting is morphologically ambiguous between +verbal, adjectival, and nominal) and needs to be followed by (often also automatic) disambiguation in +morphosyntactic contexts, with varying probabilities that may also need to be recorded together with their corresponding part-of-speech and morphosyntactic values.

- -

It should be borne in mind that tokenization, lemmatization, part-of-speech identification, and -morphosyntactic labelling, especially when performed automatically, should in most cases be seen as -involving pragmatic decisions, dictated by concrete practical goals, economy of description, or the -demands of particular analytic and/or visualization tools. It comes therefore as no surprise that -numerous alternative (and often conflicting) lemmatization strategies and tagsets exist, in use by -various communities and various tools, and that they change with time (a case in point is the CLAWS -tagset for English, with several versions that merge the part-of-speech and morphosyntactic information + +

It should be borne in mind that tokenization, lemmatization, part-of-speech identification, and +morphosyntactic labelling, especially when performed automatically, should in most cases be seen as +involving pragmatic decisions, dictated by concrete practical goals, economy of description, or the +demands of particular analytic and/or visualization tools. It comes therefore as no surprise that +numerous alternative (and often conflicting) lemmatization strategies and tagsets exist, in use by +various communities and various tools, and that they change with time (a case in point is the CLAWS +tagset for English, with several versions that merge the part-of-speech and morphosyntactic information to various degrees). Given that the English language has relatively poor inflectional morphology, the decision to merge part-of-speech symbols with morphosyntactic features (as @@ -1097,32 +1100,32 @@ in, e.g., CLAWS-7, where the value PPHO1 signals the 3rd person singu pronoun) is fully justified as the most economical approach. For languages with more robust inflection, the pos and msd attributes will either be used separately, or the part-of-speech information will be merged into the morphosyntactic -description. The nature and description of these systems is outside the scope of these -Guidelines, but it has to be stressed that all the strategies adopted for linguistic annotation, -even at the lightweight level of complexity, must be documented in the header of the -given electronic resource, not only for the purpose of guaranteeing successful data interpretation and exchange, but +description. 
The nature and description of these systems is outside the scope of these +Guidelines, but it has to be stressed that all the strategies adopted for linguistic annotation, +even at the lightweight level of complexity, must be documented in the header of the +given electronic resource, not only for the purpose of guaranteeing successful data interpretation and exchange, but also for the sake of sustainability of the results of the given project.

- +

The last of the att.linguistic attributes, join, has the most text-technological flavour. It can be used to amend the loss of whitespace-related information in non-inline -markup.

+markup.
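How join can restore whitespace may be sketched in a few lines of Python. The sketch is illustrative only and is not part of the Guidelines: the function name detokenize and the (form, join) pair layout are hypothetical. It assumes that the value right means that no whitespace separates the token from the following one, left that none separates it from the preceding one, and both that neither side is separated.

```python
def detokenize(tokens):
    """Rebuild running text from (form, join) pairs.

    join is None when no join attribute is present; otherwise it is
    assumed to be 'left', 'right' or 'both' (hypothetical input layout).
    """
    text = ""
    for position, (form, join) in enumerate(tokens):
        if position > 0:
            previous_join = tokens[position - 1][1]
            # Two tokens are adjacent if either of them says so.
            glued = previous_join in ("right", "both") or join in ("left", "both")
            if not glued:
                text += " "
        text += form
    return text


# Illustrative token sequence; the join values are assumptions, not copied
# from the Guidelines example.
example_tokens = [
    ("The", None),
    ("victim", "right"),
    ("'s", None),
    ("friends", None),
    ("surfaced", "right"),
    (".", None),
]
restored = detokenize(example_tokens)
```

Without the join values, a processor could not tell whether the clitic and the final full stop were written solid with the preceding word.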

Compare the following two listings. The first difference between them is in the tagset used (CLAWS-5 vs. CLAWS-7) and only serves to exemplify the need to document the choice of descriptive vocabulary in the header, lest the encoded information be unreadable or confusing. The second difference lies in the treatment of inter-token whitespace, and it is here that the join attribute proves indispensable.

- +

The first example listing uses CLAWS-5 and inline annotation, where whitespace serves as part of the markup: -The victim's friends - told police that Kruger - drove into the quarry +The victim's friends + told police that Kruger + drove into the quarry and never surfaced.

- +

In the second example, the attribute join is the only way to encode whether two -tokens are adjacent or not: +tokens are adjacent or not: The victim @@ -1170,15 +1173,15 @@ either record the original spelling (when the corpus text is normalized) or to r normalized variants (when the element content of the corpus preserves the original spelling). The attribute class att.lexicographic.normalized can be used for that purpose:

-

The first fragment below comes from "Gottfried, Newe Welt Vnd Americanische Historien. Frankfurt/M., 1631" +

The first fragment below comes from "Gottfried, Newe Welt Vnd Americanische Historien. Frankfurt/M., 1631" encoded in the Deutsches Textarchiv and records normalized forms in the norm attribute. vnuermuthete Freundſchafft angebotten -

-

The following example comes from the EarlyPrint project and uses the attribute orig to -record the original spelling (note that the xml:id attributes have been removed for the +

+

The following example comes from the EarlyPrint project and uses the attribute orig to +record the original spelling (note that the xml:id attributes have been removed for the sake of readability). he @@ -1187,6 +1190,168 @@ sake of readability). forth

+ +
+Syntactic dependency relations between word-level elements +

For the purpose of encoding linguistic annotations of syntactic dependency +relations between word-level elements, the Guidelines provide attributes that +belong to the att.linguistic.dependency class: + + +

+

Syntactic dependency relations such as subject or +object between words and other tokens are usually +represented by means of dependency graphs. A simple example of such a graph is +given below. +

+ +
+In this directed graph, every node is labelled with one token of the sentence +She buys books. The nodes are connected by arcs which +represent the relation of syntactic superordination. Read in +inverse direction, the arcs correspond to the relation of +syntactic subordination or dependency, holding between a +dependent node and a head node each. Arcs, in turn, +are labelled with the type of the dependency relation +(nsubj, obj, +punct).The labels for types of +dependency relations are taken from the Universal Dependencies (UD) inventory +(cf. ). +For instance, in the case of the arc labelled with obj and connecting +the nodes labelled with the tokens buys and +books, the latter node is said to be the +dependent node and the former its head node.

+

An equivalent rendition of this graph is given below, where the direction of +the arrowless arcs is implicitly given by the vertical dimension. For +convenience, the horizontal positions of the nodes are made explicit in terms of +numerical, 1-based, token indices. +

+ +
+As this rendition implies, the node with token index 2 (labelled with +buys) is the root of the dependency graph: +from this node, every other node in the graph can be reached on a path of arcs +by following them from top to bottom.

+

Using the depN and depR attributes defined by the att.linguistic.dependency class, the dependency relations +in the above example can be encoded in TEI XML as follows: + + + She + buys + books + . + + +Here, depN attributes represent the indices of the tokens marked up +with w and pc elements. Dependency relations are encoded by +means of depR attributes of the w and pc elements +in terms of a colon-separated pair consisting of the token index of their head +node and a label for the type of the dependency relation.
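A processor reading such markup has to split each depR value into its two parts. The following Python sketch shows one way to do so. The function names and the dictionary layout are hypothetical, and the attribute values are reconstructed from the description above rather than copied from the example. The split is made at the first colon only, on the assumption that relation labels may themselves contain a colon (UD subtypes such as nsubj:pass).

```python
def parse_dep_r(value):
    """Split a depR value such as '2:nsubj' into (head_index, relation)."""
    # partition() splits at the first colon, so a subtyped label such as
    # 'nsubj:pass' stays intact in the relation part.
    head, separator, relation = value.partition(":")
    if not separator or not head or not relation:
        raise ValueError(f"not a head:relation pair: {value!r}")
    return head, relation


def find_roots(dep_r_by_index):
    """Return the token indices that have no head, or whose head is '0'."""
    roots = []
    for index, dep_r in dep_r_by_index.items():
        if dep_r is None or parse_dep_r(dep_r)[0] == "0":
            roots.append(index)
    return roots


# depN -> depR for "She buys books."; None marks a token without depR.
# Values are reconstructed from the prose description (an assumption).
sentence = {"1": "2:nsubj", "2": None, "3": "2:obj", "4": "2:punct"}
roots = find_roots(sentence)
```

Token indices are kept as strings because depN is not restricted to plain integers.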

+

A widely used notation for annotations of dependency relations and other +linguistic information is the plain-text, column-based CoNLL-U +format, used by Universal +Dependencies (UD) as well as by related frameworks. A +CoNLL-U representation of the example at hand is given below: +1 She she PRON _ _ 2 nsubj _ _ +2 buys buy VERB _ _ 0 root _ _ +3 books book NOUN _ _ 2 obj _ SpaceAfter=No +4 . . PUNCT _ _ 2 punct _ _ +The first column gives the indices of the tokens in the current sentence, the +second column the tokens themselves, the third column the corresponding lemma, +the fourth column a UD part-of-speech tag, the seventh column the token index of +the syntactic head of the current token, and the eighth column a label of the +type of the dependency relation between the token and its head. The very last, +tenth, column can be used for miscellaneous information, such as whether or not +a space follows the token when the tokens are joined into running text. Empty columns are +marked with _. Note that by convention, the implicit root +buys with token index 2 is explicitly marked by the +dependency relation root, pointing to an unrepresented, +virtual, root node with token index 0.
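The column layout just described can be read with very little code. The following Python sketch splits one CoNLL-U token line into named fields. The function name is hypothetical; the column names are those conventionally used for CoNLL-U, and the sketch assumes tab-separated columns and ignores comment lines and blank lines.

```python
CONLLU_COLUMNS = (
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
)


def parse_conllu_line(line):
    """Map the ten tab-separated columns of a token line to their names."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_COLUMNS):
        raise ValueError(
            f"expected {len(CONLLU_COLUMNS)} columns, got {len(values)}"
        )
    row = {}
    for name, value in zip(CONLLU_COLUMNS, values):
        # '_' marks an empty column, except in FORM and LEMMA, where it
        # may be a literal underscore token.
        if value == "_" and name not in ("FORM", "LEMMA"):
            row[name] = None
        else:
            row[name] = value
    return row


row = parse_conllu_line("3\tbooks\tbook\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No")
```

For real data a dedicated CoNLL-U library is likely to be a better choice; the sketch is only meant to make the column positions concrete.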

+

This virtual root node is made explicit in the +corresponding dependency graph below. +

+ +
+Here, the node with token index 2 and label buys is +connected to an unlabelled node with token index 0 via an arc labelled with +root. Except for the virtual root +node, each node is, in addition, annotated with lemmas and part-of-speech +tags.

+

In order to encode the linguistic annotation from the above CoNLL-U example, +the attributes pos, lemma, and join defined by +the att.linguistic class can be used together with +the depN and depR attributes: + + + She + buys + books + . + + +As in the previous example, depN attributes represent the indices of +the tokens marked up with w and pc elements; the +index 0 for the virtual root node is left +implicit. Dependency relations are again encoded by means of depR +attributes of the w and pc elements. Note that the w +element with the token buys in its content now also has a +depR attribute. Part-of-speech tags and lemmas are given as values +of pos and lemma attributes. Finally, the +join attribute records whether a token is +adjacent to the token on its left-hand or right-hand side, so that the original +spacing can be restored when the tokens are joined.
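The correspondence between the CoNLL-U columns and the TEI attributes can be sketched as a small conversion routine. The Python code below is a minimal illustration and not a reference implementation. The function names and the row dictionaries are hypothetical, no TEI namespace is declared, and the choice of pc for tokens tagged PUNCT and w for everything else is a simplifying assumption.

```python
import xml.etree.ElementTree as ET


def token_to_tei(row):
    """Build a w or pc element from one CoNLL-U row (a dict of columns)."""
    # Assumption: punctuation is recognised by its UPOS tag alone.
    element = ET.Element("pc" if row["UPOS"] == "PUNCT" else "w")
    element.text = row["FORM"]
    element.set("depN", row["ID"])
    element.set("depR", f'{row["HEAD"]}:{row["DEPREL"]}')
    element.set("lemma", row["LEMMA"])
    element.set("pos", row["UPOS"])
    # MISC holds pipe-separated items; SpaceAfter=No becomes join="right".
    if "SpaceAfter=No" in row["MISC"].split("|"):
        element.set("join", "right")
    return element


def sentence_to_tei(rows):
    """Wrap the converted tokens in an s element."""
    sentence = ET.Element("s")
    for row in rows:
        sentence.append(token_to_tei(row))
    return sentence


ROWS = [
    {"ID": "1", "FORM": "She", "LEMMA": "she", "UPOS": "PRON",
     "HEAD": "2", "DEPREL": "nsubj", "MISC": "_"},
    {"ID": "2", "FORM": "buys", "LEMMA": "buy", "UPOS": "VERB",
     "HEAD": "0", "DEPREL": "root", "MISC": "_"},
    {"ID": "3", "FORM": "books", "LEMMA": "book", "UPOS": "NOUN",
     "HEAD": "2", "DEPREL": "obj", "MISC": "SpaceAfter=No"},
    {"ID": "4", "FORM": ".", "LEMMA": ".", "UPOS": "PUNCT",
     "HEAD": "2", "DEPREL": "punct", "MISC": "_"},
]
tei_sentence = sentence_to_tei(ROWS)
serialized = ET.tostring(tei_sentence, encoding="unicode")
```

The serialized string holds the whole s element on one line; a real project would add the TEI namespace and any xml:id values it needs.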

+

A more complex example in the CoNLL-U format is given below:This example is adapted from the documentation of the CoNLL-U +format at . The +tagging follows the conventions of the UD English +EWT corpus. +1 She she PRON PRP Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs 2 nsubj 2:nsubj|4:nsubj _ +2 buys buy VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 0 root 0:root _ +3 and and CCONJ CC _ 4 cc 4:cc _ +4 sells sell VERB VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 2 conj 0:root|2:conj _ +5 books book NOUN NNS Number=Plur 2 obj 2:obj|4:obj SpaceAfter=No +6 . . PUNCT . _ 2 punct 2:punct _ +In this linguistic annotation of the sentence She buys and sells +books. with coordination ellipsis, there are additional non-empty +columns. The fifth column provides a concurrent part-of-speech tagging using a +non-UD tagset. The sixth column adds a pipe-separated list of morphosyntactic +features. And the ninth column encodes an enhanced dependency graph with +additional dependency relations.

+

As can be seen from the graphic rendition below, there are dependent nodes +with multiple arcs, connecting them with more than one head node. For +convenience, those additional arcs are rendered with arrowheads. +

+ +

+

The above representation makes an explicit claim that the pronoun +she is the subject of both the verb buy and +the verb sell, while the noun books acts as +the object of both these verbs.

+

The above example may be encoded in TEI XML as follows: + + + She + buys + and + sells + books + . + +

+

As in the CoNLL-U format, the enhanced dependency graph is represented by pipe-separated list values of +depE attributes. The verb form sells with +token index 4, for instance, has 0:root|2:conj as its depE +value. This is to be interpreted as follows: the node is both the dependent of a root relation to +the virtual root node with index 0 and the dependent +of a conjunct relation to the node with index 2 (the verb form +buys). Concurrent part-of-speech tags are given as +space-separated list values of pos attributes, and msd +attributes are used to encode the morphosyntactic UD features.
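The list syntax of depE lends itself to simple processing. The following Python sketch unpacks depE values into the arcs of the enhanced dependency graph. The function name and the dictionary of values are hypothetical; the values themselves follow the ninth column of the CoNLL-U example above.

```python
def parse_dep_e(value):
    """Split a depE value such as '0:root|2:conj' into (head, relation) pairs."""
    pairs = []
    for item in value.split("|"):
        # Split at the first colon only, so subtyped labels stay intact.
        head, separator, relation = item.partition(":")
        if not separator or not head or not relation:
            raise ValueError(f"not a head:relation pair: {item!r}")
        pairs.append((head, relation))
    return pairs


# depN -> depE for "She buys and sells books."
DEP_E = {
    "1": "2:nsubj|4:nsubj",
    "2": "0:root",
    "3": "4:cc",
    "4": "0:root|2:conj",
    "5": "2:obj|4:obj",
    "6": "2:punct",
}

# One (head, relation, dependent) triple per arc of the enhanced graph.
arcs = [
    (head, relation, dependent)
    for dependent, value in DEP_E.items()
    for head, relation in parse_dep_e(value)
]
heads_of_she = [head for head, relation, dependent in arcs if dependent == "1"]
```

The pronoun with token index 1 ends up with two heads, which is the property that a single depR value cannot express.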

+
+
Spoken Text

The mechanisms proposed in this chapter may also be used to encode analyses of an entirely different kind, for example discourse function. diff --git a/P5/Source/Guidelines/en/Images/dependency1.png b/P5/Source/Guidelines/en/Images/dependency1.png new file mode 100644 index 0000000000..c8ab749b68 Binary files /dev/null and b/P5/Source/Guidelines/en/Images/dependency1.png differ diff --git a/P5/Source/Guidelines/en/Images/dependency2.png b/P5/Source/Guidelines/en/Images/dependency2.png new file mode 100644 index 0000000000..d0907cdc7f Binary files /dev/null and b/P5/Source/Guidelines/en/Images/dependency2.png differ diff --git a/P5/Source/Guidelines/en/Images/dependency3.png b/P5/Source/Guidelines/en/Images/dependency3.png new file mode 100644 index 0000000000..5a5be0db7a Binary files /dev/null and b/P5/Source/Guidelines/en/Images/dependency3.png differ diff --git a/P5/Source/Guidelines/en/Images/dependency4.png b/P5/Source/Guidelines/en/Images/dependency4.png new file mode 100644 index 0000000000..f8c789c50f Binary files /dev/null and b/P5/Source/Guidelines/en/Images/dependency4.png differ diff --git a/P5/Source/Specs/att.linguistic.dependency.xml b/P5/Source/Specs/att.linguistic.dependency.xml new file mode 100644 index 0000000000..7f1c435150 --- /dev/null +++ b/P5/Source/Specs/att.linguistic.dependency.xml @@ -0,0 +1,184 @@ + + + + + + provides attributes for + annotating dependency relations between word-level elements. + + + + token index + encodes the token index or index + range of a word-level element in a syntactic dependency relation. + + is usually an + integer or a hyphen-separated range of integers; for special cases one + may also use specific notations like 5.1 (cf. the example below). + + + syntactic dependency relation + encodes a dependency relation + between the element bearing this attribute and its syntactic head in + terms of a pair consisting of the head’s token index and the type of the dependency relation. 
+ + should use a colon to + separate the head’s token index and the type of + the dependency relation. + + + enhanced syntactic dependency relations + encodes enhanced syntactic + dependency relations between the element bearing this attribute and its + syntactic heads in terms of a list of pairs each consisting of the + head’s token index and the type of the dependency + relation. + + should use a colon to + separate the head’s token index and the type of + the dependency relation, and a pipe symbol as a list-item separator. + + + +

Universal Dependencies (UD) annotation of the English sentence She buys + books..

+ + + She + buys + books + . + + + + +

Universal Dependencies (UD) annotation of the English sentence She buys + and sells books..

+ + + She + buys + and + sells + books + . + + +
+ +

Universal Dependencies (UD) annotation of the English sentence + She likes coffee and he tea.

+

This gapping example shows the difference between basic dependency + relations, here with an elided (absent) second occurrence of + likes, and enhanced dependency relations, here + featuring an empty syntactic atom (with token index 5.1). (For + the representation of empty nodes in UD, cf. + .)

+ + + She + likes + coffee + and + he + likes + tea + . + + +
+ +

Universal Dependencies (UD) annotation of the German noun phrase ein Hotel zum Wohlfühlen.

+

This example illustrates the handling of a surface word form like + zum (with token index range 3-4) which + is analysed here as a contraction of the syntactic atoms zu + and dem (with token indices 3 and + 4). (For the handling of index ranges in UD, cf. + .)

+ + + ein + Hotel + zum + zu + dem + Wohlfühlen + + +
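The three notations for depN values seen in these examples (a plain index, a hyphen-separated index range, and a decimal index for an empty node) can be told apart mechanically. The following Python sketch is one possible check. The function name and the category labels are hypothetical, and the patterns cover only the notations illustrated here.

```python
import re


def classify_dep_n(value):
    """Classify a depN value as 'token', 'range' or 'empty node'."""
    if re.fullmatch(r"[0-9]+", value):
        return "token"        # for example "3"
    if re.fullmatch(r"[0-9]+-[0-9]+", value):
        return "range"        # for example "3-4", a contracted surface form
    if re.fullmatch(r"[0-9]+\.[0-9]+", value):
        return "empty node"   # for example "5.1", an elided element
    raise ValueError(f"unrecognised depN value: {value!r}")


kinds = [classify_dep_n(value) for value in ("3", "3-4", "5.1")]
```

A stricter check might also require that the end of a range is greater than its start; that refinement is left out here.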
+ +

In the UD examples above, a depR value of 0:root by + convention refers to an unrepresented, virtual, root + node. (For discussion, cf. section .)

+

The following table summarises the serialisation choices for UD’s CoNLL-U format in TEI XML. + + + CoNLL-U column + TEI representation + TEI class + + + ID + depN + att.linguistic.dependency + + + FORM + w or pc content + model.segLike + + + LEMMA + lemma + att.linguistic + + + UPOS + pos + att.linguistic + + + XPOS + additional list item in pos + att.linguistic + + + FEATS + msd + att.linguistic + + + HEAD + depR, part before : + att.linguistic.dependency + + + DEPREL + depR, part after : + att.linguistic.dependency + + + DEPS + depE + att.linguistic.dependency + + + SpaceAfter=No in MISC + join with value right + att.linguistic + +
+

+

The syntax of the FEATS and DEPS columns, with + pipe-separated list values, can be used as-is for the values of + msd and depE. The syntax of the items in + depE values, with a colon separating the head’s token index and + the type of the dependency relations, should also be used for depR.

+
+ + + + diff --git a/P5/Source/Specs/att.linguistic.xml b/P5/Source/Specs/att.linguistic.xml index 95f0fb7140..ebc89989c6 100644 --- a/P5/Source/Specs/att.linguistic.xml +++ b/P5/Source/Specs/att.linguistic.xml @@ -5,6 +5,7 @@ provides a set of attributes concerning linguistic features of tokens, for usage within token-level elements, specifically w and pc in the analysis module. + @@ -173,12 +174,11 @@

These attributes make it possible to encode simple language corpora and to add a layer of linguistic information to any tokenized resource. See section for discussion.

-

These guidelines provide no semantic basis or suggested - precedence when both lemma and lemmaRef are - provided. For this reason simultaneous use of both is not - recommended for interchange unless documentation explaining the - use is provided, probably in an ODD customization.

- +

These guidelines provide no semantic basis or suggested precedence when both lemma + and lemmaRef are provided. For this reason simultaneous use of both is not + recommended for interchange unless documentation explaining the use is provided, probably in + an ODD customization.

+