diff --git a/P5/Source/Guidelines/en/AI-AnalyticMechanisms.xml b/P5/Source/Guidelines/en/AI-AnalyticMechanisms.xml
index 65ea77daed..508c1bcbd8 100644
--- a/P5/Source/Guidelines/en/AI-AnalyticMechanisms.xml
+++ b/P5/Source/Guidelines/en/AI-AnalyticMechanisms.xml
@@ -1,6 +1,8 @@
-
+
+
This chapter describes a module for associating simple analyses and
@@ -23,7 +25,7 @@ text segments according to the familiar linguistic categories of
generic
Section
The
The
Additionally, these elements also have access to the
Additionally, these elements also have access to the
The
+
For verse texts, the overlapping of metrical and syntactic structure
requires that special care be given to representing both using an
element hierarchy. One simple approach is to split the syntactic phrases
-into fragments when they cross verse boundaries, reuniting them
+into fragments when they cross verse boundaries, reuniting them
with the Using traditional terminology, these attributes provide a convenient
@@ -192,16 +194,16 @@ Such detailed encodings as the following may require careful
formatting if they are to be easily readable, however.
The The A similar encoding can be used for hyphenation:
-
+
Nineteen fifty-four, when I was eighteen years old,
is held to be a crucial turning point in the history of
the Afro-American — for the U.S.A. as a whole — the
@@ -143,7 +145,7 @@ phrases, with no need to include segmentation at a higher level as well.
-
-
+
http://lexicon.org/latin.xml, we might for example revise the above
example as:
-
+
-mended.
+mended.
-Spans may also be used to represent structural divisions within
+Spans may also be used to represent structural divisions within
a narrative, particularly when these do not coincide with the
structure implied by the element structure. Consider the following narrative:
@@ -807,8 +810,8 @@ as borderline cases both the formal structural properties of the text
information about its context (the circumstances of its production, its
genre or medium). The structural properties of any TEI-conformant text
should be represented using the structural elements discussed elsewhere
-in this chapter and in chapters
Our discussion focuses
on the way that this sentence might be analysed using the CLAWS system
developed at the University of Lancaster, but exactly the same
principles may be applied to a wide variety of other systems.
This may be easily transformed into an equivalent TEI XML representation:
-
Treating the labels on the brackets as phrase or clause
interpretations, this analysis of the structure of the example sentence
@@ -913,37 +916,37 @@ can be combined with the word class analysis and represented as follows
phrase, has been replaced by
-
-
The essence of lightweight linguistic annotation is that the basic grammatical information
is encapsulated at the word level, together with the orthographic shape of the word. This
has clear advantages for automatic processing but, on the other hand, this form of data
@@ -1039,8 +1042,8 @@ part-of-speech or morphosyntactic descriptions, single values have to fit into t
attributes). Another important principle that this kind of annotation is sensitive to is the
need for (near) homomorphism between the assumed tokenization (division of the text stream
into minimal units) and the division into minimal syntactic units (
-The basic grammatical information encoded by means of
+The basic grammatical information encoded by means of
Together with the span-delimiting elements mentioned in this section, such as
-Some of the above-mentioned robust methods will also prove handy in cases where more than one tagset
-(label inventory) is used to label the words, or where automatic morphological analysis yields multiple
-possibilities (for example, the form
+Some of the above-mentioned robust methods will also prove handy in cases where more than one tagset
+(label inventory) is used to label the words, or where automatic morphological analysis yields multiple
+possibilities (for example, the form
-It should be borne in mind that tokenization, lemmatization, part-of-speech identification, and
-morphosyntactic labelling, especially when performed automatically, should in most cases be seen as
-involving pragmatic decisions, dictated by concrete practical goals, economy of description, or the
-demands of particular analytic and/or visualization tools. It comes therefore as no surprise that
-numerous alternative (and often conflicting) lemmatization strategies and tagsets exist, in use by
-various communities and various tools, and that they change with time (a case in point is the CLAWS
-tagset for English, with several versions that merge the part-of-speech and morphosyntactic information
+
+It should be borne in mind that tokenization, lemmatization, part-of-speech identification, and
+morphosyntactic labelling, especially when performed automatically, should in most cases be seen as
+involving pragmatic decisions, dictated by concrete practical goals, economy of description, or the
+demands of particular analytic and/or visualization tools. It comes therefore as no surprise that
+numerous alternative (and often conflicting) lemmatization strategies and tagsets exist, in use by
+various communities and various tools, and that they change with time (a case in point is the CLAWS
+tagset for English, with several versions that merge the part-of-speech and morphosyntactic information
to various degrees).
The last of the att.linguistic attributes,
Compare the following two listings. The first difference between them is in the
tagset used (CLAWS-5 vs. CLAWS-7); it serves only to exemplify the need to document the
choice of descriptive vocabulary in the header, lest the encoded information be unreadable or
confusing. The second difference is in the treatment of inter-token
whitespace, and it is here that the
The first example listing uses CLAWS-5 and inline annotation, where whitespace serves as
part of the markup:
In the second example, the attribute
-The first fragment below comes from "Gottfried, Newe Welt Vnd Americanische Historien. Frankfurt/M., 1631"
+The first fragment below comes from "Gottfried, Newe Welt Vnd Americanische Historien. Frankfurt/M., 1631"
encoded in the Deutsches Textarchiv and records normalized forms in the
The following example comes from the EarlyPrint project and uses the attribute
The following example comes from the EarlyPrint project and uses the attribute
For the purpose of encoding linguistic annotations of syntactic dependency
+relations between word-level elements, the Guidelines provide attributes that
+belong to the
Syntactic dependency relations such as
+
nsubj
, obj
,
+punct
).obj
and connecting
+the nodes labelled with the tokens
An equivalent rendition of this graph is given below, where the direction of
+the arrowless arcs is implicitly given by the vertical dimension. For
+convenience, the horizontal positions of the nodes are made explicit in terms of
+numerical, 1-based, token indices.
+
+
Using the
+
+
A widely used notation for annotations of dependency relations and other
+linguistic information is the plain-text, column-based _. Note that by convention, the implicit root
+
This
+
root
. Except for the
In order to encode the linguistic annotation from the above CoNLL-U example,
+the attributes
+
+
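As a reader-side illustration of the column layout that the CoNLL-U examples rely on, the following minimal Python sketch splits one token line into the ten standard CoNLL-U fields. The function name `parse_token_line` and the sample line are hypothetical and are not part of this patch. Treating a bare underscore as "unspecified" in every column except FORM is a simplifying assumption.

```python
# The ten standard CoNLL-U column names, in file order.
CONLLU_COLUMNS = (
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
)


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a dict keyed by column name.

    Assumption: a bare underscore means "unspecified" and is mapped to None,
    except in FORM, where it could be a literal underscore token.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CONLLU_COLUMNS):
        raise ValueError(
            f"expected {len(CONLLU_COLUMNS)} columns, got {len(fields)}"
        )
    return {
        name: (None if (value == "_" and name != "FORM") else value)
        for name, value in zip(CONLLU_COLUMNS, fields)
    }


# Invented sample line, for illustration only.
sample = "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
token = parse_token_line(sample)
```

The HEAD value is kept as a string here so that the caller decides how to resolve it, for example against the 1-based token indices used in the graphic renditions.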
A more complex example in the CoNLL-U format is given below:
As can be seen from the graphic rendition below, there are dependent nodes
+with multiple arcs, connecting them with more than one head node. For
+convenience, those additional arcs are rendered with arrowheads.
+
+
The above representation makes an explicit claim that the pronoun
+
The above example may be encoded in TEI XML as follows:
+
+
+
As in the CoNLL-U format, the enhanced dependency graph is represented by pipe-separated list values of
+
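The pipe-separated list syntax of the CoNLL-U DEPS column can be unpacked with a few lines of Python. This is only a sketch: the helper name `parse_deps` and the sample values are hypothetical, and the head is kept as a string on the assumption that enhanced graphs may refer to empty nodes with decimal identifiers.

```python
def parse_deps(deps):
    """Parse a CoNLL-U DEPS value into a list of (head, relation) pairs.

    Each list item has the shape head:relation. Only the first colon
    separates the head from the relation, because relation labels may
    themselves contain colons (for example "conj:and").
    """
    if deps in (None, "", "_"):
        return []
    pairs = []
    for item in deps.split("|"):
        head, _, relation = item.partition(":")
        pairs.append((head, relation))
    return pairs


# Invented value: one dependent attached to two different heads.
arcs = parse_deps("2:nsubj|4:nsubj")
```

A dependent with more than one incoming arc, as in the graphic rendition above, simply yields a list with more than one pair.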
The mechanisms proposed in this chapter may also be used to encode
analyses of an entirely different kind, for example discourse function.
diff --git a/P5/Source/Guidelines/en/Images/dependency1.png b/P5/Source/Guidelines/en/Images/dependency1.png
new file mode 100644
index 0000000000..c8ab749b68
Binary files /dev/null and b/P5/Source/Guidelines/en/Images/dependency1.png differ
diff --git a/P5/Source/Guidelines/en/Images/dependency2.png b/P5/Source/Guidelines/en/Images/dependency2.png
new file mode 100644
index 0000000000..d0907cdc7f
Binary files /dev/null and b/P5/Source/Guidelines/en/Images/dependency2.png differ
diff --git a/P5/Source/Guidelines/en/Images/dependency3.png b/P5/Source/Guidelines/en/Images/dependency3.png
new file mode 100644
index 0000000000..5a5be0db7a
Binary files /dev/null and b/P5/Source/Guidelines/en/Images/dependency3.png differ
diff --git a/P5/Source/Guidelines/en/Images/dependency4.png b/P5/Source/Guidelines/en/Images/dependency4.png
new file mode 100644
index 0000000000..f8c789c50f
Binary files /dev/null and b/P5/Source/Guidelines/en/Images/dependency4.png differ
diff --git a/P5/Source/Specs/att.linguistic.dependency.xml b/P5/Source/Specs/att.linguistic.dependency.xml
new file mode 100644
index 0000000000..7f1c435150
--- /dev/null
+++ b/P5/Source/Specs/att.linguistic.dependency.xml
@@ -0,0 +1,184 @@
+
+
+
+
+ Universal Dependency (UD) annotation of the English sentence Universal Dependency (UD) annotation of the English sentence Universal Dependency (UD) annotation of the English sentence
+ This gapping example shows the difference between basic dependency
+ relations, here with an elided (absent) second occurrence of
+ Universal Dependency (UD) annotation of the German noun phrase This example illustrates the handling of a surface word form like
+ In the UD examples above, a The following table summarises the serialisation choices for UD’s CoNLL-U format in TEI XML.
+
+
+
+
+
+
+
+
+ ID | FORM | LEMMA | UPOS | XPOS | FEATS | HEAD:DEPREL:DEPS | SpaceAfter=No in MISC
The syntax of the FEATS and DEPS columns, with
+ pipe-separated list values, can be used as-is for the values of
+
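For readers unfamiliar with the FEATS syntax, the following Python sketch shows how a pipe-separated list of Name=Value pairs can be read, and how the `SpaceAfter=No` flag in MISC can be detected with the same routine. The function names `parse_feats` and `space_after` are hypothetical. How these values are carried over into TEI attributes is defined by the patch itself and is not reproduced here.

```python
def parse_feats(feats):
    """Parse a pipe-separated Name=Value list (CoNLL-U FEATS or MISC) into a dict."""
    if feats in (None, "", "_"):
        return {}
    result = {}
    for item in feats.split("|"):
        name, _, value = item.partition("=")
        result[name] = value
    return result


def space_after(misc):
    """Return False when MISC carries SpaceAfter=No, and True otherwise."""
    return parse_feats(misc).get("SpaceAfter") != "No"


# Invented values, for illustration only.
features = parse_feats("Case=Nom|Number=Plur")
```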
These attributes make it possible to encode simple language corpora and to add a layer of
linguistic information to any tokenized resource. See section
These guidelines provide no semantic basis or suggested
- precedence when both
These guidelines provide no semantic basis or suggested precedence when both