The current segmentation algorithm (used for extracting co-occurrence networks) accommodates the markup conventions of some historic corpora and otherwise defaults to div elements that either have direct sp children or are marked as @type "scene" and contain no sp elements at all:
declare function dutil:get-segments ($tei as element()*) as element()* {
  if(not($tei//tei:body//(tei:div|tei:div1))) then
    (: missing segmentation :)
    $tei//tei:body
  else if($tei//tei:body//tei:div2[@type="scene"]) then
    (: romdracor :)
    (: plautus-trinummus has the prologue coded as div1 which is why we
     : recognize div1 without div2 children as segment
     :)
    $tei//tei:body//(tei:div2[@type="scene"]|tei:div1[tei:sp and not(tei:div2)])
  else if ($tei//tei:body//tei:div1) then
    (: greekdracor :)
    $tei//tei:body//tei:div1
  else
    (: for all others we rely on divs having sp children :)
    $tei//tei:body//tei:div[tei:sp or (@type="scene" and not(.//tei:sp))]
};
While this has worked for most corpora, some use div elements more extensively, so the current heuristic is not sufficient to identify segments or scenes correctly. For these cases we could support the citeStructure element in the refsDecl.
For instance, the following declaration would make sure that only div elements with a @type "scene" are considered segments:
<encodingDesc>
  <refsDecl>
    <citeStructure unit="scene" match="/tei:TEI//tei:body//tei:div[@type='scene']" use="@n"/>
  </refsDecl>
</encodingDesc>
For (the rebooted) GreekDraCor (see dracor-org/greekdracor#22) this declaration could be used:
<encodingDesc>
  <refsDecl>
    <citeStructure
      unit="scene"
      match="/tei:TEI//tei:body//tei:div[@type='textpart' and @subtype =('episode', 'choral')]"
      use="position()" />
  </refsDecl>
</encodingDesc>
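To make the proposal more concrete, here is a minimal sketch of how such a declaration might be evaluated. It is not the dracor-api implementation: the function name local:get-declared-segments is hypothetical, and the sketch assumes eXist-db, whose util:eval function evaluates an XPath string and, as far as I understand its documentation, makes in-scope variables and namespace declarations visible to the evaluated expression. It also assumes that the TEI element is the root of a stored document, so that the absolute match path can be appended to the document node.

```xquery
xquery version "3.1";

(: Sketch only: relies on eXist-db's util:eval. Other processors would
 : need their own dynamic-evaluation function. :)
import module namespace util = "http://exist-db.org/xquery/util";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Hypothetical helper, not part of dracor-api.
 : Returns the elements selected by the first citeStructure with
 : unit="scene", or the empty sequence if there is no such declaration. :)
declare function local:get-declared-segments(
  $tei as element(tei:TEI)
) as element()* {
  let $match := (
    $tei/tei:teiHeader/tei:encodingDesc/tei:refsDecl
      /tei:citeStructure[@unit = "scene"]/@match
  )[1]
  (: Assumption: root($tei) is a document node, so a match path such as
   : "/tei:TEI//tei:body//tei:div[@type='scene']" resolves against it. :)
  let $root := root($tei)
  return
    if ($match) then
      (: "$root" is a variable reference inside the evaluated string,
       : followed by the declared path. :)
      util:eval("$root" || $match)
    else
      ()
};
```

dutil:get-segments could call a helper like this first and fall back to the existing heuristic when it returns nothing. Since util:eval executes whatever expression the document declares, a real implementation would probably want to restrict or validate the @match value before evaluating it.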
The dutil:get-segments function quoted above is from dracor-api/modules/util.xqm (lines 312 to 328 in f3abc6c).