Support refsDecl/citeStructure for segmentation #364

@cmil

The current segmentation algorithm (used to extract co-occurrence networks) accommodates the markup conventions of some historical corpora and otherwise defaults to div elements that either have direct sp children or are explicitly marked as @type="scene" and contain no sp elements.

dracor-api/modules/util.xqm

Lines 312 to 328 in f3abc6c

declare function dutil:get-segments ($tei as element()*) as element()* {
  if(not($tei//tei:body//(tei:div|tei:div1))) then
    (: missing segmentation :)
    $tei//tei:body
  else if($tei//tei:body//tei:div2[@type="scene"]) then
    (: romdracor :)
    (: plautus-trinummus has the prologue coded as div1 which is why we
     : recognize div1 without div2 children as segment
     :)
    $tei//tei:body//(tei:div2[@type="scene"]|tei:div1[tei:sp and not(tei:div2)])
  else if ($tei//tei:body//tei:div1) then
    (: greekdracor :)
    $tei//tei:body//tei:div1
  else
    (: for all others we rely on divs having sp children :)
    $tei//tei:body//tei:div[tei:sp or (@type="scene" and not(.//tei:sp))]
};

While this has worked for most corpora, some use div elements more extensively, so the current heuristic cannot reliably identify their segments or scenes. For these cases we could support the citeStructure element in the refsDecl, which lets a document declare its segments explicitly.

For instance, the following declaration would ensure that only div elements with @type="scene" are considered segments:

<encodingDesc>
  <refsDecl>
    <citeStructure unit="scene" match="/tei:TEI//tei:body//tei:div[@type='scene']" use="@n"/>
  </refsDecl>
</encodingDesc>

For (the rebooted) GreekDraCor (see dracor-org/greekdracor#22) this declaration could be used:

<encodingDesc>
  <refsDecl>
    <citeStructure
      unit="scene"
      match="/tei:TEI//tei:body//tei:div[@type='textpart' and @subtype =('episode', 'choral')]"
      use="position()" />
  </refsDecl>
</encodingDesc>
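
To make the proposal more concrete, here is a minimal sketch of how declarations like the two above could be read and given precedence over the existing heuristic. It is written in Python with the standard-library ElementTree purely for illustration; an actual implementation would presumably live in the XQuery module. The function names `declared_segment_rules` and `segmentation_strategy` are hypothetical, and the sketch only extracts the declared `@match` expression rather than evaluating it, since ElementTree supports just a subset of XPath.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

# Sample document using the first declaration from this issue.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <encodingDesc>
      <refsDecl>
        <citeStructure unit="scene"
          match="/tei:TEI//tei:body//tei:div[@type='scene']" use="@n"/>
      </refsDecl>
    </encodingDesc>
  </teiHeader>
  <text><body>
    <div type="act"><div type="scene" n="1"><sp/></div></div>
  </body></text>
</TEI>
"""


def declared_segment_rules(tei_root, unit="scene"):
    """Return (match, use) pairs for citeStructure elements declaring `unit`.

    Hypothetical helper: only citeStructure elements inside
    encodingDesc/refsDecl are considered.
    """
    path = ".//tei:encodingDesc/tei:refsDecl//tei:citeStructure"
    rules = []
    for cs in tei_root.iterfind(path, NS):
        if cs.get("unit") == unit and cs.get("match"):
            rules.append((cs.get("match"), cs.get("use")))
    return rules


def segmentation_strategy(tei_root):
    """Prefer an explicit declaration; otherwise fall back to the heuristic.

    Returns ("declared", <match expression>) or ("heuristic", None).
    Assumption: the first matching citeStructure wins.
    """
    rules = declared_segment_rules(tei_root)
    if rules:
        return ("declared", rules[0][0])
    return ("heuristic", None)


if __name__ == "__main__":
    print(segmentation_strategy(ET.fromstring(SAMPLE)))
```

Documents without a matching citeStructure would keep going through the current heuristic unchanged, so existing corpora should not be affected by this ordering.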
