Language

Christian Lück edited this page Jan 5, 2022 · 1 revision

Language and script direction

According to the TEI Guidelines, the writing direction of a script should be encoded through the @xml:lang attribute. In addition, the languages used in a TEI-encoded document should be listed in teiHeader/profileDesc/langUsage.

Here is an example header:

  <profileDesc xml:id="profileDesc">
    <langUsage xml:id="langUsage">
      <language ident="ar">Arabisch</language>
      <language ident="ar-Latn">Arabisch in Umschrift nach Brockelmann/Wehr</language>
      <language ident="de">Deutsch</language>
      <language ident="en">Englisch</language>
    </langUsage>
  </profileDesc>
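
To check a document against this list, the declared codes can be read from the header and compared with the @xml:lang values actually used. The following is a minimal sketch using Python's standard library. The function names `declared_languages` and `undeclared_languages` are hypothetical and not part of the framework, and the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def declared_languages(root):
    """Return the set of @ident values of all TEI language elements."""
    return {
        lang.get("ident")
        for lang in root.iter(f"{{{TEI_NS}}}language")
        if lang.get("ident")
    }


def undeclared_languages(root):
    """Return @xml:lang values that are used but not declared in the header."""
    used = {el.get(XML_LANG) for el in root.iter() if el.get(XML_LANG)}
    return used - declared_languages(root)


# Invented sample document (assumption, for illustration only).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><langUsage>
    <language ident="ar">Arabisch</language>
    <language ident="de">Deutsch</language>
  </langUsage></profileDesc></teiHeader>
  <text><body>
    <p xml:lang="de">Ein Vers: <foreign xml:lang="ar">...</foreign></p>
    <p xml:lang="en">Not declared above.</p>
  </body></text>
</TEI>"""

root = ET.fromstring(SAMPLE)
print(sorted(declared_languages(root)))
print(sorted(undeclared_languages(root)))
```

In this sample, the second paragraph uses a language code that is missing from langUsage, so it is reported as undeclared.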

The framework offers a function for setting the @xml:lang attribute by selecting a language from the list of languages in the header.

  • Change language author mode action
    • is available in the toolbar: languageicon (Note: The icon was designed by Onur Mustak Cobanli and is distributed on http://languageicon.org/ under a CC licence with a Relax-Attribution term.)
    • is available through content completion (Return)
    • is available in the TEI P5 menu
  • content completion is active in text mode

CSS

To get proper rendering in author mode, you should provide CSS for the languages used through the project-specific CSS file. Here is an example:

@namespace xml "http://www.w3.org/XML/1998/namespace";

[xml|lang="ar"] {
    direction: rtl !important;
}

[xml|lang="de"] {
    direction: ltr !important;
}

[xml|lang="en"] {
    direction: ltr !important;
}

[xml|lang="ar-Latn"] {
    direction: ltr !important;
}
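
Writing these rules by hand gets repetitive as the list of languages grows. As a rough sketch, the rules could be generated from the language codes. The set of right-to-left scripts and the language defaults below are assumptions made for this illustration and cover only a few cases; the helper names `direction` and `css_rules` are hypothetical and not something the framework provides.

```python
# Assumption: a small, incomplete set of right-to-left ISO 15924 script codes.
RTL_SCRIPTS = {"arab", "hebr", "syrc", "thaa"}
# Assumption: a few languages whose default script is right-to-left.
RTL_LANGUAGES = {"ar", "he", "fa", "ur"}


def direction(tag):
    """Guess 'rtl' or 'ltr' for a language tag such as 'ar' or 'ar-Latn'."""
    subtags = tag.lower().split("-")
    # An explicit four-letter script subtag takes precedence over the language.
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            return "rtl" if subtag in RTL_SCRIPTS else "ltr"
    return "rtl" if subtags[0] in RTL_LANGUAGES else "ltr"


def css_rules(tags):
    """Build one direction rule per language tag, in the style shown above."""
    lines = ['@namespace xml "http://www.w3.org/XML/1998/namespace";']
    for tag in tags:
        lines.append("")
        lines.append(f'[xml|lang="{tag}"] {{')
        lines.append(f"    direction: {direction(tag)} !important;")
        lines.append("}")
    return "\n".join(lines)


print(css_rules(["ar", "ar-Latn", "de", "en"]))
```

The generated output should be reviewed before use, since the direction is only guessed from the two small lookup sets.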

Structure of language codes

Do you wonder how to get language codes right? Are ar-Latn, he-Arab, got-Goth, DE-zyyy-de syntactically correct? (Yes, they all are! Subtags are case-insensitive, but the script subtag must come before the region.)

The specification is in RFC 5646.

The important part is the following excerpt of the grammar, given in ABNF:

 langtag       = language
                 ["-" script]
                 ["-" region]
                 *("-" variant)
                 *("-" extension)
                 ["-" privateuse]

 language      = 2*3ALPHA            ; shortest ISO 639 code
                 ["-" extlang]       ; sometimes followed by
                                     ; extended language subtags
               / 4ALPHA              ; or reserved for future use
               / 5*8ALPHA            ; or registered language subtag

 extlang       = 3ALPHA              ; selected ISO 639 codes
                 *2("-" 3ALPHA)      ; permanently reserved

 script        = 4ALPHA              ; ISO 15924 code

 region        = 2ALPHA              ; ISO 3166-1 code
               / 3DIGIT              ; UN M.49 code

 ...
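
The excerpt above can be turned into a quick syntax check. The following is a minimal sketch that covers only the language, script and region rules quoted here; variants, extensions and private-use subtags from the elided part of the grammar are deliberately not handled. The names `LANGTAG_EXCERPT` and `parse_langtag` are hypothetical.

```python
import re

# Covers only: language ["-" script] ["-" region] from the excerpt above.
LANGTAG_EXCERPT = re.compile(
    r"""
    ^(?P<language>
        [A-Za-z]{2,3}                              # shortest ISO 639 code
        (?:-[A-Za-z]{3}(?:-[A-Za-z]{3}){0,2})?     # optional extlang
      | [A-Za-z]{4}                                # reserved for future use
      | [A-Za-z]{5,8}                              # registered language subtag
    )
    (?:-(?P<script>[A-Za-z]{4}))?                  # ISO 15924 code
    (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?         # ISO 3166-1 or UN M.49 code
    $
    """,
    re.VERBOSE,
)


def parse_langtag(tag):
    """Return the subtags as a dict, or None if the tag does not match."""
    match = LANGTAG_EXCERPT.match(tag)
    return match.groupdict() if match else None


for tag in ["ar-Latn", "he-Arab", "got-Goth", "ar_Latn", "ar-Latn-Arab"]:
    print(tag, parse_langtag(tag))
```

A match only shows that a tag is well-formed; whether the individual subtags are actually assigned codes is a separate question.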

So to be sure that he-Arab (Hebrew language in Arabic script) is correct, you will need at least two other documents: ISO 639 for the language code and ISO 15924 for the script code, as named in the comments of the grammar.

Bidirectional embeddings and overrides

I recommend not using Unicode control characters for bidirectional embeddings or overrides in the TEI source file. The reason: they are not directly visible, so things can easily get broken. These characters also impose an embedded context-free grammar (CFG) of their own, like parenthesized expressions. To me, it seems a better approach to place pairs of them into CSS or generated HTML, where it is much easier to ensure that they always occur in pairs. That is how we proceed in our edition of Arabic poems.
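
Since these characters are invisible in most editors, a small check can help to find them in a source file and to test whether they are paired. The following is a minimal sketch, not part of the framework. It only looks at the explicit embedding and override characters (U+202A, U+202B, U+202D, U+202E) and their closing character U+202C; directional isolates and marks are left out. The function names and the sample string are hypothetical.

```python
import unicodedata

# Explicit directional embeddings and overrides: LRE, RLE, LRO, RLO.
OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}
# POP DIRECTIONAL FORMATTING closes the most recent embedding or override.
CLOSER = "\u202c"


def find_bidi_controls(text):
    """Return (line, column, character name) for each embedding/override code."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in OPENERS or ch == CLOSER:
                hits.append((lineno, col, unicodedata.name(ch)))
    return hits


def is_balanced(text):
    """Check that openers and closers pair up like matching parentheses."""
    depth = 0
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch == CLOSER:
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Invented sample: the first line is paired, the second line is left open.
sample = "<l>abc \u202bdef\u202c ghi</l>\n<l>jkl \u202emno</l>"
for hit in find_bidi_controls(sample):
    print(hit)
print(is_balanced(sample))
```

Run over a TEI source file, an empty result from `find_bidi_controls` would confirm that the recommendation is being followed; `is_balanced` is more useful for checking generated HTML.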
