This program takes a dictionary written in a custom file format and converts it to JSON; the file format is intended to make writing dictionary entries as easy and terse as possible without compromising flexibility.
This is a Rust project, so install Rust on your system and build it using cargo:
$ cargo build --release
This creates a generator binary in target/release/generator. You can then invoke the program by passing it a file and it will print the generated dictionary to stdout:
$ ./target/release/generator my_dictionary.dict.txt
In addition to a standalone executable, the generator proper is a standalone crate that can be consumed as a library; the implementation of the generator binary in src/main.rs serves as a good example that demonstrates the API of the library.
Lastly, this repository also bundles a WASM plugin for use with Typst. This enables adding dictionaries to a Typst project without requiring a separate build step. Our Typst template library that we use for all of our projects includes bindings for it; just search for instantiate-dictionary-plugin.
If you don't want to depend on our library, you can build the plugin yourself; simply run:
$ cargo build --release --target wasm32-unknown-unknown --lib
The compiled plugin can then be found in target/wasm32-unknown-unknown/release/dictgen.wasm. The functions that are exposed by the plugin can be found in src/wasm.rs.
If you’re using VS Code, an extension that provides syntax highlighting can be found in vscode-plugin. You can build it by running vsce package, which creates a dictionary-file-format-<version>.vsix file. To install it, right-click on that file in VS Code and select ‘Install Extension VSIX’ at the bottom of the drop-down.
The following is an exhaustive description of the dictionary file format; a dictionary file consists of 2 sections:
- an optional preamble containing directives which define text-to-text transformations for e.g. converting text in the source language to IPA;
- the actual dictionary entries; these can be in any order as the generator will sort them alphabetically.
In either part, any text in a line after a # is ignored in most circumstances.
The preamble consists of any number of the directives described below; all directives start with a $ sign, which must be the very first character in a line, i.e. it may not be preceded by whitespace.
The $ipa directive defines a set of transformations for converting text in the language described by the dictionary to IPA. The syntax of the directive is:
$ipa {
<transformations>
}
All supported transformations are described in the ‘String Replacement Ops’ section below.
The $preprocess directive runs string-to-string transformations on the parts of a full entry (see the ‘Full Entry’ section below) before they are parsed. The syntax of this directive is:
$preprocess {
<part> { <transformations> }
# ...
}
Here <part> must be one of:
pos (part of speech), etym (etymology), def (definition), forms, ipa.
The <transformations> are then applied to the contents of the <part>; a single $preprocess
directive may contain multiple parts. All supported transformations are described in the ‘String Replacement Ops’ section below.
The $collate directive can be used to alter the collation, i.e. sort order, of dictionary entries. This directive is of the form
$collate {
    by "..."
    preprocess {
        <transformations>
    }
}
Both the by and preprocess clauses are optional. The contents of the preprocess clause are string transformations; all supported transformations are described in the ‘String Replacement Ops’ section below. The by clause consists of a list of characters enclosed in "".
The effect of this directive is as follows: if a preprocess clause is present, the transformations therein are applied to each lemma. If a by clause is present, every character in the lemma that does not appear in the by clause is deleted, and the sort order of the words is based on the order of characters in the by clause.
For example, the following collation
$collate {
    by "tmoae"
    preprocess { lower }
}
has the following effect: First, the lemma is converted to lowercase; then, all characters in it that are not t, m, o, a, or e are deleted. Finally, words are sorted according to that order; e.g. if we have the 3 entries ta, mo, to in the dictionary, they will be sorted as to, ta, mo.
Note that the preprocess clause of a $collate directive is only used for sorting. It does not affect the final lemmas that end up in the output. This means you can use it to simplify the words quite a bit by deleting ‘unnecessary’ parts that aren’t needed for sorting.
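The collation example above can be sketched in a few lines of Rust. This is only an illustration of the described behaviour, not the generator's actual implementation: the names collation_key and sort_lemmas are made up for this example, a plain to_lowercase() call stands in for the preprocess { lower } clause, and the error for words that end up empty is not modelled.

```rust
/// Build a sort key for `lemma` under a `by` clause: lowercase the lemma
/// (standing in for `preprocess { lower }`), drop every character that is
/// not listed in `by`, and map the rest to their position in `by`.
/// (Hypothetical helper, not part of the generator's API.)
fn collation_key(lemma: &str, by: &str) -> Vec<usize> {
    let order: Vec<char> = by.chars().collect();
    lemma
        .to_lowercase()
        .chars()
        .filter_map(|c| order.iter().position(|&o| o == c))
        .collect()
}

/// Sort lemmas by their collation key; the lemmas themselves are unchanged.
fn sort_lemmas<'a>(mut lemmas: Vec<&'a str>, by: &str) -> Vec<&'a str> {
    lemmas.sort_by_key(|l| collation_key(l, by));
    lemmas
}

fn main() {
    let sorted = sort_lemmas(vec!["ta", "mo", "to"], "tmoae");
    println!("{:?}", sorted); // ["to", "ta", "mo"]
}
```

Note that the key is only used for comparison; the original lemmas are returned untouched, matching the remark that collation does not affect the output.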
The generator will emit an error if a collation ends up deleting all characters in a word. If you don’t specify a collation, the default collation is:
$collate {
    preprocess {
        nfkd
        remove_punct
        nfc
        lower
    }
}
The $declare directive declares a custom macro; to combat typos, the generator errors if it encounters a macro it doesn’t recognise (see the ‘Macros’ section below). This directive provides a way to tell the generator that a certain macro should be accepted. Its syntax is
$declare <name> <args>
where <name> is the name of a macro without the \, and <args> is a non-negative integer that describes how many arguments the macro takes; e.g.
$declare foo 2
declares a macro \foo that takes 2 arguments. That is, the generator will then accept e.g. \foo{bar}{baz}, but not \foo, \foo{bar}, or \foo{bar}{baz}{quux}.
Note that this does not define what the macro actually does! The generator will emit a custom_macro node that you will then have to handle wherever you actually make use of the JSON output. The API of our Typst library allows you to define a custom-macro-handler for this purpose.
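The argument-count check that $declare enables could look roughly like the following. This is a sketch under assumptions: parse_declare and check_macro are hypothetical names, the error messages are invented, and the real generator may parse and report things differently.

```rust
/// Parse a line such as `$declare foo 2` into a macro name and an argument
/// count. (Hypothetical helper; the `$` must be the first character.)
fn parse_declare(line: &str) -> Option<(String, usize)> {
    let mut parts = line.split_whitespace();
    if !line.starts_with('$') || parts.next()? != "$declare" {
        return None;
    }
    let name = parts.next()?.to_string();
    let arity = parts.next()?.parse::<usize>().ok()?;
    Some((name, arity))
}

/// Check a use of `\name` with `given` arguments against the declarations.
fn check_macro(declared: &[(String, usize)], name: &str, given: usize) -> Result<(), String> {
    match declared.iter().find(|(n, _)| n.as_str() == name) {
        None => Err(format!("unknown macro \\{}", name)),
        Some((_, expected)) if *expected != given => Err(format!(
            "\\{} takes {} argument(s), but {} were given",
            name, expected, given
        )),
        Some(_) => Ok(()),
    }
}

fn main() {
    let declared = vec![parse_declare("$declare foo 2").unwrap()];
    println!("{:?}", check_macro(&declared, "foo", 2)); // accepted
    println!("{:?}", check_macro(&declared, "foo", 1)); // rejected: wrong count
    println!("{:?}", check_macro(&declared, "bar", 0)); // rejected: not declared
}
```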
This section describes the string-to-string transformations that can be used in various parts of the preamble. Wherever <transformations> appears above, you can insert any number of these transformations; they will be applied in order of appearance.
All transformations are Unicode-aware.
This transformation is only valid within an $ipa directive and has the following syntax:
lemma {
<transformations>
}
Its function is to specify transformations that are only applied if we’re converting a lemma in the dictionary to IPA. They are not applied whenever standalone IPA conversion is requested, e.g. via the generator’s --convert flag, by calling Generator::to_ipa(), or by using the Typst library’s to_ipa() function.
This can be useful if you want to e.g. drop everything after a comma when converting a lemma to IPA, but you also want to be able to pass an entire text (containing commas) to the generator to convert it to IPA without it stripping everything after the first comma.
lemma directives cannot be nested as that would not be useful.
This transformation has the following syntax:
lower
It converts the input string to lowercase.
These transformations have the following syntax:
nfc
Or instead of nfc, you can write nfd, nfkc, or nfkd. These transformations convert the input into the corresponding Unicode normalisation form.
This transformation is only valid within the $preprocess directive and has the following syntax:
m/<regex>/ "message"
!m/<regex>/ "message" # negated version
This is not really a transformation; rather it matches the input against the regular expression <regex>; if it doesn’t match, an error containing "message" is printed. The m may be preceded by !, which negates the match, i.e. the error is emitted if the input does match.
This can be used to verify that the input is well-formed. For example, given
$preprocess {
etym { !m/b/ "We don’t allow 'b's here!" }
}
the generator will print the error "We don’t allow 'b's here!" if the etymology field of a full entry contains a b character.
The syntax used is that of Rust’s fancy-regex crate, which essentially supports Oniguruma regular expressions. If you’re not familiar with regexes, I suggest starting with the Wikipedia page.
This transformation has the following syntax:
remove_punct
It deletes ‘punctuation’ from the input (including diacritics); specifically, this removes any characters in any of the following Unicode categories: Mc, Me, Mn, Pc, Pd, Pe, Pf, Pi, Po, Ps.
This transformation has the following syntax:
s/<regex>/<replacement>/
Instead of /, you can also use | or %, but you can’t mix them, e.g. s|a|b| is valid, but s/a/b| is not.
This transformation replaces every occurrence of the regular expression <regex> with <replacement>. The syntax used is that of Rust’s fancy-regex crate, which essentially supports Oniguruma regular expressions. If you’re not familiar with regexes, I suggest starting with the Wikipedia page.
This transformation is quite a bit more complicated and has the following syntax:
trie (<norm>) {
<pattern> => <replacement>
<pattern> => <replacement>
# ... more patterns
}
As for the individual components:
- <norm> is a Unicode normalisation form, i.e. either nfc, nfd, nfkc, or nfkd. You can also omit this entirely, in which case no normalisation is performed.
- <replacement> is a sequence of characters; if it is a single *, then the pattern is deleted.
- <pattern> is not a regular expression, even though its syntax is similar to one: it is either
  - a set of characters enclosed in square brackets, e.g. [abc]; sets may contain whitespace, e.g. [ ]; if you want to include \ or ] in a set, write \\ or \] instead, e.g. [\]]; or
  - a sequence of characters, e.g. abc; you can also combine multiple sequences in one line using |, e.g. abc|def => x is equivalent to writing both abc => x and def => x.

  You cannot combine a sequence and a set at the moment; this may be supported in the future.
In both the <replacement> and <pattern>, you can specify any Unicode character by writing \x{<code>} where <code> is a hexadecimal Unicode code point, e.g. \x{00DE} would be Þ.
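To illustrate the \x{<code>} escape, here is a minimal sketch of decoding the hexadecimal code point into a character. The function name decode_codepoint is an assumption made for this example; it only handles the part between the braces and says nothing about how the generator itself parses the escape.

```rust
/// Decode the hexadecimal code point found inside `\x{...}` into a character.
/// Returns `None` if the text is not valid hexadecimal or does not denote a
/// Unicode scalar value (e.g. a surrogate). (Hypothetical helper.)
fn decode_codepoint(hex: &str) -> Option<char> {
    let value = u32::from_str_radix(hex, 16).ok()?;
    char::from_u32(value)
}

fn main() {
    println!("{:?}", decode_codepoint("00DE")); // Some('Þ')
    println!("{:?}", decode_codepoint("D800")); // None: surrogate code point
}
```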
This clause performs text replacement: First, <norm> is applied to both the input text and all <pattern>s. Then, each <pattern> in the input text is replaced with its <replacement>. All patterns are matched and replaced simultaneously in a single pass:
- If <pattern> is a set, e.g. [abc], each individual character in the set is replaced; e.g. if we have [abc] => foo, every occurrence of a, b, or c in the input is replaced with foo.
- If <pattern> is a sequence, e.g. abc, that entire sequence is replaced. If multiple sequences would match, the longest one is replaced, e.g. if you specify ab => x and abc => y, any abc sequence in the input will be replaced with y.
- Because this happens in a single pass, there is no iterative replacement, e.g. if you write a => b and b => c, then an input of ab would become bc, not cc.
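The single-pass, longest-match behaviour described above can be illustrated with a small sketch. It deliberately uses a linear scan over a list of rules rather than an actual trie, skips normalisation, and expects sets to be expanded into one rule per character by the caller (with * written as an empty replacement); replace_longest is a hypothetical name, not the generator's API.

```rust
/// Replace every occurrence of a pattern with its replacement in one pass,
/// preferring the longest pattern that matches at each position. Output is
/// never rescanned, so replacements cannot trigger further replacements.
fn replace_longest(input: &str, rules: &[(&str, &str)]) -> String {
    let mut out = String::new();
    let mut rest = input;
    while let Some(c) = rest.chars().next() {
        // Longest pattern that is a prefix of the remaining input, if any.
        let best = rules
            .iter()
            .filter(|(pat, _)| !pat.is_empty() && rest.starts_with(*pat))
            .max_by_key(|(pat, _)| pat.len());
        match best {
            Some((pat, rep)) => {
                out.push_str(rep);
                rest = &rest[pat.len()..];
            }
            None => {
                // No rule applies here: copy one character and move on.
                out.push(c);
                rest = &rest[c.len_utf8()..];
            }
        }
    }
    out
}

fn main() {
    // No iterative replacement: "ab" becomes "bc", not "cc".
    println!("{}", replace_longest("ab", &[("a", "b"), ("b", "c")]));
    // The longest match wins: "abc" becomes "y".
    println!("{}", replace_longest("abc", &[("ab", "x"), ("abc", "y")]));
}
```

A real trie makes the prefix lookup faster when there are many patterns, but the observable result should be the same as this scan.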
There are 2 types of entries: full entries and reference entries. By default, each non-empty line in a dictionary file (that is not part of a directive) starts a new entry. To break an entry across multiple lines, indent the continuation lines with at least one space or tab.
Line-breaking does not affect how the text is displayed. Moreover, any sequence of whitespace characters is replaced with a single space, and whitespace at the start and end of a field in a full entry or word in a reference entry is removed.
As in the preamble, anything after the first # in a line is ignored. Empty lines are ignored as well; in particular, you can have an empty line between two continuation lines without starting a new entry.
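The line-handling rules above can be sketched as follows. This is a simplified illustration with a made-up function name (group_entries): it ignores directives, does not treat \# as an escaped #, and collapses whitespace over the whole entry rather than per field.

```rust
/// Group the lines of a dictionary body into entries: a non-indented line
/// starts a new entry, an indented line continues the previous one.
/// (Hypothetical helper; `\#` escapes and directives are not handled.)
fn group_entries(source: &str) -> Vec<String> {
    let mut entries: Vec<String> = Vec::new();
    for line in source.lines() {
        // Ignore everything after the first '#'.
        let line = line.split('#').next().unwrap_or("");
        if line.trim().is_empty() {
            continue; // empty lines never terminate an entry
        }
        if line.starts_with(' ') || line.starts_with('\t') {
            if let Some(last) = entries.last_mut() {
                last.push(' ');
                last.push_str(line.trim());
                continue;
            }
        }
        entries.push(line.trim().to_string());
    }
    // Replace every run of whitespace with a single space.
    entries
        .iter()
        .map(|e| e.split_whitespace().collect::<Vec<_>>().join(" "))
        .collect()
}

fn main() {
    let source = "foo|n.|bar|A thing # a comment\n    that continues\n\n    and more\nbaz > foo\n";
    for entry in group_entries(source) {
        println!("{}", entry);
    }
}
```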
A full entry consists of multiple fields, separated by |. Thus, fields cannot contain a | character. The fields are, in order:
- The lemma.
- The part of speech and other initial annotations.
- The etymology of the word.
- The definition of the word. This may consist of a primary sense and several additional senses (see below).
- (optional) Grammatical forms of the word.
- (optional) The pronunciation of the word in IPA. Only provide this if it is irregular; if the IPA can be derived from the spelling, prefer to use an $ipa directive instead. If this field is provided, it overrides the $ipa directive for this entry.
Any field other than the lemma and definition may be empty. Thus, e.g. a|||b would be a valid entry (unless a particular dictionary chooses to disallow this). By contrast e.g. a||b is not a valid entry since the definition is missing, and neither is a||b|.
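The field rules above can be summarised in a short sketch. The FullEntry struct and parse_full_entry function are hypothetical and only mirror the rules stated here; in particular, rejecting more than 6 fields is an assumption, and the contents of the fields are not parsed further.

```rust
/// The raw fields of a full entry. (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
struct FullEntry {
    lemma: String,
    pos: String,
    etym: String,
    def: String,
    forms: Option<String>,
    ipa: Option<String>,
}

/// Split an entry on '|' and check that lemma and definition are present.
fn parse_full_entry(line: &str) -> Result<FullEntry, String> {
    let fields: Vec<&str> = line.split('|').map(str::trim).collect();
    // Assumption: anything other than 4 to 6 fields is rejected.
    if fields.len() < 4 || fields.len() > 6 {
        return Err(format!("expected 4 to 6 fields, found {}", fields.len()));
    }
    if fields[0].is_empty() {
        return Err("the lemma must not be empty".to_string());
    }
    if fields[3].is_empty() {
        return Err("the definition must not be empty".to_string());
    }
    Ok(FullEntry {
        lemma: fields[0].to_string(),
        pos: fields[1].to_string(),
        etym: fields[2].to_string(),
        def: fields[3].to_string(),
        forms: fields.get(4).map(|s| s.to_string()),
        ipa: fields.get(5).map(|s| s.to_string()),
    })
}

fn main() {
    println!("{:?}", parse_full_entry("a|||b")); // valid
    println!("{:?}", parse_full_entry("a||b")); // invalid: definition missing
    println!("{:?}", parse_full_entry("a||b|")); // invalid: definition empty
}
```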
The definition field is the most complex: It may contain multiple senses; use \\ to start a new sense. The primary definition is everything before the first sense. A sense, as well as the primary definition, may have a single \comment as well as any number of examples that start with \ex; each example may also have a single \comment:
\\ sense 1
    \comment foo
    \ex example 1
        \comment comment for example 1
    \ex example 2
        \comment comment for example 2
The indentation depth and line breaks here are optional and only for visual clarity, i.e. this would be equivalent:
\\sense 1\comment foo\ex example 1\comment comment for example 1\ex example 2\comment comment for example 2
The generator will insert a full stop . at the end of every sense (including the primary definition), comment, and example, provided that it doesn’t already end with a punctuation mark. Thus, writing \comment foo. is equivalent to writing \comment foo (without the full stop).
NOTE: \\, \comment, and \ex are allowed in this field only. Crucially, they do not take an argument, i.e. write \comment This is a comment rather than \comment{This is a comment}.
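The full-stop rule for senses, comments, and examples might be sketched like this. Which characters count as a ‘punctuation mark’ is not specified here, so the set used below (., !, ?, …) is purely an assumption, as is the function name ensure_full_stop.

```rust
/// Append a full stop unless the text already ends with punctuation.
/// Assumption: only '.', '!', '?', and '…' count as punctuation here; the
/// generator may use a different set.
fn ensure_full_stop(text: &str) -> String {
    let trimmed = text.trim_end();
    match trimmed.chars().last() {
        Some(c) if matches!(c, '.' | '!' | '?' | '…') => trimmed.to_string(),
        Some(_) => format!("{}.", trimmed),
        None => String::new(), // nothing to terminate
    }
}

fn main() {
    println!("{}", ensure_full_stop("foo")); // foo.
    println!("{}", ensure_full_stop("foo.")); // foo.
}
```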
Macros are used to add markup to an entry. For historical reasons, the syntax of macros is heavily inspired by that of LaTeX macros: a macro name consists of a backslash followed immediately by either one or more letters (e.g. \Sup) or a single special character (e.g. \$). Macros are case-sensitive. A macro may also take arguments that follow the macro name and are wrapped in braces {}. E.g. the \s macro takes a single argument (e.g. \s{acc}) and formats that argument in small-caps. Braces anywhere else in the input must be balanced but are otherwise ignored.
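As an illustration of the macro-name rule, the following sketch extracts the name at the start of a string. It assumes that ‘letters’ means anything char::is_alphabetic accepts, which may be broader than what the generator allows; macro_name is a hypothetical helper and does not parse arguments.

```rust
/// Return the macro name at the start of `input`, without the backslash:
/// either a run of letters or exactly one non-letter character.
/// (Hypothetical helper; argument parsing is out of scope.)
fn macro_name(input: &str) -> Option<&str> {
    let rest = input.strip_prefix('\\')?;
    let first = rest.chars().next()?;
    if first.is_alphabetic() {
        // One or more letters: the name ends at the first non-letter.
        let end = rest.find(|c: char| !c.is_alphabetic()).unwrap_or(rest.len());
        Some(&rest[..end])
    } else {
        // A single special character.
        Some(&rest[..first.len_utf8()])
    }
}

fn main() {
    println!("{:?}", macro_name("\\Sup{2}")); // Some("Sup")
    println!("{:?}", macro_name("\\$")); // Some("$")
    println!("{:?}", macro_name("no macro here")); // None
}
```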
The following is an exhaustive list of builtin macros:
- \-: a soft hyphen; this indicates to the typesetting engine that it may break the word here.
- \ (a backslash followed by a space): a literal space; by default, multiple spaces in a row are collapsed to a single space.
- \&: the character &.
- \$: the character $. Without the \, a $ instead starts math mode.
- \%: the character %. You can also just write % on its own. It used to have special meaning but doesn’t anymore.
- \#: the character #. Without the \, a # instead causes the rest of the line to be ignored.
- \{: the character {. On its own, a { either starts a macro argument or is ignored when not preceded by a macro or another argument.
- \}: the character }. On its own, this either ends a macro argument or is ignored.
- \\: start a new sense; only valid within the definition field; see ‘Definition’ above.
- \b{arg}: typeset arg in bold.
- \i{arg}: typeset arg in italics.
- \comment: start a comment; only valid within the definition field; see ‘Definition’ above.
- \ex: start an example; only valid within the definition field; see ‘Definition’ above.
- \ldots: insert an ellipsis: ….
- \par: insert a paragraph break; only use this if an entry is really long.
- \ref{label}: insert a reference to label; e.g. to reference a Typst label <foo>, write \ref{foo}.
- \s{arg}: typeset arg in small-caps.
- \senseref{n}: reference the n-th sense of this entry; prefer e.g. ‘see \senseref{1}’ over writing ‘see sense 1’.
- \Sub{arg}: typeset arg in subscript.
- \Sup{arg}: typeset arg in superscript.
- \textnf{arg}: typeset arg as ‘normal’ text, i.e. not bold/italic/small-caps.
- \this: insert the lemma of this entry.
- \w{arg}: typeset arg in the same way as a lemma; use this to refer to other lemmas or words in an entry.
- \x{codepoint}: insert the Unicode character with value codepoint, e.g. \x{00DE} inserts a Þ.
Macros can generally be nested, but whether this makes sense depends on the macro, e.g. \b{\i{arg}} typesets arg in bold italic, whereas \b{\textnf{arg}} is a bit pointless, as the \textnf just undoes the effect of the \b, and \x{\b{1234}} is simply an error as \b{1234} is not a valid Unicode codepoint.
An unknown macro causes an error unless you $declared it in the preamble (see ‘$declare Directive’ above).
An entry with the fields highlighted:
dír|v. tr.|dire|+\s{acc} To say, tell (+\s{dat} someone) |\s{fut} dírẹ́, \s{subj} díss
^^^ ^^^^^^ ^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 1    2     3                       4                                  5
An entry with 3 senses and no primary definition, as well as a forms field at the very end:
syl|v.|seul|
    \\ To be the only one.
    \\ \i{(of multiple things)} To be one, a united whole.
    \\ To be lone, alone|\s{fut} syle, \s{subj} syls
This entry has the primary definition of ‘fish’, a comment on the primary definition, as well as two examples, each with their own comment:
ráhó|n.|poisson|Fish
    \comment UF has a series of proverbs around fish drowning (in water!),
        despite the fact that fish quite literally breathe water and therefore
        are incapable of ‘drowning’.
    \ex \w{Láráhó slẹlúrá.} Now you’ve done it.
        \comment Literally ‘the fish was too bulky [to swim to the surface,
            so it drowned]’
    \ex \w{Áhaúr’sý’ýâ láráhó sráy’éá.} There is more to this. There is a
        story behind this.
        \comment Literally ‘the fish hasn’t drowned yet’.
A reference entry consists of a number of comma-separated words, followed by > and another word; e.g.
foo, bar > baz
creates one entry for foo and bar each, both of which point to baz. This is useful for pointing irregular forms towards the base entry that describes them, e.g. in an English dictionary, one might write:
are, is, was, were, been > be
The character | cannot be used in a reference entry.
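A reference entry can be taken apart with a few string operations. The sketch below uses a hypothetical parse_reference_entry function that returns one (word, target) pair per referring word; the error messages and the handling of empty words are assumptions.

```rust
/// Split a reference entry such as `foo, bar > baz` into (word, target)
/// pairs, one per word before the `>`. (Hypothetical helper.)
fn parse_reference_entry(line: &str) -> Result<Vec<(String, String)>, String> {
    if line.contains('|') {
        return Err("'|' cannot be used in a reference entry".to_string());
    }
    let (words, target) = line
        .split_once('>')
        .ok_or_else(|| "missing '>' in reference entry".to_string())?;
    let target = target.trim();
    if target.is_empty() {
        return Err("missing target after '>'".to_string());
    }
    let pairs: Vec<(String, String)> = words
        .split(',')
        .map(str::trim)
        .filter(|w| !w.is_empty())
        .map(|w| (w.to_string(), target.to_string()))
        .collect();
    if pairs.is_empty() {
        return Err("no words before '>'".to_string());
    }
    Ok(pairs)
}

fn main() {
    // One entry each for "foo" and "bar", both pointing to "baz".
    println!("{:?}", parse_reference_entry("foo, bar > baz"));
}
```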