UDTube is a neural morphological analyzer based on PyTorch, Lightning, and Hugging Face transformers.
Named in homage to the venerable UDPipe, UDTube is focused on incremental inference, allowing it to be used to label large text collections.
The UDTube model consists of a pre-trained (and possibly fine-tuned) transformer encoder which feeds into a classifier layer with as many as four heads handling the different morphological tasks.
Lightning is used to generate the training, validation, inference, and evaluation loops. The LightningCLI interface is used to provide a user interface and manage configuration.
Below, we use YAML to specify configuration options, and we strongly recommend users do the same. However, most configuration options can also be specified using POSIX-style command-line flags.
To install UDTube and its dependencies, run the following command:
```
pip install .
```
UDTube uses YAML configuration files; see the provided sample configurations for examples.
UDTube operates on CoNLL-U files. This is a 10-column tab-separated format with a blank line between sentences and `#` used for comments. In all cases, the `ID` and `FORM` fields must be fully populated; the blank tag `_` can be used for unknown fields. Many of our experiments are performed using CoNLL-U data from the Universal Dependencies project.
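For reference, a hypothetical CoNLL-U sentence with only the required `ID` and `FORM` columns populated (columns are tab-separated; the remaining eight columns are the blank tag `_`) might look like this:

```
# text = Dogs bark.
1	Dogs	_	_	_	_	_	_	_	_
2	bark	_	_	_	_	_	_	_	_
3	.	_	_	_	_	_	_	_	_
```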
UDTube can perform up to four morphological tasks simultaneously:
- Lemmatization is performed using the `LEMMA` field and edit scripts.
- Universal part-of-speech tagging is performed using the `UPOS` field: enable with `data: use_upos: true`.
- Language-specific part-of-speech tagging is performed using the `XPOS` field: enable with `data: use_xpos: true`.
- Morphological feature tagging is performed using the `FEATS` field: enable with `data: use_feats: true`.
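As a sketch, these `data:` flags can be combined in a single configuration; the following hypothetical snippet enables universal POS and feature tagging but disables language-specific tagging:

```
...
data:
  use_upos: true
  use_xpos: false
  use_feats: true
...
```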
The following caveats apply:
- Note that many newer Universal Dependencies datasets do not have language-specific part-of-speech tags.
- The `FEATS` field is treated as a single unit and is not segmented in any way.
- One can convert from Universal Dependencies morphological features to UniMorph features using `scripts/convert_to_um.py`.
- UDTube does not perform dependency parsing at present, so the `HEAD`, `DEPREL`, and `DEPS` fields are ignored and should be specified as `_`.
The `udtube` command-line tool uses a subcommand interface with the following four modes. To see the full set of options available with each subcommand, use the `--print_config` flag. For example:

```
udtube fit --print_config
```

will show all configuration options (and their default values) for the `fit` subcommand.
In `fit` mode, one trains a UDTube model from scratch. Naturally, most configuration options need to be set at training time; for example, it is not possible to switch between different pre-trained encoders or enable new tasks after training.
This mode is invoked using the `fit` subcommand, like so:

```
udtube fit --config path/to/config.yaml
```
Setting the `seed_everything:` argument to a fixed value ensures a reproducible experiment.
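For instance, at the top level of the configuration file:

```
seed_everything: 42
```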
The encoder layer consists of a pre-trained BERT-style transformer model. By default, UDTube uses multilingual cased BERT (`model: encoder: google-bert/bert-base-multilingual-cased`). In theory, UDTube can use any Hugging Face pre-trained encoder so long as it provides an `AutoTokenizer` and has been exposed to the target language. We list all the Hugging Face encoders we have tested thus far, and warn users when they select an untested encoder. Since there is no standard name for the between-layer dropout probability parameter, in some cases it is also necessary to specify what this argument is called for a given model. We welcome pull requests from users who successfully make use of encoders not listed here.
So-called "tokenizer-free" pre-trained encoders like ByT5 are not currently
supported as they lack an AutoTokenizer.
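As a sketch, selecting a different encoder is simply a matter of overriding the default; here we assume the Hugging Face model `FacebookAI/xlm-roberta-base` (check whether your chosen encoder is on the tested list before relying on it):

```
...
model:
  encoder: FacebookAI/xlm-roberta-base
...
```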
The classifier layer contains up to four sequential linear heads for the four tasks described above. By default all four are enabled.
UDTube uses separate optimizers and LR schedulers for the encoder and classifier. The intuition behind this is that we may wish to make slow, small changes (or possibly, no changes at all) to the pre-trained encoder, whereas we wish to make more rapid and larger changes to the classifier.
The following YAML snippet shows a simple configuration that encapsulates this principle. It uses the Adam optimizer for both encoder and classifier, but uses a lower learning rate for the encoder with a linear warm-up and a higher learning rate for the classifier.
```
...
model:
  encoder_optimizer:
    class_path: torch.optim.Adam
    init_args:
      lr: 1e-5
  encoder_scheduler:
    class_path: udtube.schedulers.WarmupInverseSquareRoot
    init_args:
      warmup_epochs: 5
  classifier_optimizer:
    class_path: torch.optim.Adam
    init_args:
      lr: 1e-3
  classifier_scheduler:
    class_path: lightning.pytorch.cli.ReduceLROnPlateau
    init_args:
      monitor: val_loss
      factor: 0.1
...
```
The default scheduler is `udtube.schedulers.Dummy`, which keeps the learning rate fixed at its initial value.
A checkpointing callback must be specified in the configuration or no checkpoints will be generated; see the Lightning documentation for more information.
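A minimal sketch using Lightning's standard `ModelCheckpoint` callback (the monitored metric and other settings here are illustrative assumptions):

```
...
trainer:
  callbacks:
  - class_path: lightning.pytorch.callbacks.ModelCheckpoint
    init_args:
      monitor: val_loss
      mode: min
      save_top_k: 1
...
```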
By default, UDTube attempts to model all four tasks; one can disable the language-specific tagging task using `model: use_xpos: false`, and so on.

Dropout probability is specified using `model: dropout: ...`.

The encoder has multiple layers. The input to the classifier consists of just the last few layers, mean-pooled together. The number of layers used for mean-pooling is specified using `model: pooling_layers: ...`.

By default, lemmatization uses reverse-edit scripts. This is appropriate for predominantly suffixal languages, which are thought to represent the majority of the world's languages. If working with a predominantly prefixal language, disable this with `model: reverse_edits: false`.
The following YAML snippet shows the default architectural arguments.
```
...
model:
  dropout: 0.5
  encoder: google-bert/bert-base-multilingual-cased
  pooling_layers: 4
  reverse_edits: true
  use_upos: true
  use_xpos: true
  use_lemma: true
  use_feats: true
...
```
Batch size is specified using `data: batch_size: ...` and defaults to 32.
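For example, a hypothetical configuration doubling the default batch size:

```
...
data:
  batch_size: 64
...
```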
There are a number of ways to specify how long a model should train for. For example, the following YAML snippet specifies that training should run for 100 epochs or 6 wall-clock hours, whichever comes first.
```
...
trainer:
  max_epochs: 100
  max_time: 00:06:00:00
...
```
In validation mode, one runs the validation step over labeled validation data (specified as `data: val: path/to/validation.conllu`) using a previously trained checkpoint (`--ckpt_path path/to/checkpoint.ckpt` from the command line), recording total loss and per-task accuracies. In practice this is mostly useful for debugging.
This mode is invoked using the `validate` subcommand, like so:

```
udtube validate --config path/to/config.yaml --ckpt_path path/to/checkpoint.ckpt
```
In test mode, we compute accuracy over held-out test data (specified as `data: test: path/to/test.conllu`) using a previously trained checkpoint (`--ckpt_path path/to/checkpoint.ckpt` from the command line); it differs from validation mode in that it uses the test file rather than the validation file and does not compute loss.

This mode is invoked using the `test` subcommand, like so:

```
udtube test --config path/to/config.yaml --ckpt_path path/to/checkpoint.ckpt
```
In `predict` mode, a previously trained model checkpoint (`--ckpt_path path/to/checkpoint.ckpt` from the command line) is used to label a CoNLL-U file. One must also specify the path where the predictions will be written:

```
...
prediction:
  path: /Users/Shinji/predictions.conllu
...
```
Here are some additional details:

- In `predict` mode UDTube loads the file to be labeled incrementally (i.e., one sentence at a time), so it can be used with very large files.
- In `predict` mode, if no path for the predictions is specified, stdout will be used. If using this in conjunction with `>` or `|`, add `--trainer.enable_progress_bar false` on the command line, as shown below.
- The target task fields are overridden if their heads are active.
- Use `scripts/pretokenize.py` to convert raw text files to CoNLL-U input files.
This mode is invoked using the `predict` subcommand, like so:

```
udtube predict --config path/to/config.yaml --ckpt_path path/to/checkpoint.ckpt
```
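For example, to stream predictions to a file via stdout (assuming no `prediction: path:` is set in the configuration), disable the progress bar as noted above:

```
udtube predict --config path/to/config.yaml \
    --ckpt_path path/to/checkpoint.ckpt \
    --trainer.enable_progress_bar false > predictions.conllu
```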
See the examples for some worked examples, including hyperparameter sweeping with Weights & Biases.

See `scripts/README.md` for details on provided scripts not mentioned above.
UDTube is distributed under an Apache 2.0 license.
We welcome contributions using the fork-and-pull model.
A large number of tests are provided. To run them all, run the following:

```
pytest -vvv tests
```
Tests in `tests/udtube_test.py` are heavyweight integration tests that exceed the resources of our current continuous-integration framework; one is therefore encouraged to run them locally before submitting a PR.
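For example, to run just these integration tests locally:

```
pytest -vvv tests/udtube_test.py
```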
See the pytest
documentation for more
information on the test runner.
If you use UDTube in your research, we would appreciate it if you cited the following document, which describes the model:
Yakubov, D. 2024. *How do we learn what we cannot say?* Master's thesis, CUNY Graduate Center.