Simple implementation of Japanese-to-English neural machine translation models. The initial code is heavily adapted from Assignment 3 ("Neural Machine Translation") of Stanford's CS224N course on Natural Language Processing with Deep Learning. I'm training these models locally on an RTX 5070 GPU.
For the Japanese-to-English model, our data comes from the Japanese-English Subtitle Corpus (JESC). The corpus contains 2.8 million training sentence pairs, plus 2000 dev and 2000 test sentences for evaluation. The corpus ships the English and Japanese sentences in separate files, so no further processing is required on our end after download.
For pre-training, we are using data from the JParaCrawl dataset, the largest publicly available English-Japanese parallel corpus created by NTT.
For tokenization, I opted for Google's SentencePiece (as in Assignment 3), which treats each sentence as a raw sequence of Unicode characters with no language-dependent logic. It is applied to both the Japanese and English corpora to split sentences into subword tokens.
The weights and parameters of the best model (TransformerA) have been uploaded to HuggingFace (https://huggingface.co/ngbolin/Jap2Eng).
To evaluate our model, we will use BLEURT, a Transfer Learning-Based Metric for Natural Language Generation. BLEURT takes a sentence pair (reference and candidate) and returns a score that indicates the extent to which the candidate is fluent and conveys the meaning of the reference. We will use the recommended checkpoint of BLEURT-20, which returns a score between 0 and 1 where 0 indicates a random output and 1 a perfect one. Following Google Research's recommendation, we will average the BLEURT scores across the sentences in the corpus.
We are using a series of different models to show how performance varies with architecture. For the first model, we utilised a 2-layer Bidirectional LSTM with Attention, similar to Luong, Pham and Manning (2015). I've selected a batch size of 32 for the LSTM, along with a word embedding size of 300 (similar to Word2Vec). The number of hidden units for the LSTM layers is set at 256.
- batch_size: 32
- word_embeddings: 300 (per word, per language)
- hidden_units: 256 (for the LSTM layers)
- dropout_rate: 0.2
- learning_rate: 5e-4
- num_trials: 5
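The attention step of this model can be illustrated with the "dot" score from Luong et al. (2015): align the current decoder state against every encoder state, softmax the scores, and take the weighted sum as the context vector. A minimal NumPy sketch (toy hidden size of 4 rather than the 256 used above):

```python
import numpy as np

def luong_dot_attention(dec_state, enc_states):
    """Global attention with the 'dot' score from Luong et al. (2015).

    dec_state:  (hidden,)         current decoder hidden state
    enc_states: (src_len, hidden) encoder hidden states
    Returns the context vector and the attention weights.
    """
    scores = enc_states @ dec_state          # (src_len,) alignment scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    context = weights @ enc_states           # (hidden,) weighted sum
    return context, weights

rng = np.random.default_rng(0)
ctx, w = luong_dot_attention(rng.normal(size=4), rng.normal(size=(5, 4)))
print(ctx.shape)  # (4,)
```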
For our benchmark Transformer model, we follow Vaswani et al. (2017). Due to GPU memory constraints, I've elected a smaller batch size (32 instead of the usual 64) and kept the number of encoder and decoder layers at 6. In this model, layer normalisation is applied before each sub-layer, prior to the residual connection, i.e. "Norm and Add" (pre-norm) instead of the original "Add and Norm" (post-norm). Each decoder block contains a self-attention mechanism, encoder-decoder cross-attention, a feed-forward network of size 2048, and a Norm-and-Add step around each sub-layer. While our word embedding size is relatively small (at 512), we still scale the attention logits by the square root of the key dimension, as in Scaled Dot-Product Attention.
- batch_size: 32
- word_embeddings: 512
- dim_feedforward: 2048
- nhead: 8
- num_encoder_layer: 6
- num_decoder_layer: 6
- dropout_rate: 0.1
- learning_rate: 3e-4
- num_trials: 5
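The "Norm and Add" ordering used above can be sketched as a pre-norm residual wrapper: normalize the input, apply the sub-layer (attention or feed-forward), then add the residual. This is a NumPy illustration with toy dimensions, not the model code itself:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (learned gain/bias omitted for brevity).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def prenorm_sublayer(x, sublayer):
    """'Norm and Add': normalize first, apply the sub-layer, then the residual.
    The post-norm original ('Add and Norm') would instead compute
    layer_norm(x + sublayer(x))."""
    return x + sublayer(layer_norm(x))

# Toy feed-forward sub-layer (stand-in for the dim_feedforward=2048 FFN).
W1 = np.ones((8, 16))
W2 = np.ones((16, 8))
ffn = lambda h: np.maximum(h @ W1, 0.0) @ W2  # ReLU between two projections

x = np.random.default_rng(1).normal(size=(3, 8))
out = prenorm_sublayer(x, ffn)
print(out.shape)  # (3, 8)
```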
For our second model, we pre-train the Transformer on a separate Japanese-to-English dataset: JParaCrawl, described above. The preprocess.py script extracts the first 2 million sentence pairs and allocates 99.9% of them to training (and 0.1% to dev). Pre-training on a dataset different from JESC should, in theory, allow our Transformer to generalise better and faster, and setting num_trials to 1 prevents the pre-trained Transformer from over-fitting to the pre-training dataset. Apart from the pre-training step, the hyperparameters for TransformerA and TransformerB are largely similar.
Pre-training:
- batch_size: 16
- word_embeddings: 512
- dim_feedforward: 2048
- nhead: 6
- num_encoder_layer: 6
- num_decoder_layer: 6
- dropout_rate: 0.1
- learning_rate: 3e-4
- num_trials: 1
Fine-tuning:
- batch_size: 32
- word_embeddings: 512
- dim_feedforward: 2048
- nhead: 6
- num_encoder_layer: 6
- num_decoder_layer: 6
- dropout_rate: 0.1
- learning_rate: 3e-4
- num_trials: 5
The steps for all models are largely similar, apart from the arguments listed. For the Bidirectional LSTM with Attention, the hidden_units argument sets the number of hidden units per LSTM layer; for the Transformer models, the nheads argument is exposed instead, owing to the use of Multi-Head Attention.
This generates the following files: (1) vocab.json, containing the word2idx and idx2word dictionaries; (2) src.vocab and tgt.vocab, which function as lookup tables for the translation model to map between tokens and ids; and (3) src.model and tgt.model, the tokenizer models that split Japanese and English text.
python vocab.py --train-src=../data/jpn-eng/JESC/train.ja --train-tgt=../data/jpn-eng/JESC/train.en vocab.json
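The word2idx/idx2word structure of vocab.json can be illustrated as below. The exact keys in the real file follow the CS224N assignment's Vocab class, so the key names here (`src_word2id`, `tgt_word2id`) and the toy entries are illustrative assumptions:

```python
import json

# Illustrative shape of vocab.json: word-to-id dictionaries for the source
# (Japanese) and target (English) sides; the id-to-word direction is just
# the inverted mapping. Key names are assumed, not taken from the real file.
vocab = {
    "src_word2id": {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "猫": 4},
    "tgt_word2id": {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "cat": 4},
}
src_id2word = {i: w for w, i in vocab["src_word2id"].items()}

# Round-trip through JSON, as the model does when loading the vocab file.
decoded = json.loads(json.dumps(vocab, ensure_ascii=False))
print(src_id2word[4], decoded["tgt_word2id"]["cat"])  # 猫 4
```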
This trains the models (in [nmt_model.py]), using the parameters (e.g. embedding size, nheads, dropout rate, batch size) listed in [run.sh]. Where parameters are not explicitly exposed, refer to the code in [nmt_model.py] and adjust accordingly.
sh run.sh train
Decodes the test inputs into candidate translations and evaluates them against the reference outputs using BLEU.
sh run.sh test
For pre-training, we will use data from JParaCrawl, the largest publicly available English-Japanese parallel corpus, created by NTT. The specific version we are using is V2.0, which contains 10.0 million sentence pairs. We have created a script, preprocess.py, that processes the corpus and creates training and dev datasets for each language.
python preprocess.py
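The split preprocess.py performs (first 2 million pairs, 99.9% train / 0.1% dev) can be sketched as below; this is a minimal stand-in run on toy data, and the helper name `split_pairs` is illustrative rather than the script's actual function:

```python
def split_pairs(pairs, n_keep=2_000_000, dev_frac=0.001):
    """Keep the first n_keep sentence pairs and carve off dev_frac for dev.

    Mirrors the 99.9% / 0.1% train/dev allocation described in the text;
    the real preprocess.py also reads and writes the corpus files.
    """
    kept = pairs[:n_keep]
    n_dev = int(len(kept) * dev_frac)
    if n_dev == 0:
        return kept, []
    return kept[:-n_dev], kept[-n_dev:]

# Toy corpus of 10,000 (Japanese, English) pairs.
pairs = [(f"ja_{i}", f"en_{i}") for i in range(10_000)]
train, dev = split_pairs(pairs)
print(len(train), len(dev))  # 9990 10
```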
This generates the following files: (1) vocab.json, containing the word2idx and idx2word dictionaries; (2) src.vocab and tgt.vocab, which function as lookup tables for the translation model to map between tokens and ids; and (3) src.model and tgt.model, the tokenizer models that split Japanese and English text.
python vocab.py --train-src=../data/jpn-eng/JESC/train.ja --train-tgt=../data/jpn-eng/JESC/train.en vocab.json
After data pre-processing, we proceed to pre-train the model. This can be done by running:
sh run.sh pretrain
This trains the models (in [nmt_model.py]), using the parameters (e.g. embedding size, nheads, dropout rate, batch size) listed in [run.sh]. Where parameters are not explicitly exposed, refer to the code in [nmt_model.py] and adjust accordingly.
sh run.sh train
Decodes the test inputs into candidate translations and evaluates them against the reference outputs using BLEU.
sh run.sh test
Using our Bidirectional LSTM, we obtained a BLEURT score of 0.414 on the holdout dataset. Our Transformer B(asic) achieved 0.477, while our Transformer A(dvanced) achieved 0.480.
In our simple example, we observe that Transformers do much better than the Bidirectional LSTM with Attention, with both Transformer models improving BLEURT by more than 15%. Pre-training, however, does not seem to improve the score much: it yields only a further 0.6% improvement.
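The relative improvements quoted above follow directly from the reported scores; spelling the arithmetic out:

```python
# BLEURT scores reported above.
lstm = 0.414
transformer_basic = 0.477     # TransformerB (no pre-training)
transformer_advanced = 0.480  # TransformerA (pre-trained)

# Relative improvement over the BiLSTM baseline, in percent.
print(round(100 * (transformer_basic - lstm) / lstm, 1))      # 15.2
print(round(100 * (transformer_advanced - lstm) / lstm, 1))   # 15.9

# Additional gain from pre-training alone, in percent.
print(round(100 * (transformer_advanced - transformer_basic) / transformer_basic, 1))  # 0.6
```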