Simple implementation of Japanese-to-English neural machine translation models. The initial code is heavily adapted from Assignment 3 ("Neural Machine Translation") of Stanford's CS224N course on Natural Language Processing with Deep Learning. I'm training these models locally on an RTX 5070 GPU.
For the Japanese-to-English model, our data comes from the Japanese-English Subtitle Corpus (JESC). The corpus contains 2.8 million training sentence pairs, plus 2000 dev and 2000 test sentences for evaluation. The corpus ships the English and Japanese sentences in separate files, so no further processing is required on our end after download.
For pre-training, we are using data from the JParaCrawl dataset, the largest publicly available English-Japanese parallel corpus created by NTT.
For tokenization, I opted for Google's SentencePiece (as in Assignment 3), which treats each sentence as a raw sequence of Unicode characters with no language-dependent logic. It is applied to both the Japanese and English corpora to split sentences into subword tokens.
The weights and parameters of the best model (TransformerA) have been uploaded to HuggingFace (https://huggingface.co/ngbolin/Jap2Eng).
To evaluate our model, we will use BLEURT, a Transfer Learning-Based Metric for Natural Language Generation. BLEURT takes a sentence pair (reference and candidate) and returns a score that indicates the extent to which the candidate is fluent and conveys the meaning of the reference. We will use the recommended checkpoint of BLEURT-20, which returns a score between 0 and 1 where 0 indicates a random output and 1 a perfect one. Following Google Research's recommendation, we will average the BLEURT scores across the sentences in the corpus.
We are using a series of different models to show how performance varies with architecture. For the first model, we utilised a 2-layer Bidirectional LSTM with Attention, similar to Luong, Pham and Manning (2015). I've selected a batch size of 32 for the LSTM, along with a word embedding size of 300 (similar to Word2Vec). The number of hidden units for the LSTM layers is set at 256.
- batch_size: 32
- word_embeddings: 300 (per word, per language)
- hidden_units: 256 (for the LSTM layers)
- dropout_rate: 0.2
- learning_rate: 5e-4
- num_trials: 5
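The attention step of this model can be illustrated with the "dot" score from Luong et al. (2015): align the current decoder state against every encoder state, softmax the scores, and take the weighted sum as the context vector. A minimal NumPy sketch (toy hidden size of 4 rather than the 256 used above):

```python
import numpy as np

def luong_dot_attention(dec_state, enc_states):
    """Global attention with the 'dot' score from Luong et al. (2015).

    dec_state:  (hidden,)         current decoder hidden state
    enc_states: (src_len, hidden) encoder hidden states
    Returns the context vector and the attention weights.
    """
    scores = enc_states @ dec_state          # (src_len,) alignment scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    context = weights @ enc_states           # (hidden,) weighted sum
    return context, weights

rng = np.random.default_rng(0)
ctx, w = luong_dot_attention(rng.normal(size=4), rng.normal(size=(5, 4)))
print(ctx.shape)  # (4,)
```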
For our benchmark Transformer model, we follow Vaswani et al. (2017). Due to GPU memory constraints, I've elected a smaller batch size (32 instead of the usual 64) and kept the number of encoder and decoder layers at 6. In this model, layer normalisation is applied before each sub-layer, prior to the residual connection, i.e. "Norm and Add" (pre-norm) instead of the original "Add and Norm" (post-norm). Each decoder block contains a self-attention mechanism, encoder-decoder cross-attention, a feed-forward network of size 2048, and a Norm-and-Add step around each sub-layer. While our word embedding size is relatively small (at 512), we still scale the attention logits by the square root of the key dimension, as in Scaled Dot-Product Attention.
- batch_size: 32
- word_embeddings: 512
- dim_feedforward: 2048
- nhead: 8
- num_encoder_layer: 6
- num_decoder_layer: 6
- dropout_rate: 0.1
- learning_rate: 3e-4
- num_trials: 5
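The "Norm and Add" ordering used above can be sketched as a pre-norm residual wrapper: normalize the input, apply the sub-layer (attention or feed-forward), then add the residual. This is a NumPy illustration with toy dimensions, not the model code itself:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (learned gain/bias omitted for brevity).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def prenorm_sublayer(x, sublayer):
    """'Norm and Add': normalize first, apply the sub-layer, then the residual.
    The post-norm original ('Add and Norm') would instead compute
    layer_norm(x + sublayer(x))."""
    return x + sublayer(layer_norm(x))

# Toy feed-forward sub-layer (stand-in for the dim_feedforward=2048 FFN).
W1 = np.ones((8, 16))
W2 = np.ones((16, 8))
ffn = lambda h: np.maximum(h @ W1, 0.0) @ W2  # ReLU between two projections

x = np.random.default_rng(1).normal(size=(3, 8))
out = prenorm_sublayer(x, ffn)
print(out.shape)  # (3, 8)
```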
For our second model, we pre-train the Transformer on a separate Japanese-to-English dataset: JParaCrawl, described above. The preprocess.py script extracts the first 2 million sentence pairs and allocates 99.9% of them to training (and 0.1% to dev). Pre-training on a dataset different from JESC should, in theory, allow our Transformer to generalise better and faster, and setting num_trials to 1 prevents the pre-trained Transformer from over-fitting to the pre-training dataset. Apart from the pre-training step, the hyperparameters for TransformerA and TransformerB are largely similar.
Pre-training:
- batch_size: 16
- word_embeddings: 512
- dim_feedforward: 2048
- nhead: 6
- num_encoder_layer: 6
- num_decoder_layer: 6
- dropout_rate: 0.1
- learning_rate: 3e-4
- num_trials: 1
Fine-tuning:
- batch_size: 32
- word_embeddings: 512
- dim_feedforward: 2048
- nhead: 6
- num_encoder_layer: 6
- num_decoder_layer: 6
- dropout_rate: 0.1
- learning_rate: 3e-4
- num_trials: 5
The steps for all models are largely similar, apart from the arguments listed. For the Bidirectional LSTM with Attention, the hidden_units argument sets the number of hidden units per LSTM layer; for the Transformer models, the nheads argument is exposed instead, owing to the use of Multi-Head Attention.
This generates the following files: (1) vocab.json, containing the word2idx and idx2word dictionaries; (2) src.vocab and tgt.vocab, which function as lookup tables for the translation model to map between tokens and ids; and (3) src.model and tgt.model, the tokenizer models that split Japanese and English text.
python vocab.py --train-src=../data/jpn-eng/JESC/train.ja --train-tgt=../data/jpn-eng/JESC/train.en vocab.json
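The word2idx/idx2word structure of vocab.json can be illustrated as below. The exact keys in the real file follow the CS224N assignment's Vocab class, so the key names here (`src_word2id`, `tgt_word2id`) and the toy entries are illustrative assumptions:

```python
import json

# Illustrative shape of vocab.json: word-to-id dictionaries for the source
# (Japanese) and target (English) sides; the id-to-word direction is just
# the inverted mapping. Key names are assumed, not taken from the real file.
vocab = {
    "src_word2id": {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "猫": 4},
    "tgt_word2id": {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "cat": 4},
}
src_id2word = {i: w for w, i in vocab["src_word2id"].items()}

# Round-trip through JSON, as the model does when loading the vocab file.
decoded = json.loads(json.dumps(vocab, ensure_ascii=False))
print(src_id2word[4], decoded["tgt_word2id"]["cat"])  # 猫 4
```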
This trains the models (in [nmt_model.py]), using the parameters (e.g. embedding size, nheads, dropout rate, batch size) listed in [run.sh]. Where parameters are not explicitly exposed, refer to the code in [nmt_model.py] and adjust accordingly.
sh run.sh train
Decodes the test inputs into candidate translations and evaluates them against the reference outputs using BLEU.
sh run.sh test
For pre-training, we will use data from JParaCrawl, the largest publicly available English-Japanese parallel corpus, created by NTT. The specific version we are using is V2.0, which contains 10.0 million sentence pairs. We have created a script, preprocess.py, that processes the corpus and creates training and dev datasets for each language.
python preprocess.py
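The split preprocess.py performs (first 2 million pairs, 99.9% train / 0.1% dev) can be sketched as below; this is a minimal stand-in run on toy data, and the helper name `split_pairs` is illustrative rather than the script's actual function:

```python
def split_pairs(pairs, n_keep=2_000_000, dev_frac=0.001):
    """Keep the first n_keep sentence pairs and carve off dev_frac for dev.

    Mirrors the 99.9% / 0.1% train/dev allocation described in the text;
    the real preprocess.py also reads and writes the corpus files.
    """
    kept = pairs[:n_keep]
    n_dev = int(len(kept) * dev_frac)
    if n_dev == 0:
        return kept, []
    return kept[:-n_dev], kept[-n_dev:]

# Toy corpus of 10,000 (Japanese, English) pairs.
pairs = [(f"ja_{i}", f"en_{i}") for i in range(10_000)]
train, dev = split_pairs(pairs)
print(len(train), len(dev))  # 9990 10
```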
This generates the following files: (1) vocab.json, containing the word2idx and idx2word dictionaries; (2) src.vocab and tgt.vocab, which function as lookup tables for the translation model to map between tokens and ids; and (3) src.model and tgt.model, the tokenizer models that split Japanese and English text.
python vocab.py --train-src=../data/jpn-eng/JESC/train.ja --train-tgt=../data/jpn-eng/JESC/train.en vocab.json
After data pre-processing, we proceed to pre-train the model. This can be done by running:
sh run.sh pretrain
This trains the models (in [nmt_model.py]), using the parameters (e.g. embedding size, nheads, dropout rate, batch size) listed in [run.sh]. Where parameters are not explicitly exposed, refer to the code in [nmt_model.py] and adjust accordingly.
sh run.sh train
Decodes the test inputs into candidate translations and evaluates them against the reference outputs using BLEU.
sh run.sh test
Using our Bidirectional LSTM, we obtained a BLEURT score of 0.414 on the holdout dataset. Our Transformer B(asic) achieved 0.477, while our Transformer A(dvanced) achieved 0.480.
In our simple example, we observe that Transformers do much better than the Bidirectional LSTM with Attention, with both Transformer models improving BLEURT by more than 15%. Pre-training, however, does not seem to improve the score much: it yields only a further 0.6% improvement.
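The relative improvements quoted above follow directly from the reported scores; spelling the arithmetic out:

```python
# BLEURT scores reported above.
lstm = 0.414
transformer_basic = 0.477     # TransformerB (no pre-training)
transformer_advanced = 0.480  # TransformerA (pre-trained)

# Relative improvement over the BiLSTM baseline, in percent.
print(round(100 * (transformer_basic - lstm) / lstm, 1))      # 15.2
print(round(100 * (transformer_advanced - lstm) / lstm, 1))   # 15.9

# Additional gain from pre-training alone, in percent.
print(round(100 * (transformer_advanced - transformer_basic) / transformer_basic, 1))  # 0.6
```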