Burak Suyunu, Muhammed Emin Güre
Department of Computer Engineering, Boğaziçi University
CmpE 58T - Adv. Natural Language Processing (Spring 2021)
- python >= 3.6
- TensorFlow >= 2
Package dependencies are listed in the requirements.txt file.
To install all of them, run pip install -r requirements.txt.
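The version requirements above can be checked before running anything else. The snippet below is a minimal sketch, not part of the repository: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the only facts taken from this README are the two minimum versions (Python 3.6 and TensorFlow 2).

```python
import sys


def parse_version(text):
    """Turn '2.4.1' into (2, 4, 1), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Return a list of human-readable problems (empty if all is fine)."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python >= 3.6 is required")
    try:
        import tensorflow as tf
    except ImportError:
        problems.append("TensorFlow is not installed")
    else:
        if not meets_minimum(tf.__version__, "2"):
            problems.append("TensorFlow >= 2 is required, found " + tf.__version__)
    return problems
```

Calling `check_environment()` returns an empty list when both requirements are met, so it can be printed or asserted on at the top of a notebook.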
cd src
python mlsum_data_prep.py
You can directly download the preprocessed Turkish dataset using the following link: https://drive.google.com/drive/folders/1f3Q3OGWX3BIgHu_8PuO8LjszzFwqi_oG?usp=sharing
The CNN/Daily Mail dataset available online is stored in binary files. However, the pipeline for using binary files is inefficient, so in this project we use TensorFlow's TFRecord files to store the datasets.
You can obtain the raw and preprocessed dataset from this link: https://github.com/steph1793/CNN-DailyMail-Bin-To-TFRecords
You can also directly download the preprocessed dataset from this link: https://drive.google.com/drive/folders/1_-GvNvL1DB8t0tjZgjHfm6JbChgw6jG4?usp=sharing
To be able to run the model with pretrained embeddings, you need to generate the embedding matrix of the corresponding pre-trained word embedding method.
To generate the embedding matrices yourself:
1. Go to src/embeddings/
2. Choose the method for which you want to generate an embedding matrix and open the corresponding file (glove.py for pre-trained English GloVe embeddings).
3. Take the required actions and make the path changes written as comments, if there are any.
3.1. For nnlm.py | use.py | use-tr.py, you don't need to make any adjustments.
4. Run the code: python embedding_name.py (python glove.py for English GloVe)
You can directly download the embedding matrices from this link: https://drive.google.com/drive/folders/18a-keUl5GAAQmZf-i3lUwYv2z8YPh888?usp=sharing
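To give an idea of what an embedding matrix is, here is a minimal sketch of how one could be built from a vocabulary and a set of pre-trained word vectors. It is not the code in src/embeddings/: the function names, the random initialization for words without a pre-trained vector, and the assumption that the .pk file is a pickled NumPy array with one row per vocabulary entry are all illustrative guesses.

```python
import pickle

import numpy as np


def build_embedding_matrix(vocab, vectors, embed_size, seed=0):
    """Build a (len(vocab), embed_size) matrix aligned with the vocab order.

    vocab:   list of words, where the list index is the word id (assumed).
    vectors: dict mapping a word to its pre-trained vector.
    Words without a pre-trained vector keep a small random vector.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(vocab), embed_size)).astype(np.float32)
    for row, word in enumerate(vocab):
        vector = vectors.get(word)
        if vector is None:
            continue
        if len(vector) != embed_size:
            raise ValueError("vector for %r has wrong size" % word)
        matrix[row] = vector
    return matrix


def save_embedding_matrix(matrix, path):
    """Pickle the matrix to disk (the .pk format is assumed to be pickle)."""
    with open(path, "wb") as handle:
        pickle.dump(matrix, handle)


if __name__ == "__main__":
    toy_vocab = ["unknown_word", "cat", "dog"]
    toy_vectors = {"cat": [1.0, 2.0], "dog": [3.0, 4.0]}
    print(build_embedding_matrix(toy_vocab, toy_vectors, embed_size=2))
```

Whatever the actual scripts do, the number of columns in the matrix is the value to pass as --embed_size when training with --pt_embedding.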
For English Models:
python src/main.py --mode=train --data_dir=/path/to/tfrecords_finished_files/chunked_train --vocab_path=/path/to/tfrecords_finished_files/vocab --checkpoint_dir=/path/to/Checkpoints/embedding_name
For English Models with Embedding:
python src/main.py --mode=train --data_dir=/path/to/tfrecords_finished_files/chunked_train --vocab_path=/path/to/tfrecords_finished_files/vocab --checkpoint_dir=/path/to/Checkpoints/embedding-name --pt_embedding=/path/to/embeddings/embedding-name_embedding_matrix.pk --embed_size=embedding-dimension
For Turkish Models:
python src/main.py --mode=train --data_dir=/path/to/mlsum/train.tfrecords --vocab_path=/path/to/mlsum/vocab --checkpoint_dir=/path/to/Checkpoints/embedding_name
For Turkish Models with Embedding:
python src/main.py --mode=train --data_dir=/path/to/mlsum/train.tfrecords --vocab_path=/path/to/mlsum/vocab --checkpoint_dir=/path/to/Checkpoints/embedding-name --pt_embedding=/path/to/embeddings/embedding-name_embedding_matrix.pk --embed_size=embedding-dimension
We trained each model for 5000 iterations and saved the results as checkpoints. You can directly download the pretrained models from this link: https://drive.google.com/drive/folders/1-aIVMb4jCJF515KcSpH2uLSfV88ZqxlK?usp=sharing
Pass the pretrained model directory as the --checkpoint_dir parameter to continue training or to evaluate/test with it.
While testing and evaluating results, batch size and beam size must be equal.
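Since the commands in this README differ only in a handful of flags, they can be assembled programmatically. The helper below is a hypothetical sketch (`build_command` is not part of the repository); it only uses flags that appear in this README and enforces the rule that batch size and beam size must match in eval and test mode. The defaults of main.py are not documented here, so the sketch asks for both sizes explicitly in those modes.

```python
import shlex


def build_command(mode, data_dir, vocab_path, checkpoint_dir,
                  pt_embedding=None, embed_size=None,
                  batch_size=None, beam_size=None):
    """Return the argument list for one run of src/main.py."""
    if mode in ("eval", "test"):
        # Rule from the README: batch size and beam size must be equal.
        if batch_size is None or beam_size is None or batch_size != beam_size:
            raise ValueError("batch_size and beam_size must be set and equal "
                             "in eval/test mode")
    args = [
        "python", "src/main.py",
        "--mode=%s" % mode,
        "--data_dir=%s" % data_dir,
        "--vocab_path=%s" % vocab_path,
        "--checkpoint_dir=%s" % checkpoint_dir,
    ]
    # Optional flags are appended only when a value was given.
    optional = [
        ("pt_embedding", pt_embedding),
        ("embed_size", embed_size),
        ("batch_size", batch_size),
        ("beam_size", beam_size),
    ]
    for name, value in optional:
        if value is not None:
            args.append("--%s=%s" % (name, value))
    return args


if __name__ == "__main__":
    command = build_command(
        "eval",
        data_dir="/path/to/mlsum/val.tfrecords",
        vocab_path="/path/to/mlsum/vocab",
        checkpoint_dir="/path/to/Checkpoints/embedding-name",
        batch_size=4,
        beam_size=4,
    )
    print(" ".join(shlex.quote(part) for part in command))
```

The returned list can be passed directly to `subprocess.run`, which avoids quoting problems with paths that contain spaces.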
When mode is set to eval, the code outputs the ROUGE score according to the files found in --data_dir. Default number to evaluate ROUGE score is 5.
When mode is set to test, the code writes some example summaries to the directory given by the test_save_dir parameter.
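For readers unfamiliar with ROUGE, the sketch below shows the idea behind its simplest variant, ROUGE-1, which measures unigram overlap between a reference summary and a generated one. This is an illustration only: this README does not state which ROUGE variants or which library the eval mode uses, and the whitespace tokenization and the function name `rouge_1` are simplifying assumptions.

```python
from collections import Counter


def rouge_1(reference, candidate):
    """Return precision, recall, and F1 of the unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
    print(scores)
```

In the example, five of the six candidate words also occur in the reference, so precision, recall, and F1 are all 5/6.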
For English Models with Embedding:
python src/main.py --mode=eval --data_dir=/path/to/tfrecords_finished_files/chunked_val --vocab_path=/path/to/tfrecords_finished_files/vocab --checkpoint_dir=/path/to/Checkpoints/embedding-name --pt_embedding=/path/to/embeddings/embedding-name_embedding_matrix.pk --embed_size=embedding-dimension --batch_size=4 --beam_size=4
For Turkish Models with Embedding:
python src/main.py --mode=eval --data_dir=/path/to/mlsum/val.tfrecords --vocab_path=/path/to/mlsum/vocab --checkpoint_dir=/path/to/Checkpoints/embedding-name --pt_embedding=/path/to/embeddings/embedding-name_embedding_matrix.pk --embed_size=embedding-dimension --batch_size=4 --beam_size=4
cd src
streamlit run app.py
The code base is taken from https://github.com/steph1793/Pointer_Generator_Summarizer
