A task-agnostic facility to pretrain/finetune and use a Transformer-based model to classify your DNA/RNA sequences.

This repo is a wrapper around HuggingFace adapted to biological sequences. It makes it easy to pretrain, finetune, and use a HuggingFace-powered transformer architecture for a classification task.
Follow the recommendations to install PyTorch on your system. Example:

```
conda install pytorch -c pytorch
```

Then, install TransforKmers with pip:

```
pip install git+https://github.com/IGDRion/transforkmers
```

The command below runs the model on every sequence of the fasta and writes the results to a csv file.
```
transforkmers predict \
--model_path_or_name /path/to/pretrained/model/dir/ \
--tokenizer /path/to/tokenizer/dir/ \
--input /path/to/input.fa
```
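For downstream analysis, you can load that csv with pandas. The sketch below is hypothetical: the file name and column names used here (`predictions.csv`, per-class `class_*` score columns) are assumptions, not the tool's documented output format.

```python
# Hypothetical sketch of post-processing the predictions csv.
# File name and column layout are assumptions, not documented output.
import pandas as pd

preds = pd.read_csv("predictions.csv")
print(preds.head())

# If the file holds one score column per class, the predicted class is the argmax:
score_cols = [c for c in preds.columns if c.startswith("class_")]
if score_cols:
    preds["predicted"] = preds[score_cols].idxmax(axis=1)
```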
If you want to train a new model from scratch, several steps are provided below.

- For real examples, check the scripts used in `/scripts/`.
- If you want to use an already pretrained model, go directly to step 4. See also its use in ANNEXA for gene TSS (Transcription Start Site) classification.
**1. Create a tokenizer**

First, you need a tokenizer to split your input sequences into kmers and label them.
```
transforkmers tokenizer \
--k 6 \
--alphabet "A,T,C,G" \
--max-len 512 \
--output config/tokenizer_k6_512/
```

If you want to use amino acid sequences, use all 20 amino acids in the alphabet and set k to 1.
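For intuition, the tokenizer splits each sequence into overlapping k-mers before mapping them to vocabulary ids. A minimal sketch of that splitting (illustrative only; the actual TransforKmers implementation may differ):

```python
def kmerize(sequence: str, k: int = 6) -> list[str]:
    """Split a sequence into overlapping k-mers (illustrative only)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmerize("ATCGATCG", k=6))
# ['ATCGAT', 'TCGATC', 'CGATCG']
```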
**2. Create a pretraining dataset**

The pretraining is done with the Masked Language Model (MLM) method. To do so, you first need to pick random sequences inside a reference genome, which can be done with the command below:
```
transforkmers pretrain-dataset \
--fasta /path/to/reference/fasta \
--num-seq 2000000
```
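Conceptually, this step amounts to drawing random fixed-length windows from the reference. A minimal illustrative sketch (the window length and sampling strategy here are assumptions; the tool's own logic may differ):

```python
# Illustrative only: draw random fixed-length windows from a reference fasta.
import random
from Bio import SeqIO  # biopython

def sample_windows(fasta_path, num_seq, length=512):
    # Keep only records long enough to hold a full window.
    records = [r for r in SeqIO.parse(fasta_path, "fasta") if len(r) > length]
    windows = []
    for _ in range(num_seq):
        rec = random.choice(records)
        start = random.randrange(len(rec) - length)
        windows.append(str(rec.seq[start:start + length]).upper())
    return windows
```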
**3. Pretrain your model**

Once you have your pretraining dataset and your tokenizer, you can pretrain your transformer-based model from scratch. To see all the available models, please check HuggingFace.

```
transforkmers pretrain \
--output_dir "models" \
--model_path_or_name "HuggingFaceModelType" \
--tokenizer /path/to/tokenizer/dir/ \
--dataset /path/to/dataset.txt
```

Overload your model config with:
```
--overload_config "hidden_size=516,num_hidden_layers=8"
```
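For intuition, MLM pretraining masks a fraction of the k-mer tokens and trains the model to recover them. A minimal sketch using the standard HuggingFace collator (TransforKmers may wire this up differently; the tokenizer path and example k-mers are placeholders):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Assumes the tokenizer directory from step 1; the tokenizer must define a mask token.
tokenizer = AutoTokenizer.from_pretrained("config/tokenizer_k6_512/")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("ATCGAT TCGATC CGATCG GATCGA")
batch = collator([encoding])
# ~15% of tokens are replaced by [MASK]; `labels` keeps the original ids at
# masked positions and -100 elsewhere, so the loss is computed only there.
print(batch["input_ids"])
print(batch["labels"])
```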
**4. Create a finetuning dataset**

Once your model is pretrained, it can be used in any classification task. First, you have to create your finetuning dataset. `--inputs` takes a list of fastas and maps each sequence to its rank (from 0); the `--split` percentages correspond to the train, validation, and test subsets used below.
```
transforkmers finetune-dataset \
--inputs \
/path/to/a.fa \ # Each sequence will be class 0.
/path/to/b.fa \ # Each sequence will be class 1.
/path/to/c.fa \ # Each sequence will be class 2.
/path/to/d.fa \ # Each sequence will be class 3.
--split 75,10,15 \
--max-len 512
```
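Conceptually, this builds (sequence, label) pairs where the label is the rank of the fasta the sequence came from. An illustrative sketch of that mapping (the command's actual output format may differ):

```python
# Illustrative only: map each sequence to the rank of its source fasta.
from Bio import SeqIO  # biopython

fastas = ["/path/to/a.fa", "/path/to/b.fa", "/path/to/c.fa", "/path/to/d.fa"]
examples = [
    (str(record.seq), label)          # label 0 for a.fa, 1 for b.fa, ...
    for label, path in enumerate(fastas)
    for record in SeqIO.parse(path, "fasta")
]
```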
**5. Finetune your model**

You can now finetune your model on the previously generated train and valid datasets.

```
transforkmers finetune \
--model_path_or_name /path/to/pretrained/model/dir/ \
--tokenizer /path/to/tokenizer/dir/ \
--train_dataset /path/to/train.csv \
--eval_dataset /path/to/eval.csv
```
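Under the hood this is standard HuggingFace sequence classification. A rough sketch of the equivalent raw HuggingFace setup (paths, column names, and hyperparameters here are assumptions, not TransforKmers' actual internals):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer/dir/")
model = AutoModelForSequenceClassification.from_pretrained(
    "/path/to/pretrained/model/dir/",
    num_labels=4,  # one class per input fasta from the previous step
)

# Assumes the generated csv has a "sequence" column of space-separated k-mers
# and a "label" column; the real column names may differ.
ds = load_dataset("csv", data_files={"train": "train.csv", "eval": "eval.csv"})
ds = ds.map(lambda ex: tokenizer(ex["sequence"], truncation=True,
                                 padding="max_length", max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=3),
    train_dataset=ds["train"],
    eval_dataset=ds["eval"],
)
trainer.train()
```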
**6. Test your model**

After the finetuning, you can test your model with your test dataset.

```
transforkmers test \
--model_path_or_name /path/to/finetuned/model/dir/ \
--tokenizer /path/to/tokenizer/dir/ \
--test_dataset /path/to/test.csv
```
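Continuing the hypothetical finetuning sketch above, the equivalent raw HuggingFace evaluation on the held-out split would look like this (same column-name assumptions apply):

```python
# Continuing the finetuning sketch: evaluate on the held-out test split.
test_ds = load_dataset("csv", data_files={"test": "test.csv"})["test"]
test_ds = test_ds.map(lambda ex: tokenizer(ex["sequence"], truncation=True,
                                           padding="max_length", max_length=512))
print(trainer.evaluate(eval_dataset=test_ds))
```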