Wublab/SeqGenerator
SeqGenerator

Enabling Diverse Enzyme Design through a Function-Oriented, Sequence-Driven Diffusion Model

This repository contains the training and generation code for our protein sequence diffusion model.

Setup

The code is based on PyTorch and HuggingFace Transformers. Install the dependencies with:

pip install -r requirements.txt 

Datasets

Prepare the datasets and place them under the datasets folder; see datasets/aspartese for an example.
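The expected layout is a per-family folder containing a train and a validation split. A minimal sketch of setting this up is below; note that the CSV column name "sequence" is an assumption here, so check the repository's data-loading code for the exact schema.

```python
# Sketch: create a dataset folder in the layout the README describes.
# ASSUMPTION: one column named "sequence" per CSV -- verify against the
# repo's data loader before relying on this.
import csv
import os

data_dir = "datasets/aspartese"
os.makedirs(data_dir, exist_ok=True)

rows = [["sequence"], ["MKTAYIAKQR"], ["MSLLTEVETP"]]  # toy amino-acid sequences
for split in ("train.csv", "valid.csv"):
    with open(os.path.join(data_dir, split), "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

The training script is then pointed at this folder via --data_dir.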

Training

cd scripts
bash train.sh

Arguments explanation:

  • --max_len: the maximum length of the natural sequences
  • --min_len: the minimum length of the natural sequences
  • --dataset: the dataset name (used for notation only)
  • --data_dir: the path to the saved dataset folder, containing train.csv and valid.csv
  • --resume_checkpoint: if set, restore this checkpoint and continue training
  • --model_path: the path to the pretrained ESM-2 model; we use "esm2_t30_150M_UR50D", which can be downloaded here and placed in "diffusion_models/esm_orig"
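The flags above can be sketched as an argparse declaration; the flag names mirror the README, but the defaults and help strings below are illustrative assumptions, not the repository's actual values.

```python
# Sketch of how train.sh's flags might be declared; defaults are
# ASSUMPTIONS for illustration -- consult scripts/train.sh for real values.
import argparse

parser = argparse.ArgumentParser(description="SeqGenerator training (sketch)")
parser.add_argument("--max_len", type=int, default=512,
                    help="maximum length of the natural sequences")
parser.add_argument("--min_len", type=int, default=50,
                    help="minimum length of the natural sequences")
parser.add_argument("--dataset", type=str, default="aspartese",
                    help="dataset name (for notation only)")
parser.add_argument("--data_dir", type=str, default="datasets/aspartese",
                    help="folder containing train.csv and valid.csv")
parser.add_argument("--resume_checkpoint", type=str, default=None,
                    help="if set, restore this checkpoint and continue training")
parser.add_argument("--model_path", type=str, default="diffusion_models/esm_orig",
                    help="path to the pretrained ESM-2 model")

args = parser.parse_args([])  # parse defaults only, for demonstration
print(args.data_dir)  # -> datasets/aspartese
```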

Generating

You need to set model_dir to the path of the model obtained in the training stage.

cd scripts
bash run_decode.sh

Arguments explanation:

  • --model_dir: the model obtained in the training stage; our trained model can be accessed here. For generation, put the model and 'training_args.json' in this folder
  • --seq_len_sample: if set, generated sequence lengths are sampled from the lengths of the family's natural sequences
  • --max_len: the maximum length of the generated sequence
  • --min_len: the minimum length of the generated sequence
  • --seq_num: the number of sequences generated
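The interplay of --seq_len_sample, --min_len, --max_len, and --seq_num can be sketched as follows; the function and variable names here are illustrative assumptions, not the repository's API.

```python
# Sketch of --seq_len_sample: draw generated-sequence lengths from the
# empirical lengths of the family's natural sequences, restricted to
# [min_len, max_len]. Names are ASSUMPTIONS for illustration.
import random

def sample_lengths(natural_lengths, seq_num, min_len, max_len, seed=0):
    rng = random.Random(seed)
    # keep only lengths inside the allowed window, then sample with replacement
    pool = [n for n in natural_lengths if min_len <= n <= max_len]
    return [rng.choice(pool) for _ in range(seq_num)]

# toy lengths standing in for a real family's sequence lengths
lengths = sample_lengths([310, 325, 298, 330, 315],
                         seq_num=3, min_len=250, max_len=400)
```

Each generated sequence then uses one sampled length, so the output length distribution roughly follows the natural family's.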

Acknowledgements

The code in this project is based on DiffuSeq and ESM-2. Special thanks to the original authors for their contributions to the open-source community.
