- First, fork the repository by clicking on the 'Fork' button in the repository. Then, clone your fork locally with:
git clone git@github.com:<your GitHub username>/bio-datasets.git
cd bio-datasets
git remote add upstream https://github.com/DeepChainBio/bio-datasets.git
- Set up your environment and install the pre-commit hooks by following the project setup section.
- Create a new branch:
git checkout -b contribute/name-of-your-dataset
- Prepare your dataset. It must contain the following 3 files:
- Complete your dataset with embeddings for some of its feature sequences:
  - You can add one embeddings file per [sequence, model, pooling_type] combination, e.g. sequence1_protbert_cls_embeddings.npy.
  - The given model should be the one used to compute these embeddings (e.g. protbert).
  - The pooling type refers to the way the embeddings are extracted from the last layer of the pre-trained Transformer, i.e. cls or mean.
  - Each embeddings file should be named as follows: <column_name_in_dataset.csv>_<model_name_to_compute_embeddings>_<pooling_type>_embeddings.npy.
  - For now, only this format (.csv for the dataset + .npy files for embeddings) is supported. The plan is to integrate different formats in the coming weeks!
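The naming convention above can be captured in a small helper. This is a minimal Python sketch and not part of bio-datasets: the function names `embeddings_filename` and `parse_embeddings_filename` are hypothetical, and it assumes that model names contain no underscore and that cls and mean are the only pooling types.

```python
from typing import Tuple

# Pooling types named in the guidelines above.
POOLING_TYPES = ("cls", "mean")
SUFFIX = "_embeddings.npy"


def embeddings_filename(column: str, model: str, pooling: str) -> str:
    """Build <column>_<model>_<pooling>_embeddings.npy (hypothetical helper)."""
    if pooling not in POOLING_TYPES:
        raise ValueError(f"pooling must be one of {POOLING_TYPES}, got {pooling!r}")
    if "_" in model:
        # Assumption of this sketch, so that file names can be parsed back.
        raise ValueError("model name must not contain an underscore")
    return f"{column}_{model}_{pooling}{SUFFIX}"


def parse_embeddings_filename(filename: str) -> Tuple[str, str, str]:
    """Split a file name back into (column, model, pooling)."""
    if not filename.endswith(SUFFIX):
        raise ValueError(f"{filename!r} does not end with {SUFFIX!r}")
    stem = filename[: -len(SUFFIX)]
    # Column names may contain underscores, so split from the right.
    parts = stem.rsplit("_", 2)
    if len(parts) != 3 or parts[2] not in POOLING_TYPES:
        raise ValueError(f"{filename!r} does not follow the naming convention")
    return parts[0], parts[1], parts[2]


print(embeddings_filename("sequence1", "protbert", "cls"))
# prints: sequence1_protbert_cls_embeddings.npy
```

Running the parser over the .npy files of a dataset before opening the pull request is one way to catch a misnamed file early.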
- Create a folder for your dataset in datasets/ and add the description.md file to it.
  - The remaining files will need to be added to a Google bucket. This is explained in the template with which you will open your pull request.
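If you prefer to script the folder creation, the step above could look like the following. It is only a sketch: `scaffold_dataset_folder` is a hypothetical helper, not a function provided by the repository, and the example runs in a temporary directory rather than in your clone.

```python
import tempfile
from pathlib import Path


def scaffold_dataset_folder(repo_root: str, dataset_name: str) -> Path:
    """Create datasets/<dataset_name>/ containing a description.md file."""
    folder = Path(repo_root) / "datasets" / dataset_name
    folder.mkdir(parents=True, exist_ok=True)
    description = folder / "description.md"
    # touch() leaves the content of an existing description.md untouched.
    description.touch(exist_ok=True)
    return description


# Demonstration in a throwaway directory; pass the path of your clone instead.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold_dataset_folder(tmp, "my_dataset")
    print(created.relative_to(tmp).as_posix())
    # prints: datasets/my_dataset/description.md
```

The created description.md is empty; its content still has to be written by hand.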
- Commit your changes:
git add datasets/<your_dataset_name>
git commit
- Rebase on the upstream main branch and push your changes:
git fetch upstream
git rebase upstream/main
git push -u origin contribute/name-of-your-dataset
- Go to the webpage of your fork on GitHub. Click on "Pull request" and use the PR template new_dataset_pull_request_template.md.
  You can also create the PR directly by copy/pasting https://github.com/DeepChainBio/bio-datasets/compare/main...contribute/name-of-your-dataset?template=new_dataset_pull_request_template.md&title=Contribute+%3Cdataset_name%3E+&labels=dataset into your browser, replacing contribute/name-of-your-dataset with the name of your branch.
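The direct PR link described above can also be generated rather than edited by hand. The sketch below reproduces the URL pattern given in this guide; `pull_request_url` is a hypothetical helper, and the branch and dataset names in the example are placeholders.

```python
from urllib.parse import urlencode

# Base of the compare URL quoted in this guide.
BASE = "https://github.com/DeepChainBio/bio-datasets/compare"


def pull_request_url(branch: str, dataset_name: str) -> str:
    """Build the pre-filled pull request URL for a contribution branch."""
    query = urlencode(
        {
            "template": "new_dataset_pull_request_template.md",
            "title": f"Contribute {dataset_name}",
            "labels": "dataset",
        }
    )
    return f"{BASE}/main...{branch}?{query}"


print(pull_request_url("contribute/my-dataset", "my_dataset"))
# prints: https://github.com/DeepChainBio/bio-datasets/compare/main...contribute/my-dataset?template=new_dataset_pull_request_template.md&title=Contribute+my_dataset&labels=dataset
```

`urlencode` takes care of escaping spaces and special characters in the title, which is the part most easily broken when editing the link manually.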
- Complete the checklist, then wait for your PR to be reviewed and your dataset to be added.
All good, you've officially contributed a publicly available protein dataset. 🚀
- Create the conda environment:
conda env create -f environment.yaml
- Install the pre-commit hooks:
conda activate biodatasets
pre-commit install