Contributing

Add a dataset 📕

  1. Fork the repository by clicking the 'Fork' button on the repository page, then clone your fork locally and add the original repository as the upstream remote:
git clone git@github.com:<your Github username>/bio-datasets.git
cd bio-datasets
git remote add upstream https://github.com/DeepChainBio/bio-datasets.git
  1. Set up your environment and install the pre-commit hooks by following the Project setup section.

  2. Create a new branch: git checkout -b contribute/name-of-your-dataset.

  3. Prepare your dataset. It must contain the following 3 files:

    • dataset.csv - which is your dataset (all your features + targets) without any embeddings.
    • info.json - an information file for the hub frontend - you can find a template here.
    • description.md - a description file for your dataset - you can find a template here.
  4. Complete your dataset with embeddings for some of its feature sequences:

    • You can add one embeddings file per (sequence, model, pooling type) combination, e.g. sequence1_protbert_cls_embeddings.npy.
    • The model name should be that of the model used to compute the embeddings (e.g. protbert).
    • The pooling type refers to the way the embeddings are extracted from the last layer of the pre-trained Transformer, i.e. cls or mean.
    • Each embeddings file should be named as follows: <column_name_in_dataset.csv>_<model_name_to_compute_embeddings>_<pooling_type>_embeddings.npy.
    • For now, only this format (.csv for the dataset + .npy files for embeddings) is supported. The plan is to integrate other formats in the coming weeks!
  5. Create a folder for your dataset in datasets/ and add the description.md file in it.

    • The remaining files must be uploaded to a Google bucket, as explained in the template you will use to open your pull request.
  6. Commit your changes:

git add datasets/<your_dataset_name>
git commit
  1. Rebase on the upstream main branch and push your changes:
git fetch upstream
git rebase upstream/main
git push -u origin contribute/name-of-your-dataset
  1. Go to the webpage of your fork on GitHub. Click on "Pull request" and use the PR template new_dataset_pull_request_template.md.

You can also create the PR directly by opening the following URL, replacing contribute/name-of-your-dataset with the name of your branch: https://github.com/DeepChainBio/bio-datasets/compare/main...contribute/name-of-your-dataset?template=new_dataset_pull_request_template.md&title=Contribute+%3Cdataset_name%3E+&labels=dataset

  1. Complete the checklist, then wait for your PR to be reviewed and your dataset to be added.
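
Before opening the pull request, the file requirements described in the steps above can be checked locally. The following is a minimal Python sketch and is not part of bio-datasets: the helper names embeddings_filename and check_dataset_folder are hypothetical, and it assumes that dataset.csv, info.json, description.md and the .npy files have been gathered in a single local folder and that the first row of dataset.csv holds the column names.

```python
import csv
from pathlib import Path

# Requirements taken from the steps above.
REQUIRED_FILES = ("dataset.csv", "info.json", "description.md")
POOLING_TYPES = ("cls", "mean")
SUFFIX = "_embeddings.npy"


def embeddings_filename(column: str, model: str, pooling: str) -> str:
    """Build <column>_<model>_<pooling>_embeddings.npy (hypothetical helper)."""
    if pooling not in POOLING_TYPES:
        raise ValueError(f"pooling must be one of {POOLING_TYPES}, got {pooling!r}")
    return f"{column}_{model}_{pooling}{SUFFIX}"


def check_dataset_folder(folder) -> list:
    """Return a list of problems found in a local dataset folder (hypothetical helper)."""
    folder = Path(folder)
    problems = [
        f"missing {name}" for name in REQUIRED_FILES if not (folder / name).is_file()
    ]

    csv_path = folder / "dataset.csv"
    if not csv_path.is_file():
        # Without dataset.csv the embeddings file names cannot be checked.
        return problems

    # Assumption: the header row of dataset.csv lists the column names.
    with open(csv_path, newline="") as handle:
        columns = next(csv.reader(handle), [])

    for path in sorted(folder.glob("*.npy")):
        name = path.name
        if not name.endswith(SUFFIX):
            problems.append(f"{name}: expected the name to end with {SUFFIX}")
            continue
        stem = name[: -len(SUFFIX)]
        # The pooling type is the last underscore-separated token of the stem.
        rest, _, pooling = stem.rpartition("_")
        if pooling not in POOLING_TYPES:
            problems.append(
                f"{name}: pooling type {pooling!r} is not one of {POOLING_TYPES}"
            )
            continue
        # What remains must be "<column>_<model>" with a non-empty model name.
        if not any(
            rest.startswith(col + "_") and len(rest) > len(col) + 1 for col in columns
        ):
            problems.append(
                f"{name}: prefix is not a dataset.csv column followed by a model name"
            )
    return problems
```

Calling check_dataset_folder on the folder returns an empty list when nothing is missing or misnamed, and embeddings_filename("sequence1", "protbert", "cls") gives the example name from step 4. Matching the column name as a prefix, rather than splitting on every underscore, lets column and model names contain underscores themselves.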

All good, you've officially contributed a publicly available protein dataset. 🚀

Project setup 🤓

  • Create the conda environment:
conda env create -f environment.yaml
  • Activate the environment and install the pre-commit hooks:
conda activate biodatasets
(biodatasets) pre-commit install