Contributing

Add a dataset 📕

First, fork the repository by clicking the 'Fork' button on the repository page. Then clone your fork locally and add the upstream remote:

git clone git@github.com:<your GitHub username>/bio-datasets.git
cd bio-datasets
git remote add upstream https://github.com/DeepChainBio/bio-datasets.git

Set up your environment and install the pre-commit hooks by following the Project setup section.
Create a new branch: git checkout -b contribute/name-of-your-dataset.
Prepare your dataset. It must contain the following three files:
- dataset.csv - your dataset (all your features and targets), without any embeddings.
- info.json - an information file for the hub frontend - you can find a template here.
- description.md - a description file for your dataset - you can find a template here.
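
As a quick sanity check before opening a pull request, the presence of these three files can be verified with a short script. This is only a sketch: the helper name missing_dataset_files and the folder path my_dataset are hypothetical and not part of the repository.

```python
from pathlib import Path

# File names required by this guide for every dataset.
REQUIRED_FILES = ("dataset.csv", "info.json", "description.md")


def missing_dataset_files(folder):
    """Return the required files that are absent from `folder` (hypothetical helper)."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


if __name__ == "__main__":
    # "my_dataset" is a placeholder path; point it at your own prepared folder.
    missing = missing_dataset_files("my_dataset")
    print("Missing files:", missing or "none")
```

An empty list means the folder holds all three files; it says nothing about whether their contents follow the templates.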
Complete your dataset with embeddings for some of its sequence features:
- You can add one embeddings file per (sequence, model, pooling type) combination, e.g. sequence1_protbert_cls_embeddings.npy.
- The model name should be the one used to compute these embeddings (e.g. protbert).
- The pooling type refers to how the embeddings are extracted from the last layer of the pre-trained Transformer, i.e. cls or mean.
- Each embeddings file should be named as follows: <column_name_in_dataset.csv>_<model_name_to_compute_embeddings>_<pooling_type>_embeddings.npy.
- For now, only this format (.csv for the dataset + .npy files for embeddings) is supported. Other formats are planned for the coming weeks!
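
The naming convention above can be sketched as a small helper. The function name embeddings_filename is hypothetical, and the check against cls and mean assumes these are the only two pooling types accepted, as the list above suggests.

```python
# Pooling types named in this guide (assumed to be the only valid ones).
POOLING_TYPES = ("cls", "mean")


def embeddings_filename(column_name, model_name, pooling_type):
    """Build <column>_<model>_<pooling>_embeddings.npy (hypothetical helper)."""
    if pooling_type not in POOLING_TYPES:
        raise ValueError(
            f"pooling_type must be one of {POOLING_TYPES}, got {pooling_type!r}"
        )
    return f"{column_name}_{model_name}_{pooling_type}_embeddings.npy"


print(embeddings_filename("sequence1", "protbert", "cls"))
# prints: sequence1_protbert_cls_embeddings.npy
```

With NumPy installed, an array of embeddings could then be written with numpy.save, passing the generated name as the file argument.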
Create a folder for your dataset in datasets/ and add the description.md file to it.
- The remaining files need to be uploaded to a Google bucket, as explained in the template you will use to open your pull request.
Commit your changes:

git add datasets/<your_dataset_name>
git commit

Rebase on the upstream main branch and push your changes:

git fetch upstream
git rebase upstream/main
git push -u origin contribute/name-of-your-dataset

Go to the webpage of your fork on GitHub. Click on "Pull request" and use the PR template new_dataset_pull_request_template.md.

Alternatively, you can create the PR directly by opening the following URL, replacing contribute/name-of-your-dataset with the name of your branch: https://github.com/DeepChainBio/bio-datasets/compare/main...contribute/name-of-your-dataset?template=new_dataset_pull_request_template.md&title=Contribute+%3Cdataset_name%3E+&labels=dataset
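
If you prefer to generate that URL rather than edit it by hand, a minimal sketch using only the Python standard library could look like this. The helper name pull_request_url is hypothetical; the base URL, template name, and label are copied from the link above.

```python
from urllib.parse import quote, urlencode

# Base compare URL and template name are taken from this guide.
BASE_URL = "https://github.com/DeepChainBio/bio-datasets/compare"
TEMPLATE = "new_dataset_pull_request_template.md"


def pull_request_url(branch, dataset_name):
    """Return the prefilled PR URL for `branch` (hypothetical helper)."""
    query = urlencode(
        {
            "template": TEMPLATE,
            "title": f"Contribute {dataset_name}",
            "labels": "dataset",
        }
    )
    # Keep "/" unescaped so branch names like contribute/<name> stay readable.
    return f"{BASE_URL}/main...{quote(branch, safe='/')}?{query}"


print(pull_request_url("contribute/my-dataset", "my-dataset"))
```

Here my-dataset is a placeholder; pass your own branch and dataset names.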

Complete the checklist, then wait for your PR to be reviewed and your dataset to be added.

All good, you've officially contributed a publicly available protein dataset. 🚀

Project setup 🤓

Create the conda environment:

conda env create -f environment.yaml

Activate the environment and install the pre-commit hooks:

conda activate biodatasets
pre-commit install
