Protify provides several ways to view and explore the supported models and datasets. This documentation explains how to use these features.
The main README.md file contains expandable toggle sections for both models and datasets:
- Currently Supported Models: Click the toggle to expand and see a complete table of models with their descriptions, sizes, and types.
- Currently Supported Datasets: Click the toggle to expand and see a complete table of datasets with their descriptions, types, and tasks.
Protify provides command-line utilities for listing models and datasets with detailed information:
To list all supported models with their descriptions:
# List all supported models
python -m src.protify.base_models.get_base_models --list
# To download standard models
python -m src.protify.base_models.get_base_models --downloadTo list all supported datasets with their descriptions:
# List all datasets
python -m src.protify.data.dataset_utils --list
# Get information about a specific dataset
python -m src.protify.data.dataset_utils --info ECFor a combined view of both models and datasets:
# List both models and datasets
python -m src.protify.resource_info --all
# List only standard models and datasets
python -m src.protify.resource_info --all --standard-only
# List only models
python -m src.protify.resource_info --models
# List only datasets
python -m src.protify.resource_info --datasetsYou can also access model and dataset information programmatically:
# For models
from src.protify.resource_info import model_descriptions
from src.protify.base_models.get_base_models import currently_supported_models, standard_models
# Get information about a specific model
model_info = model_descriptions.get('ESM2-150', {})
print(f"Model: ESM2-150")
print(f"Description: {model_info.get('description', 'N/A')}")
print(f"Size: {model_info.get('size', 'N/A')}")
print(f"Type: {model_info.get('type', 'N/A')}")
# For datasets
from src.protify.resource_info import dataset_descriptions
from src.protify.data.supported_datasets import supported_datasets
# Get information about a specific dataset
dataset_info = dataset_descriptions.get('EC', {})
print(f"Dataset: EC")
print(f"Description: {dataset_info.get('description', 'N/A')}")
print(f"Type: {dataset_info.get('type', 'N/A')}")
print(f"Task: {dataset_info.get('task', 'N/A')}")Models in Protify are generally grouped into the following categories:
- Protein Language Models: Pre-trained models that have learned protein properties from large-scale sequence data (e.g., ESM2, ProtBert)
- Baseline Controls: Models with random weights for comparison (e.g., Random, Random-Transformer)
Datasets are categorized by their task types:
- Multi-label Classification: Datasets where each protein can have multiple labels (e.g., EC, GO-CC)
- Classification: Binary or multi-class classification tasks (e.g., DeepLoc-2, DeepLoc-10)
- Regression: Prediction of continuous values (e.g., enzyme-kcat, optimal-temperature)
- Protein-Protein Interaction: Tasks focused on protein interactions (e.g., human-ppi, gold-ppi)
- Token-wise Classification/Regression: Residue-level prediction tasks (e.g., SecondaryStructure-3)