This repository provides tools to cluster protein sequences into families using ESM embeddings, visualize the networks, and compare computed communities with known PFAM families.
Author: Burcin Acar
- Load protein sequences from a CSV file.
- Compute ESM embeddings using a locally downloaded model.
- Compute pairwise distances between sequences based on embedding similarities.
- Dynamically select a clustering threshold distance.
- Construct a graph network and detect communities using Louvain clustering.
- Save the output CSV with corresponding community assignments.
- Compare with PFAM families and save the final annotated CSV.
- Interactively inspect graph networks.
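The distance, threshold, and graph steps listed above can be illustrated with a minimal, standard-library-only sketch. It is not the repository's implementation: the function names, the toy 3-dimensional vectors, and the fixed threshold of 0.2 are all assumptions made for illustration (the actual pipeline selects the threshold dynamically), and connected components stand in for Louvain community detection, which would normally come from a graph library.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def build_graph(embeddings, threshold):
    """Connect two sequences when their cosine distance is below the threshold."""
    graph = {name: set() for name in embeddings}
    for a, b in combinations(embeddings, 2):
        if cosine_distance(embeddings[a], embeddings[b]) < threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group nodes reachable from each other (a simple stand-in for Louvain)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Toy vectors (hypothetical); real ESM embeddings are much higher-dimensional.
embeddings = {
    "protA": [1.0, 0.1, 0.0],
    "protB": [0.9, 0.2, 0.0],
    "protC": [0.0, 0.1, 1.0],
    "protD": [0.1, 0.0, 0.9],
}

graph = build_graph(embeddings, threshold=0.2)  # fixed threshold, an assumption
communities = connected_components(graph)
```

On a graph this sparse, connected components and modularity-based methods tend to agree; on denser graphs Louvain can split a single component into several communities, which is why the choice of threshold matters.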
- Clone the repository:
git clone https://github.com/acarbn/protein_pfam_clustering.git
cd protein_pfam_clustering
- Create and activate a virtual environment (optional but recommended):
python -m venv venv
# macOS / Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
- Install dependencies from requirements.txt:
pip install -r requirements.txt
Step 1: Download ESM Model
Downloads the tokenizer and the model weights.
python3 -m src.download_esm_model esm2_t33_650M_UR50D
Step 2: Cluster Protein Sequences
Outputs embeddings, distance matrix, communities, and network visualization in results/.
python3 -m src.main data/seqs.csv --model_name esm2_t33_650M_local
Step 3: Compare Communities with PFAM
Produces eval/<model_name>_proteins_with_pfam.csv with PFAM annotations.
python3 -m src.compare_communities_pfam esm2_t33_650M_local
.
├── data/ # Input CSVs and later generated embeddings
├── model/ # Local ESM models
├── results/ # Clustering outputs and plots
├── eval/ # PFAM annotations and comparison outputs
├── main.py # Main clustering script
├── helper.py # Helper functions
├── download_esm_model.py # ESM model downloader
├── compare_communities_pfam.py # PFAM comparison script
├── requirements.txt # Python dependencies
└── README.md
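
As a rough illustration of what the PFAM comparison in Step 3 could report, the sketch below computes, for each community, its most common PFAM family and the fraction of members that belong to it (a purity score). This is an assumption about one reasonable metric, not a description of what compare_communities_pfam.py actually does; the `community` and `pfam` field names and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def community_purity(rows):
    """Map each community to (dominant PFAM family, fraction of members in it)."""
    by_community = defaultdict(list)
    for row in rows:
        by_community[row["community"]].append(row["pfam"])

    summary = {}
    for community, families in by_community.items():
        family, count = Counter(families).most_common(1)[0]
        summary[community] = (family, count / len(families))
    return summary


# Toy rows standing in for the annotated CSV; the field names are hypothetical
# and the accessions are used only for illustration.
rows = [
    {"id": "protA", "community": 0, "pfam": "PF00069"},
    {"id": "protB", "community": 0, "pfam": "PF00069"},
    {"id": "protC", "community": 1, "pfam": "PF00071"},
    {"id": "protD", "community": 1, "pfam": "PF00071"},
    {"id": "protE", "community": 1, "pfam": "PF00069"},
]

summary = community_purity(rows)
for community, (family, purity) in sorted(summary.items()):
    print(f"community {community}: {family} (purity {purity:.2f})")
```

A purity near 1.0 means a community maps cleanly onto a single PFAM family, while lower values suggest the community mixes families or that the threshold was too permissive.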