Analyzing how attention patterns in Dayhoff protein language models align with structural contacts in protein 3D structures.
- Structure Processing: Downloads PDB files and extracts Cα coordinates
- Contact Map Generation: Computes binary contact maps based on distance thresholds (default: 8Å)
- Attention Extraction: Extracts attention patterns from specified layers of protein language models
- Correlation Analysis: Measures what proportion of high-attention pairs correspond to structural contacts
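As an illustration of the structure-processing step, the sketch below pulls Cα coordinates for one chain out of PDB-format text using the fixed column layout of ATOM records. The function name `extract_ca_coords` is hypothetical and not taken from this repository, which may rely on a dedicated PDB parser instead. Alternate locations and insertion codes are ignored here for brevity.

```python
def extract_ca_coords(pdb_text, chain_id=None):
    """Return [(x, y, z), ...] for the C-alpha atoms of one chain.

    If chain_id is None, the first chain that has a C-alpha atom is used
    (an assumption mirroring the "first valid chain" behaviour).
    """
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # keep only the first model of multi-model files
        if not line.startswith("ATOM"):
            continue  # skips HETATM, so calcium ions named CA are ignored
        # PDB fixed columns (1-based): atom name 13-16, chain 22, x/y/z 31-54.
        if line[12:16].strip() != "CA":
            continue
        chain = line[21]
        if chain_id is None:
            chain_id = chain
        if chain != chain_id:
            continue
        coords.append(
            (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        )
    return coords
```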
- Python 3.10+
- CUDA-capable GPU (recommended)
- Clone the repository
git clone https://github.com/yourusername/attention-contact-correlation.git
cd attention-contact-correlation
- Create conda environment
conda create -n prot python=3.10
conda activate prot
- Install PyTorch with CUDA support
For CUDA 11.8:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
For CUDA 12.1:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
- Install dependencies
pip install -r requirements.txt
- Create PDB list file
Create a file pdb_ids.py with your PDB IDs by running the parse_data.py script (you can change the number of parsed sequences directly in parse_data.py):
python parse_data.py
To calculate the proportion of high-attention token pairs that correspond to contacts, the following steps are taken:
- Extract the high-attention pairs (the cutoff is set with the --top_k_percent argument)
- Calculate the contact maps (residue pairs closer than 8.0 Å)
- Report the proportion of high attention pairs that are also in contact
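The three steps above can be sketched for a single attention head as follows. This is a minimal dependency-free illustration, not the repository's implementation, which presumably operates on GPU tensors. The function names are hypothetical, and the sketch assumes that `--top_k_percent 0.01` means the top 1% of all residue pairs (a fraction), which the text above does not state explicitly.

```python
def top_attention_pairs(attn, top_k_percent):
    """Return the set of (i, j) pairs with the highest attention weights.

    attn is a square list of lists for one head; top_k_percent is treated
    as a fraction of all L*L pairs (assumption: 0.01 means the top 1%).
    """
    n = len(attn)
    scored = [(attn[i][j], i, j) for i in range(n) for j in range(n)]
    scored.sort(reverse=True)
    k = max(1, int(len(scored) * top_k_percent))  # always keep at least one pair
    return {(i, j) for _, i, j in scored[:k]}


def contact_proportion(attn, contacts, top_k_percent=0.01):
    """Fraction of high-attention pairs that are also structural contacts."""
    pairs = top_attention_pairs(attn, top_k_percent)
    hits = sum(1 for i, j in pairs if contacts[i][j])
    return hits / len(pairs)
```

Repeating this for every head in every selected layer and averaging over structures would give one value per (head, layer) cell.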
python attn.py \
--top_k_percent 0.01 \
--data_sample 100 \
--save_dir results \
--batch_size 32

accelerate launch attn.py \
--top_k_percent 0.01 \
--data_sample 100 \
--save_dir results \
--batch_size 32 \
--model_id microsoft/Dayhoff-3b-GR-HM-c

| Argument | Type | Default | Description |
|---|---|---|---|
| --model_id | str | microsoft/Dayhoff-3b-GR-HM-c | Model identifier from HuggingFace |
| --top_k_percent | float | 0.01 | Top-k percent threshold for high-attention pairs |
| --data_sample | int | None | Number of PDB structures to analyze (None = all) |
| --save_dir | str | results | Directory to save output CSVs |
| --layer_indices | int+ | [0, 1, 2] | Which model layers to analyze |
| --contact_thresh | float | 8.0 | Distance threshold (Å) for defining contacts |
| --chain_id | str | None | Specific chain ID to analyze (None = first valid) |
| --batch_size | int | 8 | Number of sequences to process in parallel |
| --precision | str | fp16 | Model precision (fp16/fp32) |
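For readers who want to script against these options, the table maps naturally onto an `argparse` parser. The sketch below is reconstructed only from the table and is not the script's actual parser; `build_parser` is a hypothetical name, and restricting `--precision` to two choices is an assumption based on the "fp16/fp32" description.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the argument table above."""
    p = argparse.ArgumentParser(description="Attention vs. contact analysis (sketch)")
    p.add_argument("--model_id", type=str, default="microsoft/Dayhoff-3b-GR-HM-c")
    p.add_argument("--top_k_percent", type=float, default=0.01)
    p.add_argument("--data_sample", type=int, default=None)  # None = all structures
    p.add_argument("--save_dir", type=str, default="results")
    # "int+" in the table: one or more space-separated layer indices.
    p.add_argument("--layer_indices", type=int, nargs="+", default=[0, 1, 2])
    p.add_argument("--contact_thresh", type=float, default=8.0)
    p.add_argument("--chain_id", type=str, default=None)  # None = first valid chain
    p.add_argument("--batch_size", type=int, default=8)
    p.add_argument("--precision", type=str, choices=["fp16", "fp32"], default="fp16")
    return p
```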
Analyze specific layers with custom contact threshold:
python gp.py \
--layer_indices 0 5 10 15 \
--contact_thresh 10.0 \
--batch_size 16
Process a small sample for testing:
python gp.py \
--data_sample 10 \
--batch_size 4
Analyze a different model:
python gp.py \
--model_id microsoft/Dayhoff-170m-UR50 \
--save_dir results/dayhoff_170m
The script generates a CSV file for each model:
results/
└── Dayhoff-3b-GR-HM-c_prop_matrix.csv
CSV Structure:
- Rows: Attention heads (e.g., head_0, head_1, ...)
- Columns: Model layers (e.g., layer_0, layer_1, ...)
- Values: Proportion of high-attention pairs that are structural contacts (0.0 to 1.0)
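A matrix in this layout can be summarized with the standard library alone. The sketch below finds, for each layer, the head with the highest contact proportion. It assumes the first CSV column holds the head names (as an index column would), and `best_head_per_layer` is a hypothetical helper. The inline numbers are placeholders for illustration, not real results.

```python
import csv
import io


def best_head_per_layer(csv_file):
    """Map each layer column to (head_name, proportion) with the highest value."""
    reader = csv.reader(csv_file)
    header = next(reader)      # first cell is the index column, then layer names
    layers = header[1:]
    best = {}
    for row in reader:
        head = row[0]
        for layer, cell in zip(layers, row[1:]):
            value = float(cell)
            if layer not in best or value > best[layer][1]:
                best[layer] = (head, value)
    return best


# Placeholder values for illustration only, not real results.
sample = io.StringIO(
    ",layer_0,layer_1\n"
    "head_0,0.10,0.40\n"
    "head_1,0.25,0.30\n"
)
print(best_head_per_layer(sample))
```

For a real run, pass an open file handle for the generated CSV instead of the in-memory sample.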
Failed structures are logged to failed_pdb_ids.log:
1ABC Failed to extract sequence/contacts
2DEF Processing error: Invalid chain
The following Microsoft Dayhoff models are supported:
- microsoft/Dayhoff-3b-GR-HM-c
- microsoft/Dayhoff-3b-GR-HM
- microsoft/Dayhoff-3b-UR90
- microsoft/Dayhoff-170m-UR50-BRn
- microsoft/Dayhoff-170m-UR50-BRq
- microsoft/Dayhoff-170m-UR50-BRu
- microsoft/Dayhoff-170m-GR
- microsoft/Dayhoff-170m-UR90
- microsoft/Dayhoff-170m-UR50
- FP16 precision on GPU for memory efficiency
- Batch processing of sequences
- Periodic CUDA cache clearing
- GPU tensor operations throughout pipeline
A contact is defined as two Cα atoms within a distance threshold (default 8Å), excluding:
- Self-interactions (diagonal)
- Immediate neighbors (optional)
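The definition above can be written as a short function. This is a minimal sketch in plain Python so that it runs without dependencies; a real pipeline would more likely vectorize the distance computation on the GPU. The name `contact_map` and the `min_separation` parameter are hypothetical, introduced here to express the optional neighbor exclusion.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=1):
    """Binary contact map from C-alpha coordinates.

    Pairs with |i - j| < min_separation are excluded: min_separation=1
    drops only the diagonal, min_separation=2 also drops immediate neighbors.
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            # Strict inequality: "closer than" the threshold, in angstroms.
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = contacts[j][i] = True
    return contacts
```

Excluding immediate neighbors matters because consecutive Cα atoms sit roughly 3.8 Å apart and are therefore always within an 8 Å cutoff, which would inflate the contact count with trivially satisfied pairs.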
# Reduce batch size
python gp.py --batch_size 4
# Use smaller model
python gp.py --model_id microsoft/Dayhoff-170m-UR50
- Check internet connection
- Verify PDB IDs are valid (4-character codes)
- Check RCSB PDB server status
- Some proteins may have non-standard residues
- Check failed_pdb_ids.log for details
- These structures are automatically skipped
- Microsoft Research for the Dayhoff protein language models