Official implementation for sequence-only peptide developability screening with GQP.
Designing peptides for microplastic targeting is intrinsically multi-objective: sequence motifs that promote adsorption to hydrophobic polymers frequently elevate developability risks, including hemolysis, non-specific adsorption, and poor aqueous solubility. In this paper, we show that accurate developability screening can be achieved from sequence alone by focusing on the readout that converts token-level foundation model representations into peptide-level decisions. We introduce gated query pooling (GQP), a lightweight, backbone-agnostic evidence-selection head that learns a small set of query vectors to extract complementary signals from protein language model embeddings and gates them adaptively per peptide. Across three developability tasks, GQP paired with sequence-only backbones attains 91.09% accuracy for hemolysis, 86.30% for non-fouling, and 75.56% for solubility, matching or exceeding representative sequence-only and AlphaFold-structure-augmented Multi-Peptide baselines while providing superior performance under limited labeled data. Beyond predictive accuracy, attention diagnostics and controlled counterfactual substitutions enable residue-level, testable design rules that connect model outputs to actionable sequence edits. Finally, integrating these developability constraints with PepBD-derived affinity scores for polyethylene, polypropylene, and polyethylene terephthalate supports scalable multi-objective prioritization of microplastic-binding candidates and reveals non-fouling as a dominant feasibility bottleneck. Coarse-grained molecular dynamics triage provides complementary physical evidence for the plausibility of the PepBD-prioritized selections.
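To make the pooling idea above concrete, here is a minimal pure-Python sketch of one way a gated query pooling readout could work: each query attends over the per-residue embeddings, and a sigmoid gate computed from the resulting summary decides how much that query contributes for this peptide. This is an illustration under assumptions, not the code in train.py: the function name `gated_query_pool`, the gate parameters `gate_w` and `gate_b`, the scaled dot-product scoring, and the random stand-in embeddings are all hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_query_pool(tokens, queries, gate_w, gate_b):
    """Pool L token embeddings (each of size d) into one vector of size d.

    queries: K query vectors (learned in a real model, random here).
    gate_w, gate_b: hypothetical per-query gate weights and biases.
    Returns the pooled vector, the K attention maps, and the K gate values.
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    pooled = [0.0] * d
    attn_maps, gates = [], []
    for q, w, b in zip(queries, gate_w, gate_b):
        # Attention of this query over residues (scaled dot product).
        attn = softmax([dot(q, t) / scale for t in tokens])
        # Attention-weighted average of the token embeddings.
        summary = [sum(a * t[j] for a, t in zip(attn, tokens)) for j in range(d)]
        # Sigmoid gate computed from the summary, so it adapts per peptide.
        g = 1.0 / (1.0 + math.exp(-(dot(w, summary) + b)))
        pooled = [p + g * s for p, s in zip(pooled, summary)]
        attn_maps.append(attn)
        gates.append(g)
    return pooled, attn_maps, gates


# Toy usage: random vectors stand in for protein language model embeddings.
random.seed(0)
d, L, K = 8, 12, 4  # embedding size, peptide length, number of queries (arbitrary)
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
gate_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
gate_b = [0.0] * K

pooled, attn_maps, gates = gated_query_pool(tokens, queries, gate_w, gate_b)
```

In a setup like this, the per-query attention maps are the kind of quantity a residue-level attention diagnostic could aggregate, and the gate values show which queries were effectively switched on for a given peptide.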
Run inside GQP/:
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
Single-task training example (hemo):
python train.py train \
--train_json datasets/jsonl/hemo/train.jsonl \
--val_json datasets/jsonl/hemo/val.jsonl \
--out_dir runs/gqp_demo/hemo \
--esm_backend hf \
--esm_model facebook/esm2_t33_650M_UR50D \
--pool_type prompt \
--prompt_tokens 4 \
--train_backbone
Evaluation:
python train.py eval \
--model_dir runs/gqp_demo/hemo \
--test_json datasets/jsonl/hemo/val.jsonl
Batch run (hemo/sol/nf):
bash train_gqp.sh
Attention diagnostics:
python scripts/diagnostics/attention/eval_attn_char_stats.py \
--model_dir runs/gqp_demo/hemo \
--test_json datasets/jsonl/hemo/val.jsonl \
--out_dir outputs/attention/hemo
Main output:
frequency_weighted_attention_mass_difference.png
Controlled counterfactual substitutions (CSE):
python scripts/diagnostics/counterfactual/eval_ism_cse.py \
--model_dir runs/gqp_demo/hemo \
--jsonl datasets/jsonl/hemo/val.jsonl \
--out_dir outputs/cse/hemo
Main outputs:
controlled_substitution_effect_heatmap_<task>.png
residue_intervenability_barplot_<task>.png
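For intuition about what a controlled substitution effect measures, the sketch below scans every single-residue substitution of a peptide and records the change in a model score relative to the unmodified sequence. It is a generic single-substitution scan written for illustration; eval_ism_cse.py may compute and aggregate effects differently. The names `substitution_effects` and `toy_score` are hypothetical, and `toy_score` is a deliberately trivial stand-in (fraction of hydrophobic residues) for a trained model's predicted probability.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_score(seq):
    """Hypothetical stand-in for a trained model's score on a peptide."""
    hydrophobic = set("AILMFVW")
    return sum(r in hydrophobic for r in seq) / len(seq)


def substitution_effects(seq, score_fn):
    """Score change for every single-residue substitution of seq.

    Returns a dict keyed by (position, original residue, substituted residue),
    with value score_fn(mutant) - score_fn(seq).
    """
    base = score_fn(seq)
    effects = {}
    for i, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue  # skip the identity substitution
            mutant = seq[:i] + aa + seq[i + 1:]
            effects[(i, wt, aa)] = score_fn(mutant) - base
    return effects


# Toy usage on an arbitrary five-residue example sequence.
effects = substitution_effects("GLFKA", toy_score)
```

Arranging these values as a position-by-residue grid gives a heatmap-style view, and summarizing them per position indicates which residues are most sensitive to editing.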
@misc{chen2026gqp_peptide_developability,
title = {Rethinking Peptide Developability with Sequence-Only Models: Interpretable Screening of Microplastic-Binding Peptides with Gated Query Pooling},
author = {Guangyao Chen and Fengqi You},
year = {2026},
}