Description
As I understand it, the cath-gemma starting clusters are currently generated in the following way:
- Get all sequences from the latest version of Gene3D (here it is v16) for each CATH superfamily
- Get all the UniProt accessions associated with these sequences
- Retrieve the GO annotations using the EBI Proteins API
- Cluster the sequences within each superfamily at 90% sequence identity with CD-HIT
- Identify which S90 clusters contain GO annotations with experimental evidence (i.e. not IEA, ND, NAS)
- If zero S90 clusters are retained due to having no GO experimental evidence, run again but include the GO evidence types IEA:UniProtKB-KW and IEA:UniProtKB-EC
This has resulted in 4742 CATH superfamilies (~77%) having 2 or more starting clusters. The rest (1377 of the 6119 in CATH v4.2) have 0 or 1 starting cluster(s).
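To make the filtering and fallback steps listed above concrete, here is a minimal Python sketch of how I understand them. It is not the actual cath-gemma code: the function names and the data layout (a dict mapping each S90 cluster ID to the set of GO evidence strings found among its members, with IEA sub-types written as `IEA:<source>`) are assumptions made purely for illustration.

```python
from typing import Dict, List, Set

# Evidence codes excluded by the filter, as listed in the description above.
EXCLUDED_CODES = {"IEA", "ND", "NAS"}
# IEA sub-types allowed back in by the fallback run.
FALLBACK_CODES = {"IEA:UniProtKB-KW", "IEA:UniProtKB-EC"}


def is_accepted(evidence: str, allow_fallback: bool) -> bool:
    """Return True if a single evidence string passes the filter."""
    base_code = evidence.split(":", 1)[0]
    if base_code not in EXCLUDED_CODES:
        return True
    # Excluded code: only accepted in the fallback run, and only for the
    # two named IEA sub-types.
    return allow_fallback and evidence in FALLBACK_CODES


def filter_clusters(clusters: Dict[str, Set[str]], allow_fallback: bool) -> List[str]:
    """Keep the clusters that have at least one accepted annotation."""
    return sorted(
        cluster_id
        for cluster_id, evidence_set in clusters.items()
        if any(is_accepted(e, allow_fallback) for e in evidence_set)
    )


def current_starting_clusters(clusters: Dict[str, Set[str]]) -> List[str]:
    """Filter strictly first; rerun with the fallback only if nothing is kept."""
    kept = filter_clusters(clusters, allow_fallback=False)
    if not kept:
        kept = filter_clusters(clusters, allow_fallback=True)
    return kept
```

With this layout, a superfamily such as `{"c1": {"IDA"}, "c2": {"IEA:UniProtKB-KW"}}` keeps only `c1`, whereas one with no experimental annotations at all keeps every cluster that carries a keyword or EC based IEA annotation.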
With the current method, I believe a superfamily may lose a large proportion of its S90 clusters if only a few of them have experimental evidence. However, if a superfamily loses all of its S90 clusters in this filtering step, it may regain a significant proportion of them once the IEA keyword evidence codes are included. This could lead to, for example, large superfamilies being represented by a few small, experimentally supported starting clusters, while small superfamilies are represented by a large proportion of IEA-based starting clusters with high sequence coverage.
I'm wondering whether it would be worth triggering the use of IEA keyword evidence codes when the proportion of starting clusters remaining falls below some threshold, rather than waiting until the number hits zero. This might allow more starting cluster data to be retained, and therefore give higher coverage of superfamily sequence space, which could help improve FunFam data coverage.
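As a rough illustration of the suggestion, the sketch below switches to the IEA keyword fallback whenever the fraction of S90 clusters kept by the strict filter drops below a cut-off. Everything here is hypothetical: the function names, the data layout (cluster ID mapped to a set of evidence strings such as `IDA` or `IEA:UniProtKB-KW`) and especially the default `min_fraction` of 0.1, which is a placeholder and not a recommended value.

```python
from typing import Dict, List, Set

EXCLUDED_CODES = {"IEA", "ND", "NAS"}
FALLBACK_CODES = {"IEA:UniProtKB-KW", "IEA:UniProtKB-EC"}


def keep_cluster(evidence_set: Set[str], allow_fallback: bool) -> bool:
    """True if any annotation in the cluster passes the evidence filter."""
    for evidence in evidence_set:
        if evidence.split(":", 1)[0] not in EXCLUDED_CODES:
            return True
        if allow_fallback and evidence in FALLBACK_CODES:
            return True
    return False


def proportional_starting_clusters(
    clusters: Dict[str, Set[str]],
    min_fraction: float = 0.1,  # hypothetical cut-off, to be tuned
) -> List[str]:
    """Use the IEA keyword fallback when too few S90 clusters survive."""
    if not clusters:
        return []
    strict = sorted(c for c, ev in clusters.items() if keep_cluster(ev, False))
    if len(strict) / len(clusters) >= min_fraction:
        return strict
    # Below the cut-off: rerun with the keyword and EC based IEA types
    # included, which also keeps every cluster from the strict run.
    return sorted(c for c, ev in clusters.items() if keep_cluster(ev, True))
```

Setting `min_fraction` to any value in (0, 1/N] for a superfamily with N clusters reproduces the current behaviour, since the fallback then only triggers when no cluster survives the strict filter. Larger values bring the fallback in earlier.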