Skip to content

Tony Notes: v4.0.0

Tony E Lewis edited this page May 30, 2017 · 2 revisions

Taking data from /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/

(Raw sequences are in /cath/people2/ucbcdal/dfx_funfam2013_data/shared/used/domseq_data_gene3d_12_31072013

Can get list of v4.0.0 superfamilies with:

grep -Po '^\d+\.\d+\.\d+\.\d+' /cath/data/v4_0_0/release_data/CathNames | sort -V

v4.0.0 has 2,738 superfamilies.

The starting_clusters directory has 2,735 superfamilies, ie three missing superfamilies:

1.20.860.10 2.10.70.30 3.30.1710.10

Try to find any reference to them:

ssh cathstor1
find /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/ -name '*1.20.860.10*' -o -name '*2.10.70.30*' -o -name '*3.30.1710.10*'

Result is that they (only) appear here:

/cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/mappings/cath_4.0/target_mfasta/1.20.860.10.faa
/cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/mappings/cath_4.0/target_mfasta/2.10.70.30.faa
/cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/mappings/cath_4.0/target_mfasta/3.30.1710.10.faa

Which contains one .faa file for each superfamily. Commands to check that:

ls -1 /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/mappings/cath_4.0/target_mfasta/ -v | grep -P '\.faa$' | sed 's/\.faa$//g' > /tmp/mfasta_list.txt
grep -Po '^\d+\.\d+\.\d+\.\d+' /cath/data/v4_0_0/release_data/CathNames | sort -V > /tmp/all_sf_ids.txt
diff /tmp/all_sf_ids.txt /tmp/mfasta_list.txt

Looks like exc_starting_clusters are excluded starting clusters, perhaps the ones without any annotations?

cd /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/
ls -1 *start*/1.10.8.110/* | tr '/' ' ' | sort -V -k 3

To get the counts of clusters in the superfamilies:

cd /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/
find starting_clusters/ -iname '*.faa' | tr '/' ' ' | awk '{print $2}' | uniq -c | sort -g

Twenty biggest:

26289 3.40.50.300
 8751 3.40.50.720
 7764 3.40.50.620
 7602 2.60.40.10
 7231 3.20.20.70
 7177 2.40.50.140
 5684 3.40.50.150
 4813 2.160.10.10
 4572 3.30.420.40
 4496 1.10.10.10
 4123 3.30.930.10
 3968 3.90.1570.10
 3630 3.90.1150.10
 3617 2.40.30.10
 3386 2.130.10.10
 3274 3.30.200.20
 3228 1.10.510.10
 2927 2.10.25.10
 2919 3.40.640.10
 2908 1.20.1250.20

To calculate the sum:

find starting_clusters/ -iname '*.faa' | tr '/' ' ' | awk '{print $2}' | grep -P '^\d+\.\d+\.\d+\.\d+$' | uniq -c | sort -g | awk '{the_sum += $1} END {print the_sum}'

The result is: 585,033

To also add the all-vs-all sizes:

cd /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/
find starting_clusters/ -iname '*.faa' | tr '/' ' ' | awk '{print $2}' | grep -P '^\d+\.\d+\.\d+\.\d+$' | uniq -c | sort -g | awk '{print $2 " " $1 " " ( ( $1 * ( $1 - 1 ) ) / 2 ) }' | column -t

...and same divide by (6000*3600):

cd /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/
find starting_clusters/ -iname '*.faa' | tr '/' ' ' | awk '{print $2}' | grep -P '^\d+\.\d+\.\d+\.\d+$' | uniq -c | sort -g | awk '{print $2 " " $1 " " ( ( $1 * ( $1 - 1 ) ) / 2 ) " " ( ( $1 * ( $1 - 1 ) ) / ( 2 * 5760 * 3600 ) ) }' | column -t

To get the total all-vs-all size:

cd /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/
find starting_clusters/ -iname '*.faa' | tr '/' ' ' | awk '{print $2}' | grep -P '^\d+\.\d+\.\d+\.\d+$' | uniq -c | sort -g | awk '{print ( $1 * ( $1 - 1 ) ) / 2 }' | awk '{the_sum += $1} END {print the_sum}'

The result is: 791,487,405 (about 1.6 days at 5760/s). Of course, this is just the comparisons for the leaf nodes)

Clone this wiki locally