-
Notifications
You must be signed in to change notification settings - Fork 0
Tony Notes: v4.0.0
Taking data from /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/
(Raw sequences are in /cath/people2/ucbcdal/dfx_funfam2013_data/shared/used/domseq_data_gene3d_12_31072013
Can get list of v4.0.0 superfamilies with:
grep -Po '^\d+\.\d+\.\d+\.\d+' /cath/data/v4_0_0/release_data/CathNames | sort -V
v4.0.0 has 2,738 superfamilies.
The starting_clusters directory has 2,735 superfamilies, ie three missing superfamilies:
1.20.860.10 2.10.70.30 3.30.1710.10
Try to find any reference to them:
ssh cathstor1
find /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/ -name '*1.20.860.10*' -o -name '*2.10.70.30*' -o -name '*3.30.1710.10*'
Result is that they (only) appear here:
/cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/mappings/cath_4.0/target_mfasta/1.20.860.10.faa
/cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/mappings/cath_4.0/target_mfasta/2.10.70.30.faa
/cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/mappings/cath_4.0/target_mfasta/3.30.1710.10.faa
Which contains one .faa file for each superfamily. Commands to check that:
ls -1 /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/mappings/cath_4.0/target_mfasta/ -v | grep -P '\.faa$' | sed 's/\.faa$//g' > /tmp/mfasta_list.txt
grep -Po '^\d+\.\d+\.\d+\.\d+' /cath/data/v4_0_0/release_data/CathNames | sort -V > /tmp/all_sf_ids.txt
diff /tmp/all_sf_ids.txt /tmp/mfasta_list.txt
Looks like exc_starting_clusters are excluded starting clusters, perhaps the ones without any annotations?
cd /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/
ls -1 *start*/1.10.8.110/* | tr '/' ' ' | sort -V -k 3
To get the counts of clusters in the superfamilies:
cd /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/
find starting_clusters/ -iname '*.faa' | tr '/' ' ' | awk '{print $2}' | uniq -c | sort -g
Twenty biggest:
26289 3.40.50.300
8751 3.40.50.720
7764 3.40.50.620
7602 2.60.40.10
7231 3.20.20.70
7177 2.40.50.140
5684 3.40.50.150
4813 2.160.10.10
4572 3.30.420.40
4496 1.10.10.10
4123 3.30.930.10
3968 3.90.1570.10
3630 3.90.1150.10
3617 2.40.30.10
3386 2.130.10.10
3274 3.30.200.20
3228 1.10.510.10
2927 2.10.25.10
2919 3.40.640.10
2908 1.20.1250.20
To calculate the sum:
find starting_clusters/ -iname '*.faa' | tr '/' ' ' | awk '{print $2}' | grep -P '^\d+\.\d+\.\d+\.\d+$' | uniq -c | sort -g | awk '{the_sum += $1} END {print the_sum}'
The result is: 585,033
To also add the all-vs-all sizes:
cd /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/
find starting_clusters/ -iname '*.faa' | tr '/' ' ' | awk '{print $2}' | grep -P '^\d+\.\d+\.\d+\.\d+$' | uniq -c | sort -g | awk '{print $2 " " $1 " " ( ( $1 * ( $1 - 1 ) ) / 2 ) }' | column -t
...and same divide by (6000*3600):
cd /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/
find starting_clusters/ -iname '*.faa' | tr '/' ' ' | awk '{print $2}' | grep -P '^\d+\.\d+\.\d+\.\d+$' | uniq -c | sort -g | awk '{print $2 " " $1 " " ( ( $1 * ( $1 - 1 ) ) / 2 ) " " ( ( $1 * ( $1 - 1 ) ) / ( 2 * 5760 * 3600 ) ) }' | column -t
To get the total all-vs-all size:
cd /cath/people2/ucbcdal/dfx_funfam2013_data/projects/gene3d_12/
find starting_clusters/ -iname '*.faa' | tr '/' ' ' | awk '{print $2}' | grep -P '^\d+\.\d+\.\d+\.\d+$' | uniq -c | sort -g | awk '{print ( $1 * ( $1 - 1 ) ) / 2 }' | awk '{the_sum += $1} END {print the_sum}'
The result is: 791,487,405 (about 1.6 days at 5760/s). Of course, this is just the comparisons for the leaf nodes)