04. Functional Annotation
Prokka was used for functional annotation; the script used for this is in the CODE folder as PROKKA_SLURM.sh. The analysis gave the following results:
- contigs: 9
- bases: 3154995
- CDS: 3127
- rRNA: 18
- tRNA: 70
- tmRNA: 1
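The counts above can be collected programmatically. The following is a minimal sketch, assuming the summary is available as plain text with one `name: value` pair per line, as in the list above; the function name `parse_prokka_summary` and the inline text are illustrative and not part of the original pipeline.

```python
def parse_prokka_summary(text):
    """Parse 'name: value' lines with integer values into a dict."""
    counts = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Skip lines without a colon or with a non-numeric value
        if sep and value.isdigit():
            # Also drop a leading list dash, if the text was copied from a list
            counts[key.strip(" -\t")] = int(value)
    return counts


# Inline copy of the reported numbers (assumed layout of the summary text)
summary = """\
contigs: 9
bases: 3154995
CDS: 3127
rRNA: 18
tRNA: 70
tmRNA: 1
"""

counts = parse_prokka_summary(summary)
for name, value in counts.items():
    print(f"{name}\t{value}")
```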
What types of features are detected by the software? Which ones are more reliable a priori?
Prokka detects the feature types listed above: protein-coding sequences (CDS), rRNAs, tRNAs and tmRNAs. A priori, the most reliable annotations are the rRNAs, tRNAs and tmRNAs, together with the CDS that are annotated as a known protein. These features match sequences whose function is already known, so the annotation is backed by prior evidence, which increases its reliability. The less reliable annotations are the hypothetical proteins: here the function is "guessed", so we cannot be as sure that the information is correct as for features substantiated by previous annotations.
How many features of each kind are detected in your contigs? Do you detect the same number of features as the authors? How do they differ?
The features detected in the contigs from this analysis are 3127 CDS, 18 rRNA, 70 tRNA and 1 tmRNA. Zhang et al. state that the genome is predicted to contain 3095 coding sequences, but do not report counts for the other feature types. This analysis therefore found 32 more coding sequences than the authors did.
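The comparison with the published count is simple arithmetic. A short sketch makes the absolute and relative difference explicit; the variable names are illustrative.

```python
ours_cds = 3127        # CDS predicted in this analysis
published_cds = 3095   # CDS count stated in Zhang et al.

diff = ours_cds - published_cds
relative = diff / published_cds * 100  # difference as % of the published count

print(f"difference: {diff} CDS ({relative:.2f}% of the published count)")
```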
Why is it more difficult to do the functional annotation in eukaryotic genomes?
To answer this question, one must first understand the differences between eukaryotic and prokaryotic genetic material. Prokaryotic genes are generally not interrupted by introns, and prokaryotic genomes are gene-dense, so most of the genome is coding sequence. In eukaryotes, genes are split into exons and introns, and a large part of the genome does not code for protein. In addition, eukaryotes have alternative splicing, a process in which different combinations of splice sites within a pre-mRNA are selected to produce differently spliced mRNAs. Through this, one gene can encode multiple proteins, which is generally not the case for prokaryotes. Because of the way eukaryotic genomes are built, there is usually a "sea of introns" in which a few exons hide, which makes the exons hard to find in practice. Another factor is that not all exons contain a start and a stop codon: internal exons have neither, so an algorithm that looks only for an open reading frame between AUG (start) and UAA, UAG or UGA (stop) will not find them.
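The effect of introns on a start-to-stop search can be shown with a toy example. The sketch below is a deliberately naive ORF scanner, not the method used by Prokka or by any eukaryotic gene finder, and the exon and intron sequences are invented for illustration. It works on DNA, so the codons are ATG and TAA, TAG, TGA rather than their RNA forms.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(dna):
    """Return the longest ATG...stop ORF on the forward strand (3 frames)."""
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                orf = dna[start:i + 3]         # close it at the first stop
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


# Invented toy gene: two exons separated by one intron (GT...AG)
exon1 = "ATGGCTGAAACC"
intron = "GTAAGTTAACTTTCAG"   # contains an in-frame TAA
exon2 = "GGCAAATTCTGA"

mature = exon1 + exon2            # what the spliced mRNA encodes
genomic = exon1 + intron + exon2  # what an annotator actually sees

print("spliced :", longest_orf(mature))
print("genomic :", longest_orf(genomic))
```

On the spliced sequence the scanner recovers the full coding sequence. On the genomic sequence it reads into the intron and stops at the stop codon found there, so the second exon is missed. Eukaryotic annotation therefore needs splice-site models or transcript evidence in addition to ORF detection.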
How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?
1359 genes are annotated as hypothetical proteins, which is about 43% of the 3127 predicted CDS.
To understand why genes can be annotated in this manner, one must first understand that annotation works by matching the predicted sequences against sequences in databases. When no match is found for something that has been predicted as a gene, it is annotated as a hypothetical protein instead. One way to tackle this is to consider that the assembly may not be good enough for the sequence to match other sequences. In that case one should re-assemble the data, either with a different assembler (if the one used may have been suboptimal for the sequence data obtained) or with completely different sequence data. Here we also have Nanopore and Illumina reads, which could be used to complement the current data. However, the assembly from these reads was found to be quite a bit worse than the PacBio assembly, so this might not be optimal here. In addition, one can search the genes against other databases to see whether any of them contains an annotated match.
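The number of hypothetical proteins can be counted directly from the annotation. This is a minimal sketch, assuming a GFF3 file in which each CDS line carries a `product=` attribute in the ninth column and any sequence data follows a `##FASTA` marker. The function name and the toy records are invented for illustration.

```python
def count_hypothetical(gff_text):
    """Return (total CDS, CDS whose product is 'hypothetical protein')."""
    total = hypothetical = 0
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break                      # sequences follow, no more features
        if not line or line.startswith("#"):
            continue                   # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue                   # only count protein-coding features
        total += 1
        # Column 9 holds ';'-separated key=value attributes
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if attrs.get("product") == "hypothetical protein":
            hypothetical += 1
    return total, hypothetical


# Invented toy records, not taken from the real annotation
toy_gff = "\n".join([
    "##gff-version 3",
    "contig_1\ttoy\tCDS\t10\t300\t.\t+\t0\tID=g1;product=hypothetical protein",
    "contig_1\ttoy\tCDS\t350\t900\t.\t-\t0\tID=g2;product=DNA gyrase subunit A",
    "contig_1\ttoy\ttRNA\t950\t1020\t.\t+\t.\tID=g3;product=tRNA-Ala",
    "contig_1\ttoy\tCDS\t1100\t1500\t.\t+\t0\tID=g4;product=hypothetical protein",
    "##FASTA",
    ">contig_1",
    "ATGC",
])

total, hypothetical = count_hypothetical(toy_gff)
print(f"{hypothetical} of {total} CDS are hypothetical "
      f"({hypothetical / total:.1%})")
```

Run on the real annotation file, the same ratio would correspond to the 1359 hypothetical proteins out of 3127 CDS reported above.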
How can you evaluate the quality of the obtained functional annotation?
Comparing the annotation with another annotation of the same genome gives a lot of information about its quality, especially if the reference annotation is curated, meaning that there is high confidence in it. The more similar one's annotation is to the reference, the better it can be assumed to be.
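One way to make such a comparison concrete is to count how often the two annotations assign the same product to the same gene. The sketch below assumes that both annotations have already been loaded into dictionaries keyed by gene coordinates; the function name, the keys and the product names are invented for illustration.

```python
def product_agreement(ours, reference):
    """Compare product names for genes present in both annotations.

    Both arguments map (contig, start, end, strand) -> product name.
    Returns (genes with identical product, genes shared by both).
    """
    shared = ours.keys() & reference.keys()
    same = sum(ours[k].lower() == reference[k].lower() for k in shared)
    return same, len(shared)


# Invented toy annotations
ours = {
    ("c1", 10, 300, "+"): "hypothetical protein",
    ("c1", 350, 900, "-"): "DNA gyrase subunit A",
    ("c1", 1100, 1500, "+"): "Elongation factor Tu",
}
reference = {
    ("c1", 10, 300, "+"): "Thioredoxin",
    ("c1", 350, 900, "-"): "DNA gyrase subunit A",
    ("c1", 1100, 1500, "+"): "elongation factor Tu",
    ("c1", 2000, 2600, "+"): "Catalase",
}

same, shared = product_agreement(ours, reference)
print(f"{same} of {shared} shared genes have the same product")
print(f"{len(reference) - shared} reference genes are missing from ours")
```

Exact coordinate matching and case-insensitive name comparison are simplifications. Real annotations often differ in start position or in the wording of product names, so a practical comparison would need looser matching.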
How comparable are the results obtained from two different structural annotation software tools?
Different software tools use different algorithms for structural annotation, so their results can differ, for example in the number of predicted genes or in the start positions chosen for them.
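How much two sets of gene calls differ can be quantified by sorting the calls into classes: identical calls, calls that share a stop codon but differ in start, and calls made by only one tool. This is a sketch under the assumption that each call is a `(contig, start, end, strand)` tuple with `start < end`; the coordinates are invented and do not come from any real tool output.

```python
def stop_site(gene):
    """Position of the 3' end: 'end' on the + strand, 'start' on the - strand."""
    contig, start, end, strand = gene
    return (contig, strand, end if strand == "+" else start)


def compare_gene_calls(calls_a, calls_b):
    """Classify the gene calls of tool A relative to those of tool B."""
    a, b = set(calls_a), set(calls_b)
    exact = a & b
    rest_a, rest_b = a - exact, b - exact
    stops_a = {stop_site(g) for g in rest_a}
    stops_b = {stop_site(g) for g in rest_b}
    # Same stop codon, different start: the tools agree on the gene,
    # but disagree on where translation begins
    same_stop = {g for g in rest_a if stop_site(g) in stops_b}
    return {
        "exact": len(exact),
        "same_stop_different_start": len(same_stop),
        "only_a": len(rest_a - same_stop),
        "only_b": sum(stop_site(g) not in stops_a for g in rest_b),
    }


# Invented toy gene calls from two hypothetical tools
tool_a = [("c1", 100, 400, "+"), ("c1", 500, 900, "+"),
          ("c1", 1000, 1300, "-"), ("c1", 2000, 2300, "+")]
tool_b = [("c1", 100, 400, "+"), ("c1", 530, 900, "+"),
          ("c1", 1000, 1330, "-"), ("c1", 3000, 3300, "+")]

result = compare_gene_calls(tool_a, tool_b)
print(result)
```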