Automated annotation and curation of nitrogenase genes using profile hidden Markov models and conserved residue matching.
-
1_align_mafft.sh - Clusters (mmseqs easy-cluster) HMM seed sequences at 90% AAI. Aligns (MAFFT) clustered seeds sequences. Builds HMM profiles and combines into single file. Stores alignmnets in alignments/ and profiles in HMMs/.
-
2_hmmsearch.sh - Runs pHMM search against GTDB R220 all_rep_proteins_aa database (bacteria and archaea). Stores results as individual files in results/{archaea | bacteria} (per GTDB accession number)
-
3_parse_hmm.sh - Calls (1) Parse_hmm_results.py and (2) Parse_hmm_results.py.
-
4_conserved_res.sh - Calls (1) aln_nif_hits.py, (2) conserved-res.py, (3) final-fasta-export.py, and (4) diazoDB-check.py.
-
5_make_trees.sh -
-
6_operon_org.sh
-
7_SSN.sh
-
Parse_hmm_results.py - Parse pHMM search results (see below criteria), combines with GTDB taxonomy data, and stores results in hits.feather
- positive bit score
- full sequence evalue must be significant (<0.01)
- best domain evalue should be significant (<0.01)
- otherwise flagged for manual review to check if it is distant homolog or just short repeats
-
Parse_tophits.py
-
aln_nif_hits.py
-
cluster_pos.py
-
conserved-res.py
-
final-fasta-export.py
-
diazoDB-check.sh
-
get-operon.py
-
operon-org-plot.py