"Sequence record does not appear to be DNA" when calculating ANI for vRhyme bins
Hello, my name is Min-uk Park, from Seoul National University (SNU).
I am currently working with viral genomes and used vRhyme for binning. As a result, the binned contigs were scaffolded together using a gap of 1500 'N's, which is the standard output format for vRhyme.
However, when I try to calculate ANI (Average Nucleotide Identity) with tools such as vclust or fastANI, I encounter the following error:
Error: The sequence record 'vRhyme_bin_233' does not appear to be DNA.
Here is an example of what my input FASTA sequence looks like. It consists of normal DNA bases separated by long stretches of 'N's:
>vRhyme_bin_233
AATGGCCCATGTCTGTCATTCGGATTTCCTCCGAAAACCCGGACCGGCTC...
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN... (1500 Ns)
...CGATTTTCTTTTACTTCCACACGAGAATCACCATATTCTCGCGATTTG...
I suspect the ANI tool is throwing this error because of the long strings of 'N's used for scaffolding.
My questions are:
How should I handle vRhyme bins for ANI calculation?
Is there a specific parameter in these tools to ignore the 'N's? Or is it recommended to split the scaffolds back into individual contigs by breaking them at the 'N' gaps before running the ANI analysis?
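In case splitting turns out to be the recommended route, here is a minimal sketch of what I have in mind. It is only an illustration, not a vRhyme or vclust feature: the function names (`read_fasta`, `split_at_gaps`, `split_fasta`), the `_contigN` naming scheme, and the default threshold of 1500 'N's (taken from the gap length described above) are all my own assumptions. I have not confirmed that the 'N' runs are what actually triggers the error.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first word of the header as the record ID.
            header = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_at_gaps(seq: str, min_gap: int = 1500) -> List[str]:
    """Split a scaffold at runs of at least `min_gap` N characters.

    Shorter runs of N (ordinary ambiguous bases) are left untouched.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    return [piece for piece in gap.split(seq) if piece]


def split_fasta(in_path: str, out_path: str, min_gap: int = 1500) -> None:
    """Write every scaffold in `in_path` as separate contigs to `out_path`."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for name, seq in read_fasta(src):
            pieces = split_at_gaps(seq, min_gap)
            for i, contig in enumerate(pieces, start=1):
                # Hypothetical naming scheme: <bin ID>_contig<index>
                dst.write(f">{name}_contig{i}\n{contig}\n")
```

For example, `split_fasta("vRhyme_bin_233.fasta", "vRhyme_bin_233.split.fasta")` (file names are placeholders) would write one record per contig. The obvious drawback is that ANI would then be computed per contig rather than per bin, which is part of why I am asking which approach is preferred.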
Any advice or recommended scripts to resolve this would be greatly appreciated!
Thank you.