Hi! I have read your paper about BERTax. It is wonderful and very inspiring. I'm interested in training a BERTax model for my own application: predicting the phylum, class, order, family, genus, and species of a DNA sequence. Since I need to predict six labels, I plan to add three more taxonomy layers after the original BERTax taxonomy layers. I also need to use my own training and testing datasets. Currently, my dataset looks like this:
species_1.fasta:
sequence_1
ATCG...
sequence_2
ATCG...
...
species_2.fasta:
sequence_1
ATCG...
sequence_2
ATCG...
...
species_n.fasta:
sequence_1
ATCG...
sequence_2
ATCG...
...
where each FASTA file corresponds to one species with a known taxonomy label (from phylum to species). Each file may contain more than one sequence of that species.
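To make the layout concrete, here is a minimal sketch of how these files could be read into one labelled record per sequence. It is not part of BERTax. The names `RANKS`, `read_fasta`, `collect_records`, and the `taxonomy` mapping (file stem to a six-rank tuple) are hypothetical, and it assumes standard FASTA headers that start with `>`.

```python
from pathlib import Path
from typing import Dict, Iterator, Tuple

# The six ranks I want to predict (hypothetical constant, not a BERTax name).
RANKS = ("phylum", "class", "order", "family", "genus", "species")


def read_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; assumes headers start with '>'."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                # Sequence lines may be wrapped, so collect and join them.
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collect_records(fasta_dir: str,
                    taxonomy: Dict[str, Tuple[str, ...]]) -> Iterator[dict]:
    """Pair every sequence with the taxonomy label of its species file.

    `taxonomy` maps a file stem (e.g. "species_1") to a tuple of six
    labels ordered as in RANKS. This mapping is an assumption about how
    the labels are stored.
    """
    for path in sorted(Path(fasta_dir).glob("*.fasta")):
        labels = dict(zip(RANKS, taxonomy[path.stem]))
        for header, sequence in read_fasta(path):
            yield {"file": path.name, "id": header,
                   "sequence": sequence, **labels}
```

Each yielded record carries the sequence together with all six labels, so it could then be written out in whatever input format the training script expects.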
I have read your instructions on how to prepare the data for training. I think I should convert my data into this format:

I would be very grateful for any suggestions you could give me on this task!