Question about training a BERTax model for phylum to species taxonomy classification 

Hi! I have read your paper about BERTax; it is wonderful and very inspiring. I'm interested in training a BERTax model for my own application: predicting the phylum, class, order, family, genus, and species of a DNA sequence. Since I need to predict six labels, I plan to add three more taxonomy layers after the original BERTax taxonomy layers. I also need to use different training and testing datasets. Currently, my dataset looks like this:
species_1.fasta:
  >sequence_1
  ATCG...
  >sequence_2
  ATCG...
  ...

species_2.fasta:
  >sequence_1
  ATCG...
  >sequence_2
  ATCG...
  ...

species_n.fasta:
  >sequence_1
  ATCG...
  >sequence_2
  ATCG...
  ...
where each fasta file corresponds to one species and has a taxonomy label covering every rank from phylum to species. Each fasta file may contain more than one sequence of that species.
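To make the layout concrete, here is a minimal sketch of how I currently pair each sequence with its six labels. It uses only the Python standard library and is not based on the BERTax code. The names `RANKS`, `read_fasta`, `collect_records`, and the `taxonomy` dictionary (file stem mapped to a six-item label tuple) are my own assumptions for illustration.

```python
from pathlib import Path

# Assumed rank order for the six labels (hypothetical constant, not from BERTax).
RANKS = ("phylum", "class", "order", "family", "genus", "species")


def read_fasta(path):
    """Yield (header, sequence) pairs from one FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collect_records(fasta_dir, taxonomy):
    """Return one dict per sequence, labelled with the taxonomy of its file.

    `taxonomy` is a hypothetical mapping from file stem (e.g. "species_1")
    to a tuple of six labels in the order given by RANKS.
    """
    records = []
    for path in sorted(Path(fasta_dir).glob("*.fasta")):
        labels = dict(zip(RANKS, taxonomy[path.stem]))
        for header, sequence in read_fasta(path):
            records.append({"id": header, "sequence": sequence, **labels})
    return records
```

Every sequence in a file inherits the labels of that file, so a file with two sequences produces two records with identical taxonomy.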

**I have read your instructions on how to prepare the data for training. I think I should convert my data into this format:**
<img width="861" alt="Screen Shot 2023-06-01 at 3 44 03 PM" src="https://github.com/f-kretschmer/bertax_training/assets/61462983/9d444150-f47e-4bb8-a337-366646490b94">

I would be very grateful for any suggestions you could give me on this task!
