Question about training a BERTax model for phylum to species taxonomy classification  #10

@Steven-GUHK

Hi! I have read your paper on BERTax. It is wonderful and very inspiring. I'm interested in training a BERTax model for my own application: predicting the phylum, class, order, family, genus, and species of a DNA sequence. Since I need to predict six labels, I plan to add three more taxonomy layers after the original BERTax taxonomy layers. I also need to use different training and testing datasets. Currently, my dataset looks like this:
species_1.fasta:

sequence_1
ATCG...
sequence_2
ATCG...
...

species_2.fasta:

sequence_1
ATCG...
sequence_2
ATCG...
...

species_n.fasta:

sequence_1
ATCG...
sequence_2
ATCG...
...
where each FASTA file corresponds to one species, which has a taxonomy label (from phylum to species). Each FASTA file may contain more than one sequence of that species.
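To make the layout above concrete, here is a minimal sketch of how such a directory could be read into labeled records. It is not BERTax's own data pipeline. It assumes standard FASTA headers that start with `>`, and the `taxonomy` mapping (file stem to a dict with one entry per rank), the `RANKS` tuple, and all function names are hypothetical, introduced only for illustration.

```python
from pathlib import Path

# Hypothetical rank order for the six labels described in this issue.
RANKS = ("phylum", "class", "order", "family", "genus", "species")


def read_fasta(path):
    """Yield (header, sequence) pairs; assumes header lines start with '>'."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_labeled_sequences(fasta_dir, taxonomy):
    """Attach the six-rank label of each species file to all of its sequences.

    `taxonomy` is an assumed dict: file stem -> {rank: label} for every rank in RANKS.
    """
    records = []
    for path in sorted(Path(fasta_dir).glob("*.fasta")):
        labels = taxonomy[path.stem]
        for header, seq in read_fasta(path):
            record = {"id": header, "sequence": seq.upper()}
            record.update({rank: labels[rank] for rank in RANKS})
            records.append(record)
    return records


def build_class_indices(records):
    """Build one label-to-integer vocabulary per rank, e.g. for six output heads."""
    return {
        rank: {name: i for i, name in enumerate(sorted({r[rank] for r in records}))}
        for rank in RANKS
    }
```

Each record carries one sequence together with all six labels, so the same list can later be written out in whatever format the training script expects.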

I have read your instructions on how to prepare the data for training. I think I should convert my data into this format:
[Image: Screen Shot 2023-06-01 at 3 44 03 PM]

I would be very grateful for any suggestions about my task. Thank you!
