Database request: Oncorhynchus nerka sockeye salmon #628

@bensutherland

Database requests

  1. Organism name: Oncorhynchus nerka (https://www.ncbi.nlm.nih.gov/datasets/taxonomy/8023) (sockeye salmon)
  2. Link gene definition file (e.g. GTF / GFF / GenBank): https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/034/236/695/GCF_034236695.1_Oner_Uvic_2.0/GCF_034236695.1_Oner_Uvic_2.0_genomic.gtf.gz
  3. Link to Genome FASTA file/s: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/034/236/695/GCF_034236695.1_Oner_Uvic_2.0/GCF_034236695.1_Oner_Uvic_2.0_genomic.fna.gz
  4. Link to CDS FASTA file: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/034/236/695/GCF_034236695.1_Oner_Uvic_2.0/GCF_034236695.1_Oner_Uvic_2.0_cds_from_genomic.fna.gz
  5. Link to Protein FASTA file: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/034/236/695/GCF_034236695.1_Oner_Uvic_2.0/GCF_034236695.1_Oner_Uvic_2.0_protein.faa.gz

Thank you for any support on this request. I tried to build the database manually, but I was not able to get the build to complete without error.
Specifically, the sequences.fa, cds.fa, and genes.gtf files were read properly by the command:
java -Xmx20g -jar snpEff.jar build -gtf22 -v Oner_Uvic_2.0 2>&1 | tee Oner_Uvic_2.0.build
...but the build then failed with the following error at the data checking stage:

FATAL ERROR: No CDS checked. This might be caused by differences in FASTA file
transcript IDs respect to database's transcript's IDs.
Transcript IDs from database (sample):
        'unassigned_transcript_3780'
        'XM_029687664.2'
        'unassigned_transcript_3781'
        'unassigned_transcript_3778'
        'unassigned_transcript_3779'
        'unassigned_transcript_3784'
        'unassigned_transcript_3782'
        'unassigned_transcript_3783'
        'XM_065019568.1'
        'XM_029661614.2'
        'XM_029627598.2'
        'XR_003863663.2'
        'XR_003863664.2'
        'XM_065018188.1'
        'XM_029627603.2'
        'XM_029627600.2'
        'XM_065007313.1'
        'XM_029627601.2'
        'XM_029627599.2'
        'XM_029659812.2'
        'XM_029627610.2'
        'XM_029627611.2'
Transcript IDs from database (fasta file):
        'lcl|NC_088413.1_cds_XP_064859648.1_41028'
        'lcl|NC_088415.1_cds_XP_029477646.1_45063'
        'lcl|NC_088404.1_cds_XP_064878381.1_19670'
        'lcl|NC_088419.1_cds_XP_064865376.1_56512'
        'lcl|NC_088405.1_cds_XP_029525536.1_21965'
        'lcl|NC_088410.1_cds_XP_064858049.1_37672'
        'lcl|NC_088414.1_cds_XP_029503937.2_44172'
        'lcl|NC_088419.1_cds_XP_064864584.1_55919'
        'lcl|NW_027039711.1_cds_XP_064871287.1_68469'
        'lcl|NC_088423.1_cds_XP_064869196.1_65209'
        'lcl|NC_088404.1_cds_XP_029524279.2_20779'
        'lcl|NC_088398.1_cds_XP_064862988.1_4603'
        'lcl|NC_088402.1_cds_XP_064876276.1_14724'
        'lcl|NC_088415.1_cds_XP_029478572.1_46073'
        'lcl|NC_088415.1_cds_XP_064861311.1_44650'
        'lcl|NC_088410.1_cds_XP_064857172.1_35158'
        'lcl|NC_088418.1_cds_XP_064864437.1_53888'
        'lcl|NC_088407.1_cds_XP_029530877.1_27482'
        'lcl|NC_088417.1_cds_XP_029482230.1_49588'
        '1_cds_XP_029508511'
        '1_cds_XP_029508512'
        '1_cds_XP_029508515'

Searching online, I see reports of similar errors. It appears that the CDS FASTA file uses long-form sequence IDs (for example, lcl|NC_088413.1_cds_XP_064859648.1_41028) that do not match the transcript IDs used by the GTF file (for example, XM_029687664.2), but unfortunately I have not been able to resolve this.
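To illustrate the mismatch, here is a minimal Python sketch of one possible way to rename the cds.fa headers to the transcript IDs in genes.gtf. It rests on assumptions that are not confirmed in this issue: that the CDS rows of the NCBI GTF carry both a transcript_id and a protein_id attribute, and that the protein accession embedded in each FASTA ID (the XP_... part) can be used as the lookup key. The function names are hypothetical and this has not been tested against the full Oner_Uvic_2.0 files.

```python
import re

# Assumed ID layout in the NCBI "cds_from_genomic" FASTA, e.g.
# "lcl|NC_088413.1_cds_XP_064859648.1_41028" -> protein "XP_064859648.1"
CDS_ID_RE = re.compile(r"_cds_([A-Z]{2}_\d+\.\d+)_\d+$")
# GTF attributes have the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def protein_from_cds_id(seq_id):
    """Return the protein accession embedded in a CDS FASTA ID, or None."""
    match = CDS_ID_RE.search(seq_id)
    return match.group(1) if match else None


def protein_to_transcript(gtf_lines):
    """Map protein_id -> transcript_id using the CDS rows of a GTF.

    Assumption: CDS rows carry both attributes; rows lacking either are skipped.
    """
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "protein_id" in attrs and "transcript_id" in attrs:
            mapping[attrs["protein_id"]] = attrs["transcript_id"]
    return mapping


def rename_cds_headers(fasta_lines, mapping):
    """Yield FASTA lines, replacing each mappable header with its transcript ID.

    Headers that cannot be mapped are passed through unchanged.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            transcript = mapping.get(protein_from_cds_id(seq_id))
            if transcript is not None:
                yield ">" + transcript + "\n"
                continue
        yield line
```

In use, both files would be opened as text (for example with gzip.open(path, "rt") for the .gz downloads), the mapping built from the GTF, and the renamed lines written to a new cds.fa. Whether this is enough for the snpEff CDS check to pass is not something I can confirm.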

Thank you very much for any help with this and please let me know if I can provide any additional information to help this genome be included within the snpEff genomic database.
Ben
