Database requests
- Organism name: Oncorhynchus nerka (https://www.ncbi.nlm.nih.gov/datasets/taxonomy/8023) (sockeye salmon)
- Link to gene definition file (e.g. GTF / GFF / GenBank): https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/034/236/695/GCF_034236695.1_Oner_Uvic_2.0/GCF_034236695.1_Oner_Uvic_2.0_genomic.gtf.gz
- Link to Genome FASTA file/s: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/034/236/695/GCF_034236695.1_Oner_Uvic_2.0/GCF_034236695.1_Oner_Uvic_2.0_genomic.fna.gz
- Link to CDS FASTA file: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/034/236/695/GCF_034236695.1_Oner_Uvic_2.0/GCF_034236695.1_Oner_Uvic_2.0_cds_from_genomic.fna.gz
- Link to Protein FASTA file: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/034/236/695/GCF_034236695.1_Oner_Uvic_2.0/GCF_034236695.1_Oner_Uvic_2.0_protein.faa.gz
Thank you for any support on this request. I tried to build the database manually, but I could not get the build to finish without error.
Specifically, sequences.fa, cds.fa, and genes.gtf were all read properly by the command:
java -Xmx20g -jar snpEff.jar build -gtf22 -v Oner_Uvic_2.0 2>&1 | tee Oner_Uvic_2.0.build
...but the build then failed with the following error at the data checking stage:
FATAL ERROR: No CDS checked. This might be caused by differences in FASTA file
transcript IDs respect to database's transcript's IDs.
Transcript IDs from database (sample):
'unassigned_transcript_3780'
'XM_029687664.2'
'unassigned_transcript_3781'
'unassigned_transcript_3778'
'unassigned_transcript_3779'
'unassigned_transcript_3784'
'unassigned_transcript_3782'
'unassigned_transcript_3783'
'XM_065019568.1'
'XM_029661614.2'
'XM_029627598.2'
'XR_003863663.2'
'XR_003863664.2'
'XM_065018188.1'
'XM_029627603.2'
'XM_029627600.2'
'XM_065007313.1'
'XM_029627601.2'
'XM_029627599.2'
'XM_029659812.2'
'XM_029627610.2'
'XM_029627611.2'
Transcript IDs from database (fasta file):
'lcl|NC_088413.1_cds_XP_064859648.1_41028'
'lcl|NC_088415.1_cds_XP_029477646.1_45063'
'lcl|NC_088404.1_cds_XP_064878381.1_19670'
'lcl|NC_088419.1_cds_XP_064865376.1_56512'
'lcl|NC_088405.1_cds_XP_029525536.1_21965'
'lcl|NC_088410.1_cds_XP_064858049.1_37672'
'lcl|NC_088414.1_cds_XP_029503937.2_44172'
'lcl|NC_088419.1_cds_XP_064864584.1_55919'
'lcl|NW_027039711.1_cds_XP_064871287.1_68469'
'lcl|NC_088423.1_cds_XP_064869196.1_65209'
'lcl|NC_088404.1_cds_XP_029524279.2_20779'
'lcl|NC_088398.1_cds_XP_064862988.1_4603'
'lcl|NC_088402.1_cds_XP_064876276.1_14724'
'lcl|NC_088415.1_cds_XP_029478572.1_46073'
'lcl|NC_088415.1_cds_XP_064861311.1_44650'
'lcl|NC_088410.1_cds_XP_064857172.1_35158'
'lcl|NC_088418.1_cds_XP_064864437.1_53888'
'lcl|NC_088407.1_cds_XP_029530877.1_27482'
'lcl|NC_088417.1_cds_XP_029482230.1_49588'
'1_cds_XP_029508511'
'1_cds_XP_029508512'
'1_cds_XP_029508515'
Searching online, I see similar reports. It appears that the CDS FASTA file uses long-form accession IDs (built around the XP_ protein accessions) that do not match the transcript IDs (XM_/XR_) used by the GTF file, but unfortunately I have not been able to resolve this.
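One possible way to reconcile the two ID sets is sketched below. It is only a sketch and has not been tested against this genome. It assumes that the CDS rows of the NCBI GTF carry both a `transcript_id` and a `protein_id` attribute, and that the CDS FASTA IDs follow the `lcl|<contig>_cds_<protein accession>_<index>` pattern seen in the log above. The function names `protein_to_transcript` and `rename_cds_headers` are hypothetical helpers, not part of snpEff.

```python
import re

# Matches NCBI-style CDS FASTA IDs such as
# lcl|NC_088413.1_cds_XP_064859648.1_41028 and captures the protein accession.
CDS_ID_RE = re.compile(r"_cds_([A-Z]{2}_\d+\.\d+)_\d+$")
# Matches key "value" attribute pairs in the ninth GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def protein_to_transcript(gtf_lines):
    """Map protein_id -> transcript_id from GTF CDS rows.

    Assumption: CDS rows carry both attributes; rows lacking either are skipped.
    """
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "protein_id" in attrs and "transcript_id" in attrs:
            mapping[attrs["protein_id"]] = attrs["transcript_id"]
    return mapping


def rename_cds_headers(fasta_lines, mapping):
    """Yield FASTA lines, replacing each header with its transcript ID.

    Headers that do not match the expected pattern, or whose protein
    accession is not in the mapping, are passed through unchanged.
    """
    for line in fasta_lines:
        if line.startswith(">"):
            seq_id = line[1:].split()[0]
            match = CDS_ID_RE.search(seq_id)
            if match and match.group(1) in mapping:
                yield ">" + mapping[match.group(1)] + "\n"
                continue
        yield line
```

Both functions take any iterable of text lines, so the gzipped NCBI files could be passed in via `gzip.open(path, "rt")` and the renamed lines written out as the cds.fa used for the build. Whether the renamed IDs then pass the CDS check would still need to be confirmed by rerunning the build command.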
Thank you very much for any help with this and please let me know if I can provide any additional information to help this genome be included within the snpEff genomic database.
Ben