MMARINeDNA/taxonomic-reference-database

taxonomic-reference-database

This repo contains the code used by the Module 3 team to build reference databases for taxonomic classification of metabarcode sequence data. Reference sequences are pulled from NCBI GenBank using the rentrez package in R. Taxonomic classifications are then pulled from the Integrated Taxonomic Information System (ITIS) using the taxize package in R, based on the species name included in the title line of each sequence.

To use the R code build_reference_database_sedna.R:

You will need to edit four lines in the top code chunk, Load environment.

  • First, create an empty folder where you want the reference fastas to live, and add the path to the variable reference_fasta_file_location.
  • Second, add one or more taxa that you want to search to the variable organism.
  • Third, add one or more loci that you want to search to the variable locus.
  • Finally, include terms that you want to exclude in the vector exclude. Some suggestions are already included here.
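As a rough illustration, the edited chunk might end up looking something like the sketch below. The four variable names come from the script; every value shown (the path, the taxa, the loci, and the exclusion terms) is a hypothetical placeholder, not the script's defaults.

```r
# Load environment (illustrative values only; replace with your own)

# Empty folder that will receive the reference fasta files (hypothetical path)
reference_fasta_file_location <- "~/reference_database/fastas"

# One or more taxa to search (hypothetical examples)
organism <- c("Cetacea", "Phocidae")

# One or more loci to search (hypothetical examples)
locus <- c("12S", "16S")

# Terms to exclude from the results (hypothetical examples;
# the script already includes its own suggestions)
exclude <- c("UNVERIFIED", "predicted")
```

With two taxa and two loci, as here, the script would be expected to produce four fasta files, one per organism-locus combination.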

How it works:

Create fasta files: The second code chunk iterates over the entries supplied in organism and locus to generate a separate .fasta file for every organism-locus combination. It interfaces with NCBI GenBank via the Entrez database query system, and returns sequences titled with the GenBank accession number, the species scientific name, and additional information (e.g. locus).
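The iteration described above can be sketched roughly as follows. This is not the script's actual code: the query syntax (Entrez field tags with NOT clauses), the file naming scheme, and the helper names build_term and fetch_fasta are all assumptions made for illustration. The rentrez calls need the package installed and network access, so the loop that calls them is left commented out.

```r
organism <- c("Cetacea", "Phocidae")   # hypothetical taxa
locus    <- c("12S", "16S")            # hypothetical loci
exclude  <- c("UNVERIFIED")            # hypothetical exclusion term
reference_fasta_file_location <- tempdir()

# Every organism-locus combination, one row per output file
combos <- expand.grid(organism = organism, locus = locus,
                      stringsAsFactors = FALSE)

# Assumed query syntax: Entrez field tags, with each excluded
# term appended as a NOT clause
build_term <- function(org, loc, exclude) {
  term <- sprintf("%s[Organism] AND %s[Title]", org, loc)
  if (length(exclude) > 0) {
    nots <- paste(sprintf("NOT %s[Title]", exclude), collapse = " ")
    term <- paste(term, nots)
  }
  term
}

combos$term <- mapply(build_term, combos$organism, combos$locus,
                      MoreArgs = list(exclude = exclude),
                      USE.NAMES = FALSE)
combos$file <- file.path(reference_fasta_file_location,
                         paste0(combos$organism, "_", combos$locus, ".fasta"))

# Hypothetical helper: search GenBank, then download the hits as fasta.
# Requires the rentrez package and network access.
fetch_fasta <- function(term, file, retmax = 200) {
  hits <- rentrez::entrez_search(db = "nuccore", term = term, retmax = retmax)
  if (length(hits$ids) == 0) return(invisible(0L))
  fasta <- rentrez::entrez_fetch(db = "nuccore", id = hits$ids,
                                 rettype = "fasta")
  writeLines(fasta, file)
  invisible(length(hits$ids))
}

# Uncomment to run the searches:
# for (i in seq_len(nrow(combos))) fetch_fasta(combos$term[i], combos$file[i])
```

Building the full table of combinations first makes it easy to inspect the search terms and file names before any request is sent to NCBI.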

Acquire species taxonomy: The remaining code chunks pull scientific taxonomy and common name information from ITIS, based on the species name in the NCBI-supplied title of each sequence, and export it as a .csv file corresponding to each .fasta file. They then concatenate the scientific taxonomy and replace each sequence title line with the concatenated taxonomy, in the format needed for taxonomic classification by dada2 and/or QIIME2.
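The title rewriting step might look something like the sketch below. It assumes the species name occupies the second and third words of the title and that the target is a semicolon-delimited lineage, which is the general shape of dada2 training files; the exact ranks and delimiters the script emits may differ. The helper names and the example record are made up, and the lineage is typed by hand for illustration rather than retrieved from ITIS (in practice it would come from something like taxize::classification(species, db = "itis")).

```r
# Extract "Genus species" from a GenBank-style title, assuming the
# accession comes first and the binomial follows as two words
species_from_title <- function(title) {
  words <- strsplit(sub("^>", "", title), "\\s+")[[1]]
  paste(words[2:3], collapse = " ")
}

# Concatenate ranks into a semicolon-delimited fasta header
taxonomy_header <- function(ranks) {
  paste0(">", paste(ranks, collapse = ";"), ";")
}

# Made-up record for illustration
title <- ">MK000001.1 Tursiops truncatus mitochondrion, complete genome"
species <- species_from_title(title)

# Hand-typed lineage for illustration only; rank assignments returned
# by ITIS may differ
ranks <- c(kingdom = "Animalia", phylum = "Chordata", class = "Mammalia",
           order = "Cetacea", family = "Delphinidae", genus = "Tursiops",
           species = species)

new_title <- taxonomy_header(ranks)
print(new_title)
```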

IMPORTANT NOTES:

Keep in mind that this code is not completely generalized and may not fit your specific needs. It is also not set up to handle errors caused by variable data input, e.g. sequences uploaded to GenBank with incomplete or incorrectly spelled species names. We have added some code to remove as many sequences with incorrectly formatted species names as possible, but if one slips through it is likely to generate a 404 search error that will stop the code from running. If this happens, note which search terms generated the error and either remove sequences with those terms from the fasta file, or edit the code to remove sequences containing the offending pattern. For example, a species entered as T.truncatus instead of Tursiops truncatus will not be recognized as two separate words, so the ITIS search will use "T.truncatus" plus whatever word comes next in the title, e.g. "T.truncatus mitochondrial". This will create a 404 search error.
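One way to catch the offending pattern before it reaches ITIS is a simple shape check on the words that would be used as the species name. The sketch below is an assumption about how such a filter could be written, not the filter the script uses; has_valid_binomial is a hypothetical helper and both titles are made up.

```r
# TRUE when the second and third words of the title look like
# "Genus species": a capitalized word followed by a lowercase word
has_valid_binomial <- function(title) {
  words <- strsplit(sub("^>", "", title), "\\s+")[[1]]
  length(words) >= 3 &&
    grepl("^[A-Z][a-z]+$", words[2]) &&
    grepl("^[a-z-]+$", words[3])
}

# Made-up titles: one well formed, one with an abbreviated genus
titles <- c(
  ">MK000001.1 Tursiops truncatus mitochondrion, complete genome",
  ">MK000002.1 T.truncatus mitochondrial 12S ribosomal RNA gene"
)

keep <- vapply(titles, has_valid_binomial, logical(1), USE.NAMES = FALSE)

# Titles that would be dropped before the ITIS lookup
print(titles[!keep])
```

A check like this only tests the shape of the name. It cannot detect a misspelled but well-formed name, and when dropping a title from a fasta file the sequence lines that follow it must be dropped as well.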

To run this on Sedna:

Use the bash script included here to submit a job to SLURM on Sedna (or your preferred HPC). Be sure to change the time and memory requests to suit your job, change the job title and name of the log file, and, of course, change the email address where you want updates sent!

Note that in order to run this on Sedna (or any HPC), you must have the required R packages installed. If they are not installed, you can open an interactive R session using the command R, and use the command install.packages to install the missing packages. If you are running this on an HPC that is not Sedna, the module containing R may be called something different. Be sure to load the correct module so that you can run R.
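A quick way to see which packages are missing before submitting a job is sketched below. Only rentrez and taxize are named in this README; the script may load others, so treat required_packages as an assumed, possibly incomplete list. missing_packages is a hypothetical helper.

```r
# Packages named in this README; the script may need others as well
required_packages <- c("rentrez", "taxize")

# Return the subset of pkgs that cannot be loaded in this R installation
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required_packages)
print(to_install)

# Install only from an interactive session, as described above
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

On some HPC systems install.packages may prompt for a CRAN mirror or a personal library location the first time it is run.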
