Inputs and outputs

milnus edited this page Nov 22, 2022 · 11 revisions

This page describes Magphi's input and output parameters in more depth.

Inputs

Genomes

Genome(s) of interest in which seed sequences should be used to extract or examine a genomic region. These can be in fasta format or GFF format with an appended genome. Input genomes can be gzipped, but gzipped and non-gzipped input files cannot be mixed.

Examples of input genomes:

Fasta:

>Sample_genome_1
TCAGTACGCTAGAAGTACGTCTGATGCATGTAGATCGACTGATGTGTC

GFF with appended genome:

##gff-version 3
Sample_genome_1 gene 5   15  .  +   0   locus_tag='gene_00001'
##FASTA
>Sample_genome_1
TCAGTACGCTAGAAGTACGTCTGATGCATGTAGATCGACTGATGTGTC
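
The rule that gzipped and non-gzipped genomes cannot be mixed can be checked before a run. The following is a minimal sketch, not Magphi's own code: the helper names `is_gzipped` and `check_uniform_compression` are hypothetical, and the check relies only on the two-byte gzip magic number at the start of each file.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def check_uniform_compression(paths):
    """Return True if all inputs are gzipped and False if none are.

    Raises ValueError when gzipped and non-gzipped inputs are mixed.
    Hypothetical helper for illustration only.
    """
    states = {is_gzipped(path) for path in paths}
    if len(states) > 1:
        raise ValueError("Gzipped and non-gzipped input genomes cannot be mixed")
    return states.pop() if states else False
```

Running such a check on the list of genome paths before starting Magphi would catch a mixed set of inputs early.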

Seed sequences

Seed sequences should be stretches of nucleotides (A, T, C or G) or amino acids given in a multi-fasta file. Each fasta entry should have a unique name, but the names should allow Magphi to identify the intended mate sequences for pairs. The easiest way to ensure this is to name each seed sequence pair and append either _1 or _2 to the individual sequences in the pair. Magphi will check that the given number of seed sequences is even, meaning each seed sequence has a mate to form a pair with. This dictates that a sequence used in multiple seed sequence pairs has to be duplicated and named appropriately (i.e. >phageA_2 and >spec_phage1 in the example below). Magphi will exit with an error if seed sequence pairing is unsuccessful and guide you to the problematic names. Example:

>phageA_1
CTGGTCGGTTGGTATATAGTGTTCGCACTG.....
>phageA_2
CACTTGTAACACAACCAGACCCCCCCGGAA.....
>spec_phage1
CACTTGTAACACAACCAGACCCCCCCGGAA.....
>spec_phage2
TGCAGTTCCCCATGGGGTGTAGGTGTAACG.....
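
The naming convention above can be checked with a short script before running Magphi. This is only a sketch of the idea and not Magphi's own pairing algorithm: it assumes that mates differ solely by a trailing 1 or 2, optionally preceded by an underscore, and the function name `pair_seeds` is hypothetical.

```python
import re
from collections import defaultdict


def pair_seeds(seed_names):
    """Group seed names into pairs by their shared base name.

    Assumption for this sketch: mates differ only by a trailing 1 or 2,
    optionally preceded by an underscore (phageA_1/phageA_2,
    spec_phage1/spec_phage2).
    """
    if len(set(seed_names)) != len(seed_names):
        raise ValueError("Seed names must be unique")
    groups = defaultdict(list)
    for name in seed_names:
        match = re.fullmatch(r"(.+?)_?([12])", name)
        if match is None:
            raise ValueError(f"No pair suffix found in seed name: {name}")
        groups[match.group(1)].append(name)
    unpaired = {base: names for base, names in groups.items() if len(names) != 2}
    if unpaired:
        raise ValueError(f"Seeds without exactly one mate: {unpaired}")
    return {base: tuple(sorted(names)) for base, names in groups.items()}


pairs = pair_seeds(["phageA_1", "phageA_2", "spec_phage1", "spec_phage2"])
print(pairs)
# {'phageA': ('phageA_1', 'phageA_2'), 'spec_phage': ('spec_phage1', 'spec_phage2')}
```

The pairs Magphi actually identified are reported in Seed_pairing.tsv, which remains the authoritative check.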

Max distance

-md sets the maximum distance allowed between the two seed sequences in a pair, or between a seed sequence and the end of a contig. It is good to have a rough idea of the maximum distance expected, since too large a maximum distance often results in poor extraction of the intended region, because seed sequences can then reach both ends of a contig in draft genomes. When working with draft genomes, decreasing the maximum distance to improve evidence levels should be done with caution, as this can extract regions other than the intended one. It is therefore best to use a maximum distance that is neither too small nor too large.
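
To illustrate why a very large maximum distance is a problem in draft genomes, the sketch below reports which contig ends lie within the maximum distance of a single seed hit. It is a simplified illustration and not Magphi's implementation; the function name `reachable_contig_ends` and the 1-based inclusive coordinates are assumptions made for the example.

```python
def reachable_contig_ends(seed_start, seed_end, contig_length, max_distance):
    """Return the contig ends that lie within max_distance of a seed hit.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    ends = []
    if seed_start - 1 <= max_distance:
        ends.append("start")
    if contig_length - seed_end <= max_distance:
        ends.append("end")
    return ends


# A seed hit at positions 4,000-4,100 on a 10,000 bp contig.
print(reachable_contig_ends(4000, 4100, 10_000, max_distance=2_000))   # []
print(reachable_contig_ends(4000, 4100, 10_000, max_distance=5_000))   # ['start']
print(reachable_contig_ends(4000, 4100, 10_000, max_distance=50_000))  # ['start', 'end']
```

With the largest value the seed reaches both ends of the contig, which is the situation described above as giving poor extraction.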

Include seeds

-ip can be used to include the sequence to which seed sequences are aligned. This is useful for some types of analysis, as it allows the researcher to have a common reference point across extractions.

Protein seed flag

-p can be given to alter the blast method used by Magphi. The default is blastn, but -p will change it to tblastn. tblastn will also be used if the input seeds given do not live up to nucleotide expectations (containing only A, T, G, and C). If Magphi changes from blastn to tblastn without being given the -p flag, a warning will be thrown, but the program will not stop.
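
The decision described above can be sketched as follows. This is an illustration of the described behaviour under stated assumptions, not Magphi's source code: the function name `choose_blast_program` and its `protein_flag` parameter are hypothetical stand-ins for the -p flag.

```python
import warnings

NUCLEOTIDES = set("ATGC")


def choose_blast_program(seed_sequences, protein_flag=False):
    """Pick 'blastn' or 'tblastn' for a collection of seed sequences.

    `protein_flag` mimics -p. Without it, tblastn is chosen (with a
    warning) when any seed contains characters other than A, T, G and C.
    """
    if protein_flag:
        return "tblastn"
    all_nucleotide = all(set(seq.upper()) <= NUCLEOTIDES for seq in seed_sequences)
    if not all_nucleotide:
        warnings.warn("Seeds do not look like nucleotides; switching to tblastn")
        return "tblastn"
    return "blastn"
```

Here `choose_blast_program(["ACGT"])` returns `"blastn"`, while a seed such as `"MKVLA"` returns `"tblastn"` and emits a warning without stopping.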

Output altering flags

Magphi will by default only output fasta and Gff files for complete stretches of sequence and annotations. This means that seed sequences connecting to contig breaks will not result in any fasta or Gff file outputs.

  • -b can be used to output fasta and Gff files for seed sequences connecting to contig breaks.
  • -n can be used to avoid any fasta or Gff files being output for any evidence level. This can be used for structural analysis where placement or content between seed sequences is important, or for screening of large datasets where output files take up too much memory.
  • -S can be used to stop the process of orienting output fasta and Gff files to have the first seed in a pair as the beginning of the sequence.

Outputs

Seed_pairing.tsv

Gives the common name identified for a seed sequence pair and the two seed sequences found belonging to the pair. Can be used to check if Magphi identified seed sequence pairs as intended by the user.

contig_hit_matrix.csv

Indicates the number of locations a pair of seed sequences were identified to map in a given genome by BLAST. Can be used to evaluate seed specificity.

inter_seed_distance.csv

Distance between merged seed sequences in a pair. Additional columns will be added for each seed sequence that is found to be next to the end of a contig. This output can be used to evaluate insertions of genetic material between seed sequences. It can further be used to determine if seeds have changed position in the genome, especially in complete genomes with a large maximum distance allowed.

annotation_num_matrix.csv

The number of annotated features extracted from a GFF file, if any can be found between merged seed sequences. This can be used to evaluate changes in gene/genetic feature content. nan will be inserted if there is no region found for a GFF file.

Evidence levels

master_seed_evidence.csv - Overview of how each seed sequence pair was found to map, merge, and extract for a specific genome. This is easily the most useful output for evaluating seed sequence quality and for getting a large-scale impression of output quality. The evidence levels are:

  0. One or no seed sequence hit the genome
  1. Multiple seed sequences hit with no overlap
  2. Multiple seed sequences hit with multiple connections
  3. Seed(s) deleted due to overlap or placed at end of contig and seeds are excluded
  4. Two seed sequences hit on separate contigs:
      • A. No connection
      • B. With connection, no annotations
      • C. With connection and annotations
  5. Two seed sequences hit on the same contig:
      • A. Not connected
      • B. Connected, no annotations
      • C. Connected with annotations

Expected fasta and Gff outputs:

  • -n will result in no outputs
  • 0-3: no outputs
  • 4 and 5:
      • A. No outputs
      • B. Fasta outputs (-b for 4B)
      • C. Fasta and annotation outputs (-b for 4C)
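
The rules above can be written as a small lookup, which may help when filtering master_seed_evidence.csv for pairs that should have produced files. This is a sketch derived only from the list above: the function `expected_outputs` and its parameters `b_flag` and `n_flag` are hypothetical, and letting -n take precedence over -b is an assumption.

```python
def expected_outputs(level, b_flag=False, n_flag=False):
    """Return the set of file types expected for an evidence level.

    `level` is a string such as '0', '3', '4B' or '5C'. `b_flag` and
    `n_flag` mimic -b and -n; -n winning over -b is an assumption.
    """
    level = level.strip().upper()
    number, letter = level[:1], level[1:]
    if number in {"0", "1", "2", "3"}:
        return set()
    if number not in {"4", "5"} or letter not in {"A", "B", "C"}:
        raise ValueError(f"Unknown evidence level: {level!r}")
    if n_flag or letter == "A":
        return set()
    if number == "4" and not b_flag:
        return set()  # contig breaks are only written with -b
    return {"fasta"} if letter == "B" else {"fasta", "gff"}


print(expected_outputs("5C"))               # fasta and gff
print(expected_outputs("4B"))               # set()
print(expected_outputs("4B", b_flag=True))  # {'fasta'}
```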

Fasta and Gff files

From Magphi v2.0.0, output fasta and Gff files are by default oriented so that the first seed in a pair is at the beginning of the output file, on the positive strand. This is done to make it easier to take the output fastas straight from Magphi and align them, without having to reverse, complement, or reverse-complement before aligning.
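
For outputs produced with -S, or by earlier versions, the same orientation can be applied by hand. The following is a minimal sketch of reverse-complementing a region when the first seed sits at its end; it is not Magphi's own code, and the names `reverse_complement` and `orient_region` are hypothetical. It handles only A, C, G and T in either case and leaves other characters unchanged.

```python
# Translation table for complementing A, C, G and T in either case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def orient_region(region, first_seed_at_start):
    """Return the region with the first seed at its beginning.

    `first_seed_at_start` is a flag the caller must supply, for example
    from the alignment coordinates of the first seed.
    """
    return region if first_seed_at_start else reverse_complement(region)


print(orient_region("AACG", first_seed_at_start=False))  # CGTT
```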

Exit statuses

  1. Problem with input file
  2. Problem with input to command line
  3. Problem with dependency