How it works

Magphi will try to extract regions of interest given a set of genomes, either with or without annotation (Fasta or GFF3 format), and a set of seed sequences in multi-sequence Fasta format. Magphi is loosely based on the principle of tools like seqkit, but targets larger and less specific priming regions and is able to work on GFF files.

The first step is pairing the seed sequences given as input, to identify the pairs that should be used to define regions of interest.
The next step is to use BLAST to identify the locations of the seed sequence pairs within each given genome. By default Magphi tries to use blastn, but it will use tblastn if the input seeds contain characters other than A, T, G, and C, or if given the -p flag. After determining the locations of the seeds, bedtools is used to try to merge the locations of paired seed sequences based on a maximum distance set by the user (default 50 kb). If the seed sequence locations can be merged, the new region is used to extract the resulting sequence and, where annotation was provided, an annotation file.
The resulting Fasta and GFF files are divided between folders in the output directory, with one folder per input seed sequence pair. Other outputs include tables summarising the number of genomic locations found for each seed sequence pair, the distance between connected seed sequences, the number of annotations between connected seed sequences, how seed sequences were paired, and evidence levels.
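To make the BLAST program choice and the merging rule described above more concrete, here is a minimal sketch in plain Python. It is not Magphi's code: Magphi uses bedtools for the merge, and the function names, the `protein_flag` parameter, and the `(contig, start, end)` tuple layout are assumptions made only for this illustration.

```python
import re


def choose_blast_program(seed_sequences, protein_flag=False):
    """Pick 'blastn' or 'tblastn' following the rule described in the text.

    protein_flag stands in for the -p flag (hypothetical parameter name).
    """
    if protein_flag:
        return "tblastn"
    for sequence in seed_sequences:
        # Any character other than A, T, G, or C switches to tblastn.
        # Upper-casing first is an assumption of this sketch.
        if re.search(r"[^ATGC]", sequence.upper()):
            return "tblastn"
    return "blastn"


def merge_seed_hits(hit_a, hit_b, max_distance=50_000):
    """Merge two seed hits into one region if they are close enough.

    Each hit is a (contig, start, end) tuple. Returns the merged region,
    or None when the hits are on different contigs or too far apart.
    """
    contig_a, start_a, end_a = hit_a
    contig_b, start_b, end_b = hit_b
    if contig_a != contig_b:
        return None
    # Distance between the two hits; overlapping hits count as distance 0.
    gap = max(max(start_a, start_b) - min(end_a, end_b), 0)
    if gap > max_distance:
        return None
    return (contig_a, min(start_a, start_b), max(end_a, end_b))
```

For example, `merge_seed_hits(("contig_1", 100, 200), ("contig_1", 5000, 5100))` gives one region spanning both seeds, while the same call with `max_distance=1000` gives `None`.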

Handling contig breaks

In the event of seed sequences being placed on separate contigs, Magphi will attempt to 'connect' seed sequences by allowing connections between a seed sequence and a contig break, with a maximum distance equal to the maximum distance given by the user (default 50 kb). If such connections can be identified, with only one contig break to seed sequence connection per seed sequence, output file(s) can be obtained using -b, and their evidence level will be increased. Output files produced with -b can be useful, but should always be inspected or analysed critically to determine whether the correct hit has been obtained.

Seed sequences next to a contig break are named as follows:
genome-seed_sequence_1_break.fasta
genome-seed_sequence_2_break.fasta
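The idea of connecting a seed sequence to a contig break can be sketched as below. This is only an illustration of the distance rule, not Magphi's implementation; the function name, the dictionary keys, and the 0-based, end-exclusive coordinates are assumptions of the sketch.

```python
def regions_to_contig_break(contig_length, seed_start, seed_end, max_distance=50_000):
    """Return the regions linking a seed hit to nearby contig ends.

    Coordinates are assumed to be 0-based and end-exclusive.
    A contig end counts as 'nearby' when it lies within max_distance
    of the seed hit.
    """
    regions = {}
    # Distance from the start of the contig to the start of the seed.
    if seed_start <= max_distance:
        regions["contig_start"] = (0, seed_end)
    # Distance from the end of the seed to the end of the contig.
    if contig_length - seed_end <= max_distance:
        regions["contig_end"] = (seed_start, contig_length)
    return regions
```

A result with exactly one entry corresponds to the case described above, where a seed sequence has a single connection to a contig break. On short contigs both ends may be within range, which is the ambiguous situation.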

Merging seed sequences into pairs

Names of seed sequences are used to merge them into pairs. This is done iteratively by searching for the longest common name between seed sequences that have not already been paired. If a unique mate for a given seed sequence cannot be identified in the remaining set of seed sequence names, the seed is returned to the remaining set and the next one is searched for. This process can be repeated up to 1000 times before the pairing of seed sequences is stopped.

The best way of naming seeds is to choose a name and then append _1 and _2 for the two seed sequences in a pair. This should result in a pool of unique seed sequence names that do not collide and cause problems during the pairing process. It is, however, a good idea to quickly check seed_pairing.tsv after runs with multiple seeds that have not been used previously.
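The pairing loop described above might look roughly like the sketch below. It assumes that "longest common name" can be approximated by the longest shared prefix of two names, which may differ from what Magphi does internally; the function name and return format are also hypothetical.

```python
import os


def pair_seed_names(names, max_rounds=1000):
    """Pair seed names by their longest shared prefix.

    Returns (pairs, unpaired). A seed is only paired when exactly one
    remaining name shares the longest prefix with it; otherwise it is
    put back into the pool and the next seed is tried.
    """
    remaining = list(names)
    pairs = []
    rounds = 0
    while len(remaining) > 1 and rounds < max_rounds:
        rounds += 1
        seed = remaining.pop(0)
        # Length of the shared prefix between the seed and every other name.
        scores = [len(os.path.commonprefix([seed, other])) for other in remaining]
        best = max(scores)
        mates = [other for other, score in zip(remaining, scores) if score == best]
        if best > 0 and len(mates) == 1:
            # Unique mate found: record the pair and remove the mate.
            remaining.remove(mates[0])
            pairs.append((seed, mates[0]))
        else:
            # No unique mate: return the seed to the pool and move on.
            remaining.append(seed)
    return pairs, remaining
```

With names following the _1 and _2 convention, such as `phageA_1`, `phageA_2`, `phageB_1`, `phageB_2`, each seed has a single best mate. Three names sharing one stem (for example `a_1`, `a_2`, `a_3`) leave all three unpaired, which is the kind of collision the naming advice is meant to avoid.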

How input genomes are recognised

Fasta files

Fasta files are recognised by having a > followed by at least one additional character in the first line. Files are also checked for empty lines in the middle of the file, as these are not allowed in Fasta files.
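A check along these lines could be written as follows. This is a sketch of the rule as described, not Magphi's own code; the function name is hypothetical, and tolerating empty lines at the very end of the file is an assumption.

```python
def looks_like_fasta(lines):
    """Check the two Fasta rules described in the text.

    lines is any iterable of text lines, for example an open file handle.
    """
    lines = [line.rstrip("\r\n") for line in lines]
    if not lines:
        return False
    # First line must be '>' followed by at least one more character.
    first_line = lines[0]
    if not (first_line.startswith(">") and len(first_line) > 1):
        return False
    # Trailing empty lines are tolerated here (assumption of this sketch).
    while lines and lines[-1].strip() == "":
        lines.pop()
    # Any empty line left is in the middle of the file, which is rejected.
    return all(line.strip() != "" for line in lines)
```

It can be applied to a file with `with open(path) as handle: looks_like_fasta(handle)`.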

GFF files

GFF files are checked for the presence of ##gff-version 3 as the first line, and of a ##FASTA line separating the annotation lines from the genome sequence.
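The two GFF checks can be illustrated in the same way. Again this is a hedged sketch under the rules stated above, with a hypothetical function name, and not the code Magphi runs.

```python
def looks_like_gff3_with_genome(lines):
    """Check for a GFF3 header line and a ##FASTA separator line.

    lines is any iterable of text lines, for example an open file handle.
    """
    lines = [line.rstrip("\r\n") for line in lines]
    if not lines:
        return False
    # The version pragma must be the first line of the file.
    has_header = lines[0].startswith("##gff-version 3")
    # ##FASTA marks where annotations end and the genome sequence begins.
    has_fasta_section = any(line.strip() == "##FASTA" for line in lines[1:])
    return has_header and has_fasta_section
```

A GFF3 file that lacks the ##FASTA section fails this check, since there would be no genome sequence to search for seeds.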

Disagree with how files are recognised? Have I violated or overlooked a format feature of use? Please let me know, I am happy to learn!

Using complete genomes

As the Fasta and GFF formats do not have a robust way of indicating whether a genome is circularised or not, Magphi handles all genomes or stretches of genomes as being linear. This may cause issues when dealing with regions around the origin of replication.