scripts used to analyse horizontal transfer and evolution of transposable elements in 307 vertebrate species

HeloiseMuller/HTvertebrates

 
 


HTvertebrates

Scripts used in "Phylogenetic relatedness rather than aquatic habitat fosters horizontal transfer of transposable elements in animals" by Héloïse Muller, Rosina Savisaar, Jean Peccoud, Sylvain Charlat, Clément Gilbert (doi: 10.1101/gr.280432.125)

This is a branch of the scripts used in "Horizontal transfer and evolution of transposable elements in vertebrates" by Hua-Hao Zhang, Jean Peccoud, Min-Rui-Xuan Xu, Xiao-Gu Zhang, Clément Gilbert (doi: 10.1038/s41467-020-15149-4).

These scripts are publicly available to indicate how parts of the analysis were automated. There is no guarantee regarding their use. For those who want to use the pipeline, see below:

Requirements

The pipeline was not tested with other versions of the above programs, but more recent versions probably work.

Hardware requirements: a Linux cluster with

  • ≥200 CPUs
  • ≥0.5 TB of system memory
  • ≥2 TB of free hard drive space
  • an internet connection to download the genome sequences (>10 MB/s is recommended).
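
As a rough illustration, the thresholds above can be compared with what a single node reports. This is only a sketch: the function names check_min and report_hardware are hypothetical and not part of the repository, the check looks at one machine rather than a whole cluster, and it assumes a Linux system providing nproc and /proc/meminfo.

```shell
# Minimums copied from the list above (0.5 TB and 2 TB expressed in GB).
MIN_CPUS=200
MIN_MEM_GB=500
MIN_DISK_GB=2000

# Print "ok" when the measured value ($1) reaches the minimum ($2), "LOW" otherwise.
check_min() {
    if [ "$1" -ge "$2" ]; then echo "ok"; else echo "LOW"; fi
}

# Report CPU count, total memory and free space of the current directory
# for the node this runs on. Values are in GiB, close enough for a rough check.
report_hardware() {
    cpus=$(nproc)
    mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
    disk_gb=$(df -Pk . | awk 'NR == 2 {print int($4 / 1024 / 1024)}')
    echo "CPUs: $cpus ($(check_min "$cpus" "$MIN_CPUS"))"
    echo "Memory (GB): $mem_gb ($(check_min "$mem_gb" "$MIN_MEM_GB"))"
    echo "Free disk (GB): $disk_gb ($(check_min "$disk_gb" "$MIN_DISK_GB"))"
}
```

Calling report_hardware from the directory where intermediate files will be written prints one line per resource. On a cluster, the CPU and memory totals would have to be summed over the nodes actually allocated.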

On this hardware, the pipeline should take a few months to complete on the 247 animal genomes, or on a dataset of similar size and taxonomic composition.

File description

The R scripts whose names start with numbers performed successive stages of the analysis. The purpose of each script is described by comments at the beginning of the script.

  • HTvFunctions.R contains functions required for the other scripts and is sourced automatically.
  • circularPlots.R contains functions used to draw Figure 2 of the paper.
  • The remaining scripts are launched via Rscript (for long, CPU-intensive tasks) from the scripts whose names start with numbers.

The directory additonal_files contains

  • Zhang2020, which contains files from the paper Zhang et al. (2020)
    • supplementary-data1-genomes_and_accessions.txt gives general information about the genomes and is used to download genome sequences from NCBI.
    • ftp_links.txt contains URLs to the genome sequences, also used to download genome sequences.
    • timetree.nwk is the timetree (Newick format) used throughout the analysis.
    • namedClades.txt is a table of major vertebrate clades in this tree, with their names and color codes used to make some of the paper's figures (these figures are generated by the scripts).
    • superF.txt gives the correspondence between RepeatModeler family codes (first column), TE class (second column) and more common TE superfamily names (third column). It is used in stages 15 and 16.
    • supplementary-data3-TEcomposition_per_species.txt is generated by the scripts and is provided with the paper, but we also provide it here to facilitate the reproduction of the results.
  • Muller, which contains files from the paper Muller et al. (2025)
    • Datasets correspond to the additional datasets of the paper. Because of size limitations, only the first 10,000 lines are included for DatasetS3 and S4, and Dataset S7 is not included. The complete datasets can be found at https://doi.org/10.5281/zenodo.15297793. DatasetS2 is essential to run the pipeline. It is the phylogenetic tree containing all the species of the dataset. Tips are named as follows: species_genus. The tree has to contain divergence times, although imprecise times are not a problem, as the detection of HT relies only on the topology. What matters is a correct topology with no polytomy.
    • metadata.tbl is an essential input for the pipeline. It is a tab-delimited table that has to contain at least these three columns:
      • assembly: the name of the assembly (GCA_xxx.1); the fasta should start with this name
      • species: species name written as "species genus"
      • dbBusco: the BUSCO database to use for this genome
      • Any additional columns are optional, such as taxonomic or ecological data
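
A quick sanity check of metadata.tbl before launching anything might look like the sketch below. It assumes that the file has a header row using exactly the column names listed above (assembly, species, dbBusco). The function name check_metadata_header is hypothetical and not part of the repository.

```shell
# Succeed if the tab-delimited header of file $1 contains the three
# required columns; otherwise print the missing ones and fail.
# Assumption: the first line of the file is a header with these exact names.
check_metadata_header() {
    awk -F '\t' '
        NR == 1 {
            checked = 1
            for (i = 1; i <= NF; i++) seen[$i] = 1
            n = split("assembly species dbBusco", required, " ")
            for (j = 1; j <= n; j++)
                if (!(required[j] in seen)) {
                    print "missing column: " required[j]
                    bad = 1
                }
            exit
        }
        # An empty file (no header at all) also counts as a failure.
        END { exit (bad || !checked) }
    ' "$1"
}
```

For instance, `check_metadata_header metadata.tbl && echo "header looks fine"` prints the confirmation only when all three columns are present. Extra columns are ignored, in line with the "at least" wording above.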

Scripts need the following in $PATH: seqtk, blastn

  • The directory demo_TeKaKs is provided to demo the script TEKaKs.R (see Demonstration of TEKaKs.R below), but is not required to run the pipeline.
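
Since the scripts expect seqtk and blastn on $PATH, it can be worth checking this before starting a long run. The following is a minimal sketch. The helper name missing_tools is hypothetical, and the check only tests that the commands can be found, not which versions are installed.

```shell
# Print the name of every tool given as argument that is not found on $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the two programs named in this README.
missing=$(missing_tools seqtk blastn)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "seqtk and blastn found"
fi
```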

Steps of the pipeline

STEP A : Download genomes

  • 01-downloadGenomes.R is the script used by Zhang et al. (2020) to download genomes from a file containing accession links.
  • 01-downloadGenomes_2.R.sh is the script used in Muller et al. (2025) to download the selected genomes from the list of all genomes available.

STEP B : Detection of similar TEs

  • 02-ExtractCopies.R contains comments explaining how to annotate TE copies. The script extracts these copies.
  • 03a-findDubiousTEs.sh and 03b-findDubiousTEs.R detect dubious TEs, i.e. elements annotated as TEs that might not be TEs.
  • 04-mapTE.R performs similarity searches between TE copies of pairs of species.

STEP C : Generate dS distribution under vertical inheritance

  • 05-coreGenedS.R includes the launch of 05bis_coreGenesdNdS.R
  • 05tris-FindNamesBusco.sh has to be run manually at line 333 of 05-coreGenedS.R.

STEP B & STEP C are independent; they can be run at the same time.
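
The dependency between these steps (B and C in parallel, D only once both have finished) can be pictured with shell job control. This is purely a sketch of the ordering: run_step_B, run_step_C and run_step_D are hypothetical placeholders, and since several scripts of the pipeline need manual intervention, the real steps cannot simply be wrapped this way.

```shell
# Hypothetical placeholders: replace the bodies with however each step is launched.
run_step_B() { echo "STEP B done"; }
run_step_C() { echo "STEP C done"; }
run_step_D() { echo "STEP D started"; }

# Start B and C in the background, wait for both, and start D only if both succeeded.
run_B_and_C_then_D() {
    run_step_B &
    pid_b=$!
    run_step_C &
    pid_c=$!
    status=0
    # "wait PID" returns the exit status of that background job.
    wait "$pid_b" || status=1
    wait "$pid_c" || status=1
    [ "$status" -eq 0 ] || return 1
    run_step_D
}
```

Waiting on each process ID separately, rather than using a bare `wait`, is what makes a failure in either step visible before STEP D is attempted.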

STEP D : Identify TE-TE hits resulting from HT

This step can be started only once both STEP B & STEP C are done.

  • 06-filterTEhits.R runs 06bis-filterTEblastnHits.R and 06tris-successiveBlastx.R.
  • Similarly, 07-dNdSAndHTTfilter.R runs 07bis-TEdNdS.R.
  • 07-08-Precision.R is an addition that was not present in Zhang et al. (2020). This script has to be run between scripts 07 and 08.

STEP E : Clustering

Here we used two independent methods for clustering:

  • Clustering per clade, similarly to what was done in Zhang et al. 2020.
  • Clustering per pair of species, developed for this study

Both methods have to start with 08-prepareClustering.R to prepare the clustering. At the beginning of the script, comment or uncomment the relevant lines to select the method(s) to use.
08-prepareClustering.R cannot be run from bash. Starting at line 177, one has to read the comments to continue.
ATTENTION: make sure you have added "WAIT" between each line, as explained at line 181, before running the bash scripts generated by 08-prepareClustering.R.

Then, there are different versions of scripts 09 and 10 depending on the method.
For clustering per clade:

  • 09-hitClusteringRound1_perClade.R launches 09bis-iterativeFirstClustering.R
  • 10-hitClusteringRound2_perClade.R launches 10bis_iterativeSecondClustering.R

For clustering per pairs of species:

  • 09-hitClusteringRound1_perPairs.R does not launch other scripts (the steps of iterativeFirstClustering are integrated in the script)
  • 10-hitClusteringRound2_perPairs.R does not launch other scripts either.

Regarding scripts 10, make sure to run 10-hitClusteringRound2_perClade.R before running 10-hitClusteringRound2_perPairs.R, as the steps they have in common are not repeated. In particular, 10-hitClusteringRound2_perPairs.R does not regenerate involvedProtOCC200dS05.self.out.

STEP F : Apply filters to keep confident hit groups only

11-hitGroupEvaluation.R can filter out low-confidence hit groups clustered per clade (in this case, uncomment method <- "perClade") or clustered per pairs of species (in this case, uncomment method <- "perPairs"). To apply it to both, the script has to be run twice.

STEP G : Analyses in common with Zhang et al. (2020)

Scripts 12 to 16 have to be run on the final output of script 10-hitClusteringRound2_perClade.R, i.e. on occ200HitGroup_perclade.txt.

  • 12-HTTeventCount.R estimates the minimum number of independent transfers that took place across the dataset.
  • 13-TEdNdSWithinGenomes.R computes dN and dS values for related copies within a genome.
  • 14-showHTTonTree.R plots a circular phylogeny of the whole dataset and shows the independent HTT events estimated in 12-HTTeventCount.R.
  • 15-testHTTexcess.R tests whether there is an excess of HTT in some taxonomic groups using a permutation approach.
  • 16-TEcompositionAndEvolution.R generates bar plots to illustrate the total number of copies annotated for each TE superfamily and the total number of HTT events they are involved in. It also compares dN/dS under vertical transmission (values generated in 13-TEdNdSWithinGenomes.R) and under HT.

STEP H : Analyses specific to Muller et al. (2025)

These scripts do not follow any numbering. These analyses have to be run on the clustering that was done independently for each pair of species, i.e. on occ200HitGroup_perPairs.txt.

  • test_lifeStyle.R tests for an excess of transfers in the aquatic habitat.
  • test_habitatSharing.R tests for an excess of transfers in a shared habitat (i.e. aquatic -> aquatic or terrestrial -> terrestrial).
  • test_phylogeneticProximity_global.R tests the effect of the phylogenetic proximity globally.
  • test_bayesian.R runs all Bayesian analyses.
    The following script is run on the other clustering approach (per clade):
  • test_TELosses tests for TE losses.

Adapting this pipeline to other datasets, hardware configuration, and automating all procedures requires modifications to the code. Some parts of the analysis were not automated.

Installation

In a bash-compatible terminal that can execute git, paste

git clone https://github.com/HeloiseMuller/HTvertebrates.git
cd HTvertebrates/

Demonstration of TEKaKs.R to compute pairwise Ka and Ks on transposable elements

We detail how to run TEKaKs.R on a demo dataset, but note that this script (like all the others) is not intended for use in any context other than the study associated with the paper.

This script computes Ka, Ks, as well as overall molecular distances on pairs of homologous transposable elements (TEs), based on HSPs between these TEs and on HSPs between TEs and proteins (HSPs are not generated by this script). See the paper's method section for a description of the approach.

The hardware requirement for this demo is a Mac/Unix/Linux computer with at least 8 GB of RAM, 1 GB of free hard drive space, and which is able to execute R 3.4+ in a terminal. A Windows computer cannot run the script, as some R functions are not supported under Windows (namely those of the parallel R package).

A Mac computer may have issues installing the igraph R package from sources, as macOS lacks a Fortran compiler. However, the igraph package may be installed manually by specifying NOT to install packages from sources (which is not possible to do via Rscript).

The other programs mentioned in the Requirements section need not be installed for this demo.

The demo_TeKaKs/ directory must be immediately within the HTvertebrates/ directory. It contains the following:

  • TEhitFile.txt is a file of TE-TE HSPs in typical blast tabular format, but only listing sequence names and HSP coordinates.
  • blastxFile.txt is a tabular file of TE-protein HSPs. The fields indicate the TE sequence name, the start and end coordinates of the HSP on this sequence (where start < end), the start coordinate of the HSP on the protein, and whether the TE sequence is aligned on the protein in reverse direction.
  • fastaFile.fas is a fasta file of the TE sequences whose names are in the two aforementioned files.

The nature of these files is also explained by comments in TEKaKs.R.

To run the demo, paste the following in the terminal session that you used to install the pipeline:

Rscript TEKaKs.R demo_TeKaKs/TEhitFile.txt demo_TeKaKs/blastxFile.txt demo_TeKaKs/fastaFile.fas demo_TeKaKs/output 2

where demo_TeKaKs/output is the output folder (automatically created) and 2 is the number of CPUs to use.

The script should run in less than 5 minutes on a standard desktop PC.

Results will be found in demo_TeKaKs/output/allKaKs.txt. This tabular file contains the following fields:

  • hit is an identifier for each HSP, which corresponds to the row index of each HSP in TEhitFile.txt.
  • ka, ks, vka and vks are the results of Ka and Ks computations (see the kaks() function of seqinr).
  • length is the length of the alignment on which the above were computed.
  • nMut is the number of substitutions in this alignment.
  • K80distance and rawDistance are molecular distances (according to Kimura 1980 or without any correction) between sequences in the HSP. These are computed before any of the processing required for the Ka Ks computations (the removal of certain nucleotides and codons, see the method section of the paper).
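
Once the demo has run, the Ka/Ks ratio of each HSP can be derived from this table. The sketch below is only an illustration, not part of the pipeline. It assumes that allKaKs.txt is tab-delimited with a header row carrying the field names listed above, and that missing values are written as NA. The function name kaks_ratio is hypothetical.

```shell
# Print hit, ka, ks and ka/ks for every row of file $1 where ks is positive.
# Columns are looked up by header name, so their order does not matter.
# Assumptions: tab-delimited file, header row, missing values written as NA.
kaks_ratio() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("hit" in col) || !("ka" in col) || !("ks" in col)) {
                print "expected columns hit, ka and ks in the header" > "/dev/stderr"
                failed = 1
                exit
            }
            next
        }
        {
            ka = $col["ka"]; ks = $col["ks"]
            # Skip rows without usable values (NA, or ks not strictly positive).
            if (ka == "NA" || ks == "NA" || ks + 0 <= 0) next
            printf "%s\t%s\t%s\t%.3f\n", $col["hit"], ka, ks, ka / ks
        }
        END { exit failed }
    ' "$1"
}
```

A typical call would be `kaks_ratio demo_TeKaKs/output/allKaKs.txt`. Rows with ks equal to zero are skipped rather than reported, because the ratio is undefined for them.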

Output of the pipeline published in Zhang et al. (2020)

More than 1TB of intermediate files are generated. The final output corresponds to results of the publication (please see the publication for their description).

  • Figure2.pdf, Figure3.pdf and Figure4.pdf are produced by scripts 14, 15 and 16 respectively. They correspond to figures of the main text.
  • FigureS1.pdf is generated at stage 5. It corresponds to supplementary figure 1.
  • FigureS2.pdf is generated at stage 11. It corresponds to supplementary figure 2.
  • FigureS[3-6].pdf are generated at stage 16. They correspond to supplementary figures 3-6.
  • tableS1.txt is generated at stage 15. It corresponds to supplementary table 1.
  • tableS2.txt is generated at stage 16. It corresponds to supplementary table 2.
  • supplementary-data3-TEcomposition_per_species.txt is generated at stage 2.
  • supplementary-data4-retained_hits.txt is generated at stage 12.

Output of the pipeline published in Muller et al. (2025)

The final output corresponds to results of the publication (please see the publication for their description).

  • Figure 2A and Figure S3 are produced in script 16-TEcompositionAndEvolution.R
  • Figure 2B is produced in script 14-showHTTonTree.R
  • Figure 3A and Figure S8 are produced in script test_lifeStyle.R
  • Figure 4A is produced in test_phylogeneticProximity_global.R
  • Figure 4B-G, Figure 5, Figure S6, Figure S10 and Figure S11 are produced in script test_bayesian.R
  • Figure S2 is produced in script 11-hitGroupEvaluation.R
  • Figure S7 is produced in script test_habitatSharing.R
  • Figure S9 is produced in script 15-testHTTexcess.R
  • Supplemental Dataset S3 is produced in script 12-HTTeventCount.R
  • Supplemental Dataset S4 is produced in script 11-hitGroupEvaluation.R
  • Supplemental Dataset S5 is produced in script test_lifeStyle.R
  • Supplemental Dataset S6 is produced in script test_bayesian.R
