Scripts used in "Phylogenetic relatedness rather than aquatic habitat fosters horizontal transfer of transposable elements in animals" by Héloïse Muller, Rosina Savisaar, Jean Peccoud, Sylvain Charlat, Clément Gilbert (doi: 10.1101/gr.280432.125)
This is a branch of the scripts used in "Horizontal transfer and evolution of transposable elements in vertebrates" by Hua-Hao Zhang, Jean Peccoud, Min-Rui-Xuan Xu, Xiao-Gu Zhang, Clément Gilbert (doi: 10.1038/s41467-020-15149-4).
These scripts are publicly available to indicate how parts of the analysis were automated. There is no guaranty regarding their use. For those who want to use the pipeline, see below:
-
R 4.3+ with the following packages:
- data.table 1.16.2
- stringi 1.8.4
- matrixStats 1.4.1
- igraph 2.1.1
- ape 5.8
- seqinr 1.5.1
- RColorBrewer 1.1-3
- dplyr 1.1.4
- ggplot2 3.5.1
(If not found, these packages are installed automatically by the pipeline.)
-
RepeatMasker 4.1
-
BUSCO 5.4
-
MMseq2 13.4511
-
ncbi blast+ 2.12.0
-
diamond 2.0.7
-
seqtk 1.4
The pipeline was not tested with other versions of the above programs, but more recent versions probably work.
Hardware requirements: a linux cluster with
- ≥200 CPUs
- ≥0.5 TB of system memory
- ≥2 TB of free hard drive space
- an internet connection to download the genome sequences (>10 MB/s is recommended).
On this hardware, the pipeline on the 247 animal genomes should take a few months to complete with a similar taxa composition.
The R scripts whose names start with numbers performed successive stages of the analysis. The purpose of each script is described by comments at the beginning of the script.
HTvFunctions.Rcontains functions required for the other scripts and is sourced automatically.circularPlots.Rcontains functions used to draw Figure 2 of the paper.- The remaining scripts are launched via
Rscript(for long, CPU-intensive tasks) from the scripts whose names start with numbers.
The directory additonal_files contains
Zhang2020, which contains files from the paper Zhang et al. (2020)supplementary-data1-genomes_and_accessions.txtgives general information about the genomes and is used to download genomes sequences from ncbi.ftp_links.txtcontains URL to the genome sequences, also used to download genome sequences.timetree.nwkis the timetree (newick format) used through the analysis.namedClades.txtis a table of major vertebrate clades in this tree, with their names and color codes used to make some of the paper's figures (these figures are generated by the scripts).superF.txtmakes the correspondance between repeatModeler family codes (first column), TE class (2nd column) and more common TE superfamily names (3rd column). It is used in stages 15 and 16.supplementary-data3-TEcomposition_per_species.txtis generatd by the scripts and is provided with the paper, but we also provide it here if to facilitate the reproduction of the results.
Muller, which contains files from the paper Muller et al. (2025)- Datasets correspond to the additional datasets of the paper. Because of size limitations, only the first 10,000 lines are included for DatasetS3 and S4, and Dataset S7 is not including. All and whole datasets can be found at https://doi.org/10.5281/zenodo.15297793. DatasetS2 is essential to run the pipeline. It is the phylogenetic tree containing all the species of the dataset. The tips are as follow: species_genus. It has to contain the divergence time, although unprecisions are no big deal as the detection of HT is only based on the topology. The important is to have a correct topology and no polytomy.
- metadata.tbl is an essential input for the pipeline. It is a tab delimited table that has to contain at least these three columns:
- assembly: the name of the assembly (GCA_xxx.1); the fasta should start by this
- species: species name written as "species genus"
- dbBusco: the busco database to use for this genome
- Any additional columns, such as taxonomical or ecological data
Scripts need the following in $PATH:
seqtk, blastn
- The directory
demo_TeKaKsis provided to demo the scriptTEKaKs.R(seeDemonstration of TEKaKs.Rbelow), but is not required to run the pipeline.
01-downloadGenomes.Ris the script used by Zhang et al. (2020) to download genomes from a file containing accession links.01-downloadGenomes_2.R.shis the script used in Muller el al. (2025) to download the selected genomes from the list of all genomes available.
02-ExtractCopies.Rcomments explaining how to annotate TE copies. The script extract these copies.03a-findDubiousTEs.shand03b-findDubiousTEs.Rdetect dubious TE, ie elements annotated as TE that might not be TE04-mapTE.Rpermforms similarity search between TE copies of pairs of species
05-coreGenedS.Rinclude the launch of05bis_coreGenesdNdS.R05tris-FindNamesBusco.shhas to be run manually at line 333 of05-coreGenedS.R.
STEP B & STEP C are independent; they can be run at the same time.
This step can be started only once both STEP B & STEP C are done
06-filterTEhits.Rruns06bis-filterTEblastnHits.Rand06tris-successiveBlastx.R.- Similarly,
07-dNdSAndHTTfilter.Rruns07bis-TEdNdS.R.- 07-08-Precision.Ris an addition that was not present in Zhang et al. (2020). This script has to be run between scripts 07 and 08.
Here we used two independent methods for clustering:
- Clustering per clade, similarly to what was done in Zhang et al. 2020.
- Clustering per pair of species, developed for this study
Both methods have to start with 08-prepareClustering.Rto prepare the clustering. One has to comment, or uncomment, the method they want to use, or not use, at the beginning of the script.
08-prepareClustering.R cannot be run from bash. Starting line 177, one has to read the comments to continue.
ATTENTION Make sure you added "WAIT" between each line, as explained line 181, before running the bash scripts generated by 08-prepareClustering.R.
Then, there are different versions of scripts 09 and 10 depending on the method.
For clustering per clade :
09-hitClusteringRound1_perClade.Rlaunches09bis-iterativeFirstClustering.R10-hitClusteringRound2_perClade.Rlaunch10bis_iterativeSecondClustering.RFor clustering per pairs of species :09-hitClusteringRound1_perPairs.Rdoes not launch other scripts (step of iterativeFirstClustering are integrated in the script)10-hitClusteringRound2_perPairs.Rdoes not launch other scripts either.
Regarding scripts 10, make sure to run clustering 10-hitClusteringRound2_perClade.R before running 10-hitClusteringRound2_perPairs.R, as the steps in common are not repeated. Indeed, 10-hitClusteringRound2_perPairs.R does not regenerate involvedProtOCC200dS05.self.out.
11-hitGroupEvaluation.R can filter out hit groups of low confidence clusterized per clade (in this case, uncomment method <- "perClade") or clusterized per pairs of species (in this case, uncomment method <- "perPairs"). If one wants to run both, the scripts has to be run two times.
Scripts 12 to 16 have to be run on final output of script 10-hitClusteringRound2_perClade.R, i.e. on occ200HitGroup_perclade.txt.
12-HTTeventCount.Restimates the minimum number of independent transfers that took place across the dataset.13-TEdNdSWithinGenomes.Rcomputes dN and dS values for related copies within a genomes.14-showHTTonTree.Rplots a circular phylogeny of all the dataset and shows the independent HTT events estimated in12-HTTeventCount.R.15-testHTTexcess.Rtests whether there is an excess of HTT in some taxonomic groups thanks to a permutation approch.16-TEcompositionAndEvolution.Rgenerates bar plots to illustrate the total number of copies annotated for each TE superfamilies and the total number of HTT events they are involved in. It also compare the dNdS under vertical transmission (values generated in13-TEdNdSWithinGenomes.R) and under HT.
These scripts do not follow any numbering. Those analyses have to be run on the clustering that was done independently for each pairs of species, i.e. on occ200HitGroup_perPairs.txt
test_lifeStyle.Rtests for an excess of transfers in the aquatic habitat.test_habitatSharing.Rtests for an excess of transfers in a shared habitat (i.e. aquatic -> aquatic or terrestrial --> terrestrial )test_phylogeneticProximity_global.Rtests the effect of the phylogenetic proximity globally.test_bayesian.Rruns all Bayesian analyses.
The following is run on the other clustering approach:test_TELossestests for losses of TE.
Adapting this pipeline to other datasets, hardware configuration, and automating all procedures requires modifications to the code. Some parts of the analysis were not automated.
In a bash-compatible terminal that can execute git, paste
git clone https://github.com/HeloiseMuller/HTvertebrates.git
cd HTvertebrates/
We detail how to run TEKaKs.R on a demo dataset, but we remind that this script (as all others) is not intended for use in any other context than the study associated with the paper.
This script computes Ka, Ks, and well as overall molecular distances on pairs of homologous transposable elements (TEs), based on HSPs between these TEs and on HSPs between TEs and proteins (HSPs are not generated by this script). See the paper's method section for a description of the approach.
The hardware requirement for this demo is a Mac/Unix/Linux computer with at last 8GB of RAM, 1GB of free hard drive space, and which is able to execute R 3.4+ in a terminal. A Windows computer cannot run the script as some R functions are not supported under windows (namely those of the parallel R package).
A Mac computer may have issues installing the igraph R package from sources, as macOS lacks a fortran compiler. However, the igraph package may be installed manually by specifying NOT to install packages from sources (which is not possible to do via Rscript).
The other programs mentioned in the Requirements section need not be installed for this demo.
The demo_TeKaKs/ directory must be immediately within the HTvertebrates/ directory. It contains the following:
TEhitFile.txtis a file of TE-TE HSPs in typical blast tabular format, but only listing sequence names and HSP coordinates.blastxFile.txtis a tabular file of TE-protein HSPs. The fields indicate the TE sequence name, start and end coordinates of the HSP on this sequence (where start < end), start coordinate of the HSP on the protein and whether the TE sequence in aligned on the protein in reverse direction.fastaFile.fasis a fasta file of the TE sequences whose names are in the two aforementioned files.
The nature of these files is also explained by comments in TEKaKs.R.
To run the demo, paste the following in the terminal session that you used to install the pipeline:
Rscript TEKaKs.R demo_TeKaKs/TEhitFile.txt demo_TeKaKs/blastxFile.txt demo_TeKaKs/fastaFile.fas demo_TeKaKs/output 2
where demo_TeKaKs/output is the output folder (automatically created) and 2 is the number of CPUs to use.
The script should run in less than 5 minutes on a standard desktop PC.
Results will be found in demo_TeKaKs/output/allKaKs.txt. This tabular file contains the following fields:
hitis an identifier for each HSP, which corresponds to the row index of each HSP inTEhitFile.txt.ka,ks,vkaandvksare the results of Ka and Ks computations (see thekaks()function ofseqinr),lengthis the length of the alignment on which the above were computed.nMutis the number of substitutions in this alignment.K80distanceandrawDistanceare molecular distances (according to Kimura 1980 or without any correction) between sequences in the HSP. These are computed before any of the processing required for the Ka Ks computations (the removal of certain nucleotides and codons, see the method section of the paper).
More than 1TB of intermediate files are generated. The final output corresponds to results of the publication (please see the publication for their description).
Figure2.pdf,Figure3.pdfandFigure4.pdfare produced at scripts 14, 15 and 16 respectively. They correspond to figures of the main textFigureS1.pdfis generated at stage 5. It corresponds to supplementary figure 1.FigureS2.pdfis generated at stage 11. It corresponds to supplementary figure 2.FigureS[3-6].pdfare generated at stage 16. They corresponds to supplementary figures 3-6.tableS1.txtis generated at stage 15. It corresponds to supplementary table 1.tableS2.txtis generated at stage 16. It corresponds to supplementary table 2.supplementary-data3-TEcomposition_per_species.txtis generated at stage 2.supplementary-data4-retained_hits.txtis generated stage 12.
The final output corresponds to results of the publication (please see the publication for their description).
Figure 2AandFigure S3are produced in script16-TEcompositionAndEvolution.RFigure 2Bis produced in script14-showHTTonTree.RFigure 3AandFigure S8are produced in scripttest_lifeStyle.RFigure 4Ais produced intest_phylogeneticProximity_global.RFigure 4B-G,Figure 5,Figure S6,Figure S10andFigure S11are produced in scripttest_bayesian.RFigure S2is produced in script11-hitGroupEvaluation.RFigure S7is produced in scripttest_habitatSharing.RFigure S9is produced in script15-testHTTexcess.RSupplemental Dataset S3is produced in script12-HTTeventCount.RSupplemental Dataset S4is produced in script11-hitGroupEvaluation.RSupplemental Dataset S5is produced in scripttest_lifeStyle.RSupplemental Dataset S6is produced in scripttest_bayesian.R