Full Documentation

EduardoGadeGusmao edited this page Apr 23, 2019 · 6 revisions

This page documents all the tools in this repository. With these tools you can re-create the plots of the Gothe et al. paper:

Placeholder

and also use them to analyze your own data!

Please note that the parameters of all tools are given as 'options' and can be supplied in any order. However, ALL input options must be provided (with a few exceptions, such as numerical parameters, which have default values).

Functions:

Gene metaplots:

This program creates a meta-plot of a gene based on the signals and regions given. This program takes the following parameters:

  • --nBins: Number of bins into which the meta-plot region (6 Kbp) is divided to average the signal. The original analysis used 15.
  • --tssExt: The size, in bp, of promoter regions. The original analysis used 2 Kbp.
  • --bamCount: The total number of valid reads in the BAM file containing the signal to plot.
  • --aliasFileName: File containing gene aliases. For more information see below.
  • --genesFileName: A file containing the location of genes. In this particular case the format has to be UCSC's RefSeq table.
  • --expressionList: A plain text (tab-separated) file containing the genes in the first column and their expression in the second column.
  • --bamFileName: A BAM file containing the signal from which the meta-plot will be calculated.
  • --temp: Temporary location to aid in the execution.
  • --outputFileName: Output file name.
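
As an illustration of what --nBins controls, here is a minimal sketch of averaging a per-base signal into equal-width bins. It is not the repository's implementation: the function name `bin_average` and the flat example signal are hypothetical, and the sketch assumes the region length is an exact multiple of the number of bins (6000 bp / 15 bins = 400 bp per bin).

```python
def bin_average(signal, n_bins=15):
    """Average a per-base signal into n_bins equal-width bins.

    Hypothetical helper for illustration only; the real tool may bin
    and normalize differently.
    """
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    if len(signal) % n_bins != 0:
        raise ValueError("signal length must be divisible by n_bins")
    width = len(signal) // n_bins
    # One mean value per consecutive window of `width` bases.
    return [sum(signal[i:i + width]) / width
            for i in range(0, len(signal), width)]


# Made-up 6 Kbp region with a flat coverage of 2 reads per base.
region = [2.0] * 6000
profile = bin_average(region, n_bins=15)
```

With 15 bins each value in `profile` summarizes 400 bp; a larger --nBins gives a finer but noisier profile.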

Correlation between double strand breaks and genomic / epigenetic / regulatory features:

This program creates scatterplots and lineplots with the correlation of user-defined features and DSBs. This program takes the following parameters:

  • --half-ext: Half the distance (in bp) from the middle of the feature to calculate the correlation.
  • --regions: A file containing the regions on which the features will be centered to calculate the correlation.
  • --signal-label-list: A comma-separated list of labels for each signal (feature) to be plotted.
  • --signal-count-list: A comma-separated list containing the total read count of each signal's (feature's) BAM file.
  • --signal-file-list: A comma-separated list of BAM files for each signal (feature) to be plotted.
  • --temp: Temporary location to aid in the execution.
  • --output-file: Output file name.
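
The three comma-separated lists describe the same signals, so a reasonable assumption (not confirmed by the source) is that they must have the same length and the same order. The sketch below shows one way to pair them up before running the tool; `parse_signal_lists` is a hypothetical helper and the labels, counts and file names are made up.

```python
def parse_signal_lists(labels, counts, files):
    """Pair up the values of --signal-label-list, --signal-count-list
    and --signal-file-list.

    Assumption: entry i of each list refers to the same signal.
    """
    label_list = labels.split(",")
    count_list = [int(c) for c in counts.split(",")]
    file_list = files.split(",")
    if not (len(label_list) == len(count_list) == len(file_list)):
        raise ValueError("label, count and file lists differ in length")
    return list(zip(label_list, count_list, file_list))


# Made-up example values.
signals = parse_signal_lists("H3K4me3,CTCF", "1000,2500", "a.bam,b.bam")
```

Checking the lengths up front catches a missing or extra entry before any BAM file is read.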

Heatmaps of genomic / epigenetic and regulatory features ordered by double strand break intensity:

This program creates a heatmap of the given signal, centered on the given features. Furthermore, it sorts the heatmap in decreasing order of the intensity of a given list (e.g. DSBs). This program takes the following parameters:

  • --half-ext: Half the distance (in bp) from the middle of the feature to plot the heatmap.
  • --regions: A BED file; the final heatmap is sorted by its SCORE column.
  • --signal-file: A BIGWIG file containing the signal from which the heatmap is generated.
  • --signal-label: A label which will be plotted with the heatmap of the signal.
  • --temp: Temporary location to aid in the execution.
  • --output-file: Output file name.
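
The ordering described above can be sketched as follows. This is an illustration, not the tool's code: `sort_bed_by_score` is a hypothetical helper, it assumes the standard BED layout in which the score is the fifth column, and the example regions are made up.

```python
def sort_bed_by_score(lines):
    """Return BED records sorted by score (column 5), highest first."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        records.append(line.split("\t"))
    return sorted(records, key=lambda fields: float(fields[4]), reverse=True)


# Made-up regions: chrom, start, end, name, score.
bed_lines = [
    "chr1\t100\t200\tregionA\t5\n",
    "chr1\t300\t400\tregionB\t42\n",
    "chr2\t500\t600\tregionC\t17\n",
]
ordered = sort_bed_by_score(bed_lines)
names = [fields[3] for fields in ordered]
```

Here the score would hold the DSB intensity of each region, so the strongest regions end up at the top of the heatmap.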

Correlation between expression, double strand breaks and chromatin conformation:

This program calculates:

  • Distances from a list of genes to their closest anchors.

  • The expression of these genes.

  • The proportion of double-strand breaks (DSBs) in these genes.

It then plots pairwise (2D) correlation plots and a 3D triple correlation plot. This program takes the following parameters:

  • --max-dist: The maximum distance a gene can be from its nearest loop anchor. Set this to an integer twice the largest distance you expect to show in the final plot.

  • --alias-file: The alias file is provided with the package and simply contains multiple collected aliases for the same gene. For more information see below.

  • --chrom-sizes: This is a tab-separated file containing the chromosome names and their lengths. For more information see below.

  • --genes-file: A BED file containing all genes on which the analysis will be based.

  • --expression-file: Either a tab-separated file containing the gene names and their expression, or a BAM file from which the expression will be calculated.

  • --dsb-file: A BED or BAM file containing the positions of the double strand breaks.

  • --distance-file: At the moment, the tool only supports loops called by HiCCUPS. CTCF annotation is not required but highly recommended, as in Rao et al.'s file in the example below.

  • --temp-loc: A path in which the program will store all temporary files. It can be erased after the execution; however the program itself won't erase this path as it might be useful for troubleshooting.

  • --output-location: The location in which the output tables and figures will be created.
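
To make the role of --max-dist concrete, the sketch below finds the distance from a gene to its closest loop anchor and discards genes that lie farther away than the cutoff. It is only a sketch under stated assumptions: the source does not say how genes and anchors are reduced to positions, so each is represented here by a single hypothetical coordinate on the same chromosome, and `nearest_anchor_distance` is not a function of this repository.

```python
from bisect import bisect_left


def nearest_anchor_distance(position, sorted_anchors, max_dist):
    """Distance from `position` to the closest anchor, or None if the
    closest anchor is farther away than `max_dist`.

    Assumptions: `sorted_anchors` is an ascending list of anchor
    coordinates on the gene's chromosome; the cutoff is inclusive.
    """
    if not sorted_anchors:
        return None
    i = bisect_left(sorted_anchors, position)
    # Only the anchors just before and just after the position can be closest.
    candidates = sorted_anchors[max(i - 1, 0):i + 1]
    dist = min(abs(position - anchor) for anchor in candidates)
    return dist if dist <= max_dist else None


# Made-up anchor coordinates.
anchors = [100, 500, 900]
close = nearest_anchor_distance(450, anchors, max_dist=1000)
far = nearest_anchor_distance(5000, anchors, max_dist=1000)
```

In this example `close` is a distance of 50 bp, while `far` is None because the nearest anchor lies beyond the cutoff.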

Double strand breaks behavior at chromatin loops:

This program replicates the CTCF-DSB plots present in the original paper. This program takes the following parameters:

  • --region-type: The type of CTCF file generated. This option is important to ensure a correct and fast analysis. This flag can be one of the following:

    O_FR: The --ctcf-file overlaps active genes whose promoters are upstream of the CTCF. Forward & reverse strand.

    I_FR: The --ctcf-file overlaps active genes whose promoters are downstream of the CTCF. Forward & reverse strand.

    IO_FR: The --ctcf-file overlaps active genes whose promoters are downstream & upstream of the CTCF. Forward & reverse strand.

    IO_F: The --ctcf-file overlaps active genes whose promoters are downstream & upstream of the CTCF. Only in the forward strand.

    IO_R: The --ctcf-file overlaps active genes whose promoters are downstream & upstream of the CTCF. Only in the reverse strand.

    intergenic: The --ctcf-file does not overlap genes or promoters.

    inactive: The --ctcf-file overlaps inactive genes.

  • --alias-file: File containing gene aliases. For more information see below.

  • --gene-file: A simple BED file containing the location of genes.

  • --ctcf-file: A file containing the particular genes (or other elements) that overlapped a CTCF factor.

  • --expression-file: A plain text (tab-separated) file containing the genes in the first column and their expression in the second column.

  • --dsb-file: A BAM file containing all the DSBs.

  • --temp: Temporary location to aid in the execution.

  • --output-file: Output file name.

Common file types:

This section describes the common file types in detail and how they should be formatted.

Alias file:

The alias file is a resource mapping between different gene nomenclatures. It contains three tab-separated columns. The first column contains the gene's ENSEMBL ID. The second column contains the human-readable gene symbol (gene name, such as 'CTCF') preferred for display in plots. The third column contains a '&'-separated list of all the possible identifiers (aliases) for that gene in different databases. A very complete version for the human genome (version hg19) can be found in the 'data' folder of this repository.
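
The three-column layout described above can be read with a few lines of Python. This is a minimal sketch, not the repository's parser: `parse_alias_lines` is a hypothetical helper, the example line uses made-up identifiers, and mapping the ENSEMBL ID and the symbol to the symbol as well is a design assumption.

```python
def parse_alias_lines(lines):
    """Map every identifier of a gene to its preferred symbol.

    Expects three tab-separated columns: ENSEMBL ID, gene symbol,
    and a '&'-separated list of aliases.
    """
    alias_to_symbol = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ensembl_id, symbol, aliases = line.split("\t")
        # Assumption: the ID and the symbol are also valid lookup keys.
        for name in [ensembl_id, symbol] + aliases.split("&"):
            alias_to_symbol[name] = symbol
    return alias_to_symbol


# Made-up example line; real files contain actual ENSEMBL IDs and aliases.
alias_map = parse_alias_lines(["ENSG_EXAMPLE\tCTCF\tCTCF&ALIAS_A&ALIAS_B\n"])
```

A lookup such as `alias_map["ALIAS_A"]` then returns the symbol to display, which lets gene lists that use different nomenclatures be matched against each other.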

Chromosome sizes:

The chromosome sizes file is a tab-separated file containing the chromosome names in the first column and the length of each chromosome, in bp, in the second column. An example (without chromosomes Y and M, and without contigs) can be found in the 'data' folder of this repository.
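
A minimal sketch of reading this two-column format into a dictionary is shown below. `parse_chrom_sizes` is a hypothetical helper, and the chromosome lengths in the example are made up for illustration rather than taken from a real genome build.

```python
def parse_chrom_sizes(lines):
    """Read 'name<TAB>length' lines into a {name: length} dictionary."""
    sizes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, length = line.split("\t")[:2]
        sizes[name] = int(length)
    return sizes


# Made-up lengths, for illustration only.
chrom_sizes = parse_chrom_sizes(["chr1\t1000\n", "chr2\t800\n", "\n"])
```

Such a dictionary is typically used to clip extended regions so that they do not run past the end of a chromosome.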

BED, BAM and BIGWIG files:

The following link explains these files in detail: UCSC Genomic Formats

UCSC's Refseq genes:

To generate the file containing the RefSeq genes and their locations, with the columns that this tool recognizes, please visit:

UCSC Table Browser

In the field called "table" please select "RefSeq All (ncbiRefSeq)" and then simply click the "get output" button.

Further help, comments, questions and bug reports

For further help, comments or questions, or to report a bug in the code, please send an email to the bioinformatician in charge of this repository:

Dr. Eduardo Gade Gusmao

eduardogade@gmail.com