inti edited this page Aug 31, 2011 · 16 revisions

Description

This program performs gene-set analyses on the results of the Connectivity Map (CMAP) project. For details about the project, see the CMAP website: www.broadinstitute.org/cmap/.

Details

The user must provide a file with a list of genes and an associated tag (e.g., genes associated with a disease). The program will:

  • rank drugs based on the differential expression of these genes compared with all other genes tested.
  • rank diseases based on the drugs known to treat them, i.e., diseases whose drugs affect the expression of the input genes.
  • rank the known mechanisms of action of the drugs, in the same way as it ranks diseases.

These ranks can be calculated either as a test of differential expression (one-sided or two-sided) or as an Area Under the Curve (AUC) value. AUC values have the advantage of providing a natural measure of how well the expression profile of the input genes classifies diseases and drug mechanisms of action. Tests of differential expression are probably better suited if the objective of the analysis is to rank drugs. We provide drug-to-disease and drug-to-mechanism_of_action mappings based on Clinical Trials (http://clinicaltrials.gov/) and MeSH annotations (www.nlm.nih.gov/mesh/). However, users can provide additional mappings for the drug identifiers, e.g., disease classes such as nervous system or immune disease, main target tissue, or associated side effects.
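To make the AUC interpretation concrete, here is a minimal Python sketch. It is not the program's code (the program is written in R and lists pROC among its dependencies); the drug names, scores and condition annotation below are invented for illustration. The sketch computes the AUC as the fraction of (annotated, not annotated) drug pairs in which the drug annotated to the condition has the higher score, counting ties as one half.

```python
def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative drug")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical per-drug enrichment scores for one input gene-set.
drug_scores = {
    "drug_a": 3.1, "drug_b": 2.4, "drug_c": 0.2,
    "drug_d": -0.5, "drug_e": 1.9, "drug_f": -1.2,
}
# Hypothetical annotation: drugs used to treat one condition.
treats_condition = {"drug_a", "drug_b", "drug_d"}

names = sorted(drug_scores)
example_auc = auc(
    [drug_scores[n] for n in names],
    [n in treats_condition for n in names],
)
print(round(example_auc, 3))
```

An AUC near 0.5 would mean the scores do not separate drugs annotated to the condition from the rest; values closer to 1 mean the annotated drugs tend to rank higher.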

Program dependencies

In order to run the program you will need R (www.r-project.org) and the following R packages, all of which are available at CRAN (cran.r-project.org):

  • plyr
  • ggplot2
  • pROC
  • optparse
  • fdrtool

Download

You can download the program folder compressed with zip or tar from this page. The source code is available here and you are welcome to develop new functions. The development branch is here. If you have git installed on your computer you can get the latest version and update the code with the following commands.

Get the code:

git clone git://github.com/inti/CMAP_GSA.git master

or update the code (the clone command above places the repository in a directory called master):

cd master
git pull origin master

Examples:

1 - Gene-set analysis with genes near SNPs associated with Crohn's disease in GWAS.

This example runs the analysis for the genes mapped to SNPs (within 20 kb of the SNP) that reach genome-wide significance (p-value < 5x10^-8) in GWAS for Crohn's disease (CD) reported in the GWAS catalogue. We will obtain AUC values indicating the diseases whose drugs change the expression patterns of CD-associated genes. The command line and screen output are:

./cmap_gsa.R --gene_file example/crohn_quote_s_disease_snps.txt -o crohn_quote_s_disease --auc --n_perm 10 --hugo_id --lfdr --collapse_probes_mean --plot --gsa
[1] Tue Aug 30 15:43:40 2011        Loading CMAP data from [ data/CMAP.RData ]. It may take a few minutes.
[1] Tue Aug 30 15:45:19 2011        Loading AFFY probes to gene mapping to parse CMAP data from [ data/probes_annotation.txt ]
[1] Tue Aug 30 15:45:19 2011        Filtering out probes matching to multiple genes
[1] Tue Aug 30 15:45:20 2011        Reading conditions for genes from [ data/trials_4_cmap_drugs.txt ]
[1] Tue Aug 30 15:45:21 2011        Reading gene-sets of to be analysed from [ example/crohn_quote_s_disease_snps.txt ]
[1] Tue Aug 30 15:45:21 2011        Collapsing probe values at gene level using method [ Average ].
[1] Tue Aug 30 15:48:23 2011        Running PAGE analysis
[1] Tue Aug 30 15:48:23 2011           '-> WIll run [ 10 ] permutations to calculate the statistic under the null
  |======================================================================| 100%
[1] Tue Aug 30 15:48:29 2011           '-> Writing GSA results to [ crohn_quote_s_disease.PAGE.txt ]
[1] Tue Aug 30 15:48:29 2011        Calculating AUC values for each gene-set
  |======================================================================| 100%
[1] Tue Aug 30 15:48:29 2011           '-> For each condition I will calculate AUC values based on the [ empirical_Z ] column
  |======================================================================| 100%
[1] Tue Aug 30 15:48:33 2011        Writting output files and plots to [ crohn_quote_s_disease.condition.* ] with extensions txt and pdf, respectively 
  |======================================================================| 100%
[1] Tue Aug 30 15:48:34 2011        Analysis finished.

This will create three output files: crohn_quote_s_disease.PAGE.txt, crohn_quote_s_disease.txt and crohn_quote_s_disease.crohn_quote_s_disease.pdf. The crohn_quote_s_disease.PAGE.txt has the gene-set enrichment results with the corresponding local False Discovery Rate values to select drugs which significantly change the expression of the genes in the gene-set. The crohn_quote_s_disease.txt file has the actual AUC values and their confidence intervals. The AUC values were calculated for the drug rank formed by the PAGE results. The *.pdf file has a plot that allows for a quick inspection of the results: the plot presents AUC values and their 95 % confidence intervals on the y-axis and conditions on the x-axis.
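The enrichment step reported in the log is a PAGE analysis with permutations. The following Python sketch illustrates the general idea under stated assumptions: it uses the standard PAGE Z-score (gene-set mean minus the mean over all genes, scaled by the square root of the set size and divided by the standard deviation over all genes) and a simple permutation scheme based on random gene-sets of the same size. The gene names and values are invented, and the way the program derives its empirical_Z column may differ from this.

```python
import math
import random
import statistics


def page_z(values, gene_set):
    """PAGE Z-score of a gene-set against all genes measured for one drug."""
    all_values = list(values.values())
    mu = statistics.fmean(all_values)
    sigma = statistics.pstdev(all_values)  # population SD over all genes
    in_set = [values[g] for g in gene_set if g in values]
    return (statistics.fmean(in_set) - mu) * math.sqrt(len(in_set)) / sigma


def permutation_p(values, gene_set, n_perm=1000, seed=0):
    """Two-sided empirical p-value from random gene-sets of the same size."""
    rng = random.Random(seed)
    genes = list(values)
    size = sum(1 for g in gene_set if g in values)
    observed = abs(page_z(values, gene_set))
    hits = sum(
        abs(page_z(values, rng.sample(genes, size))) >= observed
        for _ in range(n_perm)
    )
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical expression changes of ten genes after treatment with one drug.
expression = {
    "g1": 2.0, "g2": 1.5, "g3": 1.8, "g4": -0.2, "g5": 0.1,
    "g6": -0.4, "g7": 0.3, "g8": -0.1, "g9": 0.0, "g10": 0.0,
}
input_genes = ["g1", "g2", "g3"]

z = page_z(expression, input_genes)
p = permutation_p(expression, input_genes)
print(round(z, 2), p)
```

Repeating this for every drug gives one score per drug, which is the kind of drug ranking the AUC step then evaluates against the condition annotations.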

Program options

Running the following command prints the program's options:

./cmap_gsa.R --help
usage:  usage: ./cmap_gsa.R [options]

options:
-v, --verbose
	Print extra output [default]

-q, --quietly
	Print little output

-o NULL, --outfile=NULL
	Output file Name

-d DISTANCE, --distance=DISTANCE
	Max SNP-to-gene distance allowed (in kb)

--gene_file=GENE_FILE
	File with genes to be analyse

--clinical_trials=CLINICAL_TRIALS
	File with gene conditions

--auc
	Calcukate Area Under the Curve

-b, --bootstraping
	Set AUC calculation on bootstraping mode. Default: DeLong method.

--correct_ci
	Set correction for multuple testing on confidence interval of AUC values

--n.boot=N.BOOT
	Number of bootstraping replicates to calculate AUC values

--ci_boot=CI_BOOT
	Condidence interval value for AUC values

--plot
	Make plots of results. It need library ggplot2 installed

--gsa
	Run enrichemnt analysis using PAGE method

--n_perm=N_PERM
	Number of permutation to calculate the null distribution and significance for  enrichmet results

--one_sided
	Enrichment test is one sided. Differential regulation regarding is up or down

--two_sided
	Enrichment test is two sided. Test both up and down regulation separately

--lfdr
	Calculate local FDR values for PAGE results

--hugo_id
	User gene ids are Hugo symbols

--ensembl_id
	User gene ids are ensembl gene ids

--collapse_probes_mean
	Collapse probe level information into a single gene-leve value by averaging individual probe values

--collapse_probes_abs_max
	Choose the row with the highest absolute value

--collapse_probes_abs_min
	Choose the row with the lowest absolute value

--collapse_probes_eigen
	Choose the row with the lowest absolute value

-h, --help
	Show this help message and exit
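The --collapse_probes_* options reduce several probes that map to the same gene to a single gene-level value. The Python sketch below shows one way to read the mean, abs_max and abs_min options for the values of a single drug; the probe identifiers, the mapping and the function name are hypothetical, and the program may apply the selection across all drugs at once rather than per drug. The eigen option is left out because the help text above does not describe it separately.

```python
from collections import defaultdict
from statistics import fmean


def collapse_probes(probe_values, probe_to_gene, method="mean"):
    """Collapse probe-level values to one value per gene."""
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # skip probes without a gene annotation
            by_gene[gene].append(value)
    pick = {
        "mean": fmean,                            # average of the probes
        "abs_max": lambda vs: max(vs, key=abs),   # largest absolute value
        "abs_min": lambda vs: min(vs, key=abs),   # smallest absolute value
    }[method]
    return {gene: pick(vs) for gene, vs in by_gene.items()}


# Hypothetical probe-level values for one drug and a probe-to-gene mapping.
probe_values = {"probe_1": 0.5, "probe_2": -2.0, "probe_3": 1.0}
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}

by_mean = collapse_probes(probe_values, probe_to_gene, "mean")
by_abs_max = collapse_probes(probe_values, probe_to_gene, "abs_max")
by_abs_min = collapse_probes(probe_values, probe_to_gene, "abs_min")
print(by_mean, by_abs_max, by_abs_min)
```

Averaging smooths out disagreement between probes, while the absolute-value choices keep the sign of the selected probe, which matters when up- and down-regulation are tested separately.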

References

Lamb J, Crawford ED, Peck D, et al. (2006). The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313: 1929-1935.

BUGS and questions

In case of bugs or questions, please e-mail Inti Pedroso (intipedroso@gmail.com) or Gerome Breen (gerome.breen@gmail.com).

TODO

  • Document options
  • Add files with mechanism of action