Globally coherent datasets (GCDs) contain at least three levels of information: (i) genome-wide DNA variation, (ii) an intermediate trait, and (iii) a (clinical) phenotype. Intermediate traits are typically gene expression, but may also include proteomic, metabolomic, and other molecular data. These data sets make it possible to dissect how a genomic perturbation (e.g. a somatic copy-number alteration) leads to changes in cellular networks and pathways that then shape the phenotype (e.g. how aggressive a type of cancer is). Examples of GCDs are The Cancer Genome Atlas, the International Cancer Genome Consortium, the METABRIC project at the CRI in Cambridge, and the data collected by Sage Bionetworks.
The challenge of GCDs is to gain a global understanding of how the different layers of information are connected. Effective statistical methods can provide a systems-level view, while network methods are key in visualizing complex data sets. Together these methods can help to 'boil down' the complex multi-layered GCDs into testable hypotheses for in-depth follow-up studies.
This course consists of lectures and practical sessions. In the lectures you will learn statistics and machine learning concepts for analyzing globally coherent data sets from a systems and network perspective. In the practical sessions you will have the opportunity to try out the new concepts using publicly available software and an example data set from breast cancer.
Day 1: Thursday, 28th Oct 2010
In the morning Florian Markowetz will teach you basic concepts underlying current state-of-the-art approaches. In the afternoon you will apply these approaches to a breast cancer data set (supervised by Yinyin Yuan and Mauro Castro).
Day 2: Friday, 29th Oct 2010
In the morning, we will have a general discussion session on the pros and cons of the approaches introduced on Day 1, their merits and pitfalls, and how they can best be applied to GSK projects. In the afternoon, we will continue the practical session of Day 1.
You will learn how to use software written in R and freely available from CRAN and Bioconductor. You will need R 2.12 and Bioconductor 2.7. In particular, we will highlight the following packages:
- iCluster for integrative clustering of multiple genomic data types using a joint latent variable model.
- BHC combines Dirichlet Processes with hierarchical clustering.
- GeneNet estimates Gaussian graphical models.
- HTSanalyzeR provides an integrated interface to enrichment and network analysis.
- DANCE quantifies the impact of genomic alterations on gene expression and compares it between tumour sub-types.
- lol contains various optimization methods for matrix-to-matrix Lasso inference.
- IGIR provides a flexible interface to graphic software.
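To make the idea behind Gaussian graphical models more concrete, here is a minimal base-R sketch on simulated toy data. It is not the GeneNet implementation: it simply inverts the sample covariance matrix, which only works when there are more samples than variables, whereas packages such as GeneNet are designed for the many-genes, few-samples setting. The variable names and the chain structure x -> y -> z are assumptions made for illustration.

```r
# Toy data (simulated): a chain x -> y -> z, so x and z are only
# indirectly related through y.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- x + rnorm(n)
z <- y + rnorm(n)
dat <- cbind(x = x, y = y, z = z)

# Partial correlations from the inverse covariance (precision) matrix P:
#   pcor[i, j] = -P[i, j] / sqrt(P[i, i] * P[j, j])
partial_cor <- function(m) {
  prec <- solve(cov(m))   # requires more samples than variables
  pc <- -cov2cor(prec)    # scale to unit diagonal and flip the sign
  diag(pc) <- 1
  pc
}

marginal <- cor(dat)         # ordinary (marginal) correlations
partial  <- partial_cor(dat) # correlations after adjusting for the rest

print(round(marginal, 2))
print(round(partial, 2))
```

In the marginal correlation matrix x and z appear clearly related, while their partial correlation should be close to zero. An edge in a Gaussian graphical model corresponds to a non-zero partial correlation, so the graph would keep the direct links x - y and y - z and drop x - z.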
To install the .tar.gz source packages for DANCE and lol on Windows, type in R:
install.packages('PACKAGE.tar.gz', repos=NULL, type='source')
where PACKAGE stands for the name of the package.
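If you prefer to install both packages in one go, a small helper along the following lines may be convenient. This is only a sketch: the file names are assumptions and must be replaced by the names of the tarballs you actually downloaded, and `install_local` is a hypothetical helper, not part of any package.

```r
# Hypothetical file names; replace them with the tarballs you downloaded.
tarballs <- c("DANCE.tar.gz", "lol.tar.gz")

install_local <- function(files) {
  found <- file.exists(files)
  if (any(!found)) {
    message("Not found in ", getwd(), ": ",
            paste(files[!found], collapse = ", "))
  }
  for (f in files[found]) {
    # repos = NULL makes R install from the local file, not a repository
    install.packages(f, repos = NULL, type = "source")
  }
  # Report which files were found (and therefore passed to install.packages)
  invisible(setNames(found, files))
}

status <- install_local(tarballs)
print(status)
```

The helper skips files that are not in the current working directory instead of stopping with an error, so a wrong path is easy to spot from the printed status.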
In the practical session of Day 1 we will illustrate the statistical concepts taught in the morning session on a breast cancer data set (Chin et al., 2007) that combines (i) copy-number profiles, (ii) gene expression measurements, and (iii) clinical information for 171 patients.
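As a rough illustration of how three such data layers fit together in R, the following sketch builds a simulated stand-in with the same number of patients. None of the values come from the Chin et al. data; the gene names, patient identifiers, and the `grade` column are made up. The per-gene correlation at the end is only a simple first look at copy-number effects on expression, not the method implemented in DANCE.

```r
set.seed(42)
n_patients <- 171   # cohort size from the text; all values are simulated
n_genes <- 5
patients <- paste0("P", seq_len(n_patients))
genes <- paste0("gene", seq_len(n_genes))

# Layer (i): copy-number profiles, genes x patients
cn <- matrix(rnorm(n_genes * n_patients), n_genes, n_patients,
             dimnames = list(genes, patients))

# Layer (ii): gene expression; only gene1 is made to follow its copy number
expr <- matrix(rnorm(n_genes * n_patients), n_genes, n_patients,
               dimnames = list(genes, patients))
expr["gene1", ] <- expr["gene1", ] + 2 * cn["gene1", ]

# Layer (iii): clinical table, deliberately in a different patient order
clin <- data.frame(patient = sample(patients),
                   grade = sample(1:3, n_patients, replace = TRUE),
                   stringsAsFactors = FALSE)

# Align all layers on the shared patient identifiers before any analysis
common <- Reduce(intersect,
                 list(colnames(cn), colnames(expr), clin$patient))
cn <- cn[, common]
expr <- expr[, common]
clin <- clin[match(common, clin$patient), ]

# Correlation between copy number and expression of the same gene
cis_cor <- sapply(genes, function(g) cor(cn[g, ], expr[g, ]))
print(round(cis_cor, 2))
```

Matching the layers by patient identifier first matters because the different data types rarely arrive in the same sample order.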
We collected all data you will need into a .ZIP file (>100MB) which contains ...
- How to understand the cell by breaking it: network analysis of gene perturbation screens. F. Markowetz. PLoS Comp Bio, 2010 Feb 26;6(2):e1000655.
- Inferring cellular networks - a review. F. Markowetz, R. Spang. BMC Bioinformatics, 8(Suppl 6):S5, 2007.
- R/BHC: fast Bayesian hierarchical clustering for microarray data. Savage RS, et al. BMC Bioinformatics. 2009 Aug 6;10:242.
- Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype discovery. Shen R, Olshen AB, Ladanyi M. Bioinformatics. 2009 Nov 15;25(22):2906-12.
- BioNet: an R-Package for the functional analysis of biological networks. Beisser D, et al. Bioinformatics. 2010 Apr 15;26(8):1129-30.
- HTSanalyzeR: a R/Bioconductor package for integrated network analysis of high-throughput RNAi screens. C. Terfve, J.C. Rose, X. Wang, F. Markowetz. Submitted.
- An integrative genomics approach to infer causal associations between gene expression and disease. Schadt EE, et al. Nat Genet. 2005 Jul;37(7):710-7.
- Harnessing naturally randomized transcription to infer regulatory relationships among genes. Chen LS, Emmert-Streib F, Storey JD. Genome Biol. 2007;8(10):R219.