Globally coherent datasets (GCDs) contain (at least) three levels of information: (i) genome-wide DNA variation, (ii) an intermediate trait, and (iii) a (clinical) phenotype. Intermediate traits are typically gene expression, but may also include proteomic, metabolomic, and other molecular data. These data sets make it possible to dissect how a genomic perturbation (e.g. a somatic copy-number alteration) leads to changes in cellular networks and pathways that then shape the phenotype (e.g. how aggressive a type of cancer is). Examples of GCDs are The Cancer Genome Atlas, the International Cancer Genome Consortium, the METABRIC project at the CRI in Cambridge, and the data collected by SAGE Bionetworks.

The challenge of GCDs is to gain a global understanding of how the different layers of information are connected. Effective statistical methods can provide a systems-level view, while network methods are key in visualizing complex data sets. Together these methods can help to 'boil down' the complex multi-layered GCDs into testable hypotheses for in-depth follow-up studies.

This course consists of lectures and practical sessions. In the lectures you will learn statistics and machine learning concepts to analyze globally coherent data sets from a systems and network perspective. In the practical sessions you will have the opportunity to try out the new concepts using publicly available software and an example breast cancer data set.


Day 1: Thursday, 28th Oct 2010
In the morning Florian Markowetz will teach you basic concepts underlying current state-of-the-art approaches. In the afternoon you will apply these approaches to a breast cancer data set (supervised by Yinyin Yuan and Mauro Castro).

Day 2: Friday, 29th Oct 2010
In the morning, we will have a general discussion session on the pros and cons of the approaches introduced on Day 1, their merits and pitfalls, and how they can best be applied to GSK projects. In the afternoon, we will continue the practical session of Day 1.




You will learn how to use software written in R and freely available from CRAN and Bioconductor. You will need R 2.12 and Bioc 2.7. In particular, we will highlight the following packages:

  • iCluster for integrative clustering of multiple genomic data types using a joint latent variable model.
  • BHC combines Dirichlet Processes with hierarchical clustering.
  • GeneNet estimates Gaussian graphical models.
  • HTSanalyzeR provides an integrated interface to enrichment and network analysis.
  • DANCE quantifies the impact of genomic alterations on gene expression and compares it between tumour sub-types.
  • lol contains various optimization methods for matrix-to-matrix Lasso inference.
  • IGIR provides a flexible interface to graphic software.

To install the .tar.gz packages for DANCE and lol on Windows, type in R:
install.packages('PACKAGE.tar.gz', repos=NULL, type='source')
where PACKAGE stands for the name of the package.
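
Before the practical session it may help to confirm that your setup matches the requirements above. The following is a minimal sketch in base R and not part of the official course material. It assumes that each package in the list above installs under exactly the name shown there; adjust the names if your installation differs.

```r
# Check that the running R version is at least 2.12.0, the version the course asks for.
r_ok <- getRversion() >= "2.12.0"

# Package names copied from the list above (assumption: they install under these exact names).
course_pkgs <- c("iCluster", "BHC", "GeneNet", "HTSanalyzeR", "DANCE", "lol", "IGIR")

# Names of all packages found in the current library paths.
installed <- rownames(installed.packages())

# Course packages that are not installed yet.
missing_pkgs <- setdiff(course_pkgs, installed)

cat("R version OK:", r_ok, "\n")
if (length(missing_pkgs) > 0) {
  cat("Not yet installed:", paste(missing_pkgs, collapse = ", "), "\n")
} else {
  cat("All course packages are installed.\n")
}
```

Any package reported as not yet installed can then be installed from CRAN, Bioconductor, or the provided .tar.gz file as described above.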


In the practical session of Day 1 we will illustrate the statistical concepts taught in the morning session on a breast cancer data set (Chin et al., 2007) that combines (i) copy-number profiles, (ii) gene expression measurements, and (iii) clinical information for 171 patients.

We collected all data you will need into a .ZIP file (>100MB) which contains ...

Background reading



e: first.last@cancer.org.uk
p: +44 (0) 1223 40 4315



In case you run into problems when downloading or installing, please contact Mauro Castro.