The mixtR package (Matched Interactions across Tissues) contains the computational and statistical methods for identifying and exploring associations between sets of genes or molecular processes across tissues.
The mixtR analysis requires gene expression profiles of two matched tissues and
is based on two core functions:
- sig.ranksum
a function that maps samples to a linear ordering based on
expression of a set of genes of interest
- stat.ranksum
a function that derives estimates of significance for
association between gene sets of interest across tissues.
The mixtR package also offers ad-hoc plotting functions (cohort_heatmap
,
cohort_scatterplot
, cohort_boxplot
) for visualization of the results.
If the user wants to build a web application to explore/disseminate the results,
the outputs should be saved in the /data
folder of the mixtApp R package.
Instructions to build the web application are described in the MIxT web
application vignette.
First load the mixtR package in your R session.
library(mixtR)
In this vignette, we are taking the example of gene expression profiles from microdissected epithelium and stroma from non-inflammatory invasive breast tumors from boersma et al.
Those data are stored in the /data
folder of the mixtApp R package and are accessed through RawGit that provides a static reference to the files.
files= c("dat.rda", "moduleColors.rda") dataDir <- 'https://rawgit.com/vdumeaux/mixtApp/master/data/' files.dir=paste0(dataDir,files) for(i in 1:length(files)){ load(url(files.dir[i])) }
Expression and clinical data from matched tissues were formatted so that they
are contained in a list named dat
.
Each object of the list must include the following objects for each tissue:
exprs
an expression matrix with m genes in rows and n patients in columns.
Rownames of exprs
are the same gene identifiers as the one used to define gene sets.
cohorts
list of patients subgroups. If you have only one , the vector
should still be present
clinical
(optional) a clinical data frame with n patients in rows and
clincal variables in columns. Categorical variables should be character or
factors. Continuous variables should be numeric.
clinical.colors
(optional) a clinical data frame with n patients in rows and
clincal variables in columns coded by discrete or continuous color-coding.
We recommend the colorspace R package that provides perceptually-based color
palettes for coding categorical data (qualitative palettes) and numerical
variables (sequential and diverging palettes).
str(dat)
moduleColors
is a list of length 2: each object is a named vector of gene set
appartenance for each tissue. In our example, each gene is part of a module
grouping strongly co-expressed genes together as computed by WGCNA R package.
Other clustering approaches could replace this step. In general any approach
that moves the focus from a single gene to a “gene set” could suffice; eg it
could be groups of genes involved in different pathways/molecular processes.
str(moduleColors)
IF you work with gene networks, you can import the top node and edges for each network that you can vizualize in the MIxT web application.
load(url("https://rawgit.com/vdumeaux/mixtApp/master/data/net.rda")) str(net)
sig.ranksum()
allows you to map 𝑛 samples to a linear ordering based on
expression of 𝑘 genes within a given gene set/module.
If you know the directionality of gene expression, the gene set of interest can
be split between the up
and dn
genes.
If you do not know the directionality of gene expression, all genes in the gene
set are listed in the ns
argument. In this case, genes will be partitioned into two
groups around myeloids using correlation as the distance metric.
The patient ordering is then partition in three categories (high, mid, low)
following a random sampling procedure (eg n
= 10,000). By default, the region of
independence (ROI) is determined by the 0.025 and 0.975 percentile point of
the distribution of random patient ranks (middle.range = 0.95).Both n
and
middle.range
can be set differently.
gene.sets <- lapply(moduleColors, function(x.moduleColors){ split(names(x.moduleColors), factor(x.moduleColors)) }) str(gene.sets)
## compute ranksum for each gene sets in each tissue bresat <- list() bresat$epi <- lapply(gene.sets$epi, function (mod){ sig.ranksum (x.dat = dat$epi, ns = which(rownames(dat$epi$exprs) %in% mod), n=10000, mc.cores = 2)}) bresat$stroma <- lapply(gene.sets$stroma, function (mod){ sig.ranksum (x.dat = dat$stroma, ns = which(rownames(dat$stroma$exprs) %in% mod), n=10000, mc.cores = 2)})
save(bresat, file = "../../mixtApp/data/bresat.rda")
load(url("https://rawgit.com/vdumeaux/mixtApp/master/data/bresat.rda"))
names(bresat$epi) str(bresat$epi$blue)
You can plot heatmap of a given gene set expressed in a given tissue and reordered by the output of sig.ranksum().
cohort_heatmap(mixt.dat = dat, mixt.ranksum = bresat, tissue = "epi", module = "darkgrey", cohort.name = "all", cl.height = 3)
The MSigDB database from the Broad Institute provides multiple collections of
gene sets that can be used to functionally annotate the gene modules in each
tissue.
Enrichment for each gene signature is estimated for all genes in each module and
separately for genes that are positively (red genes up) or negatively (blue genes dn)
correlated with the patient ranksum in each cohort using the hypergeometric
minimum-likelihood P-values,computed with the function ‘dhyper’
(equivalent to one-sided Fisher exact test).
P-values can then adjusted for multiple testing choosing p.adjust.method
(see ?stats::p.adjust to review methods available).
MSigDB cannot be accessed without a free registration, so the first things
you'll need to do is download the .gmt files using the MSigDB page and save it
in a folder named extdata
for example.
read.gmt.msigdb()
will import your .gmt files in your working
environment as follow
msigdb.files <- c("c1.all", "h.all", "c2.cp", "c2.cgp", "c5.all", "c6.all", "c7.all") msigdb <- list() msigdb <- lapply(msigdb.files, function(p) { input.file <- file.path("../../extdata", paste0(p, ".v5.2.symbols.gmt")) ret <- read.gmt.msigdb(input.file) return(ret) }) names(msigdb) <- msigdb.files
We choose to use the hallmark (h), poistional (c1) and curated (c2) gene set collections.
sig<-list() sig$all.sig<-c(msigdb$h.all, msigdb$c1.all, msigdb$c2.cp, msigdb$c2.cgp) sig$set<-c(rep("h", length(msigdb$h.all)), rep("c1", length(msigdb$c1.all)), rep("c2.cp", length(msigdb$c2.cp)), rep("c2.cgp", length(msigdb$c2.cgp))) msigdb.enrichment <- list() msigdb.enrichment$epi<- functional.enrichment(mixt.ranksum = bresat, tissue="epi", cohort.name = "ern", functional.groups = sig, p.adjust.method = "BH", mc.cores=80) msigdb.enrichment$stroma<- functional.enrichment(mixt.ranksum = bresat, tissue="stroma", cohort.name = "ern", functional.groups = sig, p.adjust.method = "BH", mc.cores=80)
save(msigdb.enrichment, file = "../../mixtApp/data/msigdb.enrichment.rda")
load(url("https://rawgit.com/vdumeaux/mixtApp/master/data/msigdb.enrichment.rda"))
library(plyr) tissue.names<-c("epi", "stroma") set<-levels(factor(msigdb.enrichment$epi$turquoise$results$sig.set)) results<-list() for (tissue in tissue.names){ results[[tissue]]<-lapply(msigdb.enrichment[[tissue]], function(x) x$results) } msig.results<-list() msig.results<-sapply(tissue.names, function(tissue) { ret <- lapply(names(results[[tissue]]), function(mod) { res<-do.call(rbind, lapply(set, function(s){ x<-results[[tissue]][[mod]][results[[tissue]][[mod]]$sig.set==s,] x<-cbind(data.frame(name=rownames(x)), x) top.5<-x[order(as.numeric(as.character(x$updn.pval)), decreasing=F),][1:5,]})) }) names(ret)<-names(results[[tissue]]) return(ret) }, simplify=F) epi.msig<-ldply(msig.results$epi) stroma.msig<-ldply(msig.results$stroma) knitr::kable(epi.msig[1:6,]) knitr::kable(stroma.msig[1:6,])
write.table(epi.msig, file = "../output/epi.msig.txt", sep="\t", col.names = T, quote = FALSE) write.table(stroma.msig, file = "../output/stroma.msig.txt", sep="\t", col.names = T, quote = FALSE)
We also can annotate gene sets/modules based on gene ontology (GO) enrichment, for example using the R topGO package.
library(topGO) library(org.Hs.eg.db) library(plyr) bckg.genes <- rownames(dat[[1]]$exprs) xx <- annFUN.org("BP", mapping = "org.Hs.eg.db", ID = "symbol") head(xx) xx <- unique(unlist(xx)) xx<-xx[xx %in% bckg.genes] goterms <- NULL goterms <- lapply(tissue.names, function (tissue){ ret <- lapply(names(gene.sets[[tissue]]), function(mod) { gene.list <- gene.sets[[tissue]][[mod]] gene.list.2 <- ifelse(xx %in% gene.list, 1, 0) gene.list.2 <- factor(gene.list.2) names(gene.list.2) <- xx GO.genes <- new("topGOdata", ontology = "BP", allGenes = gene.list.2, nodeSize = 5, annot = annFUN.org, mapping = "org.Hs.eg.db", ID = "symbol") resultFisher <- runTest(GO.genes, algorithm = "classic", statistic = "fisher") weight01Fisher <- runTest(GO.genes, statistic = "fisher") allRes <- GenTable(GO.genes, classicFisher = resultFisher, weight01Fisher= weight01Fisher, orderBy = "weight01Fisher", topNodes=length(usedGO(GO.genes)), numChar = 1000) allRes <- subset(allRes, classicFisher != "1.00000") gt <- genesInTerm(GO.genes, allRes$GO.ID) mod.genes <- gene.sets[[tissue]][[mod]] common<-lapply(gt, function(sig) intersect(mod.genes, sig)) results <- list(GO.table = allRes, common = common) return(results) }) names(ret) <- names(gene.sets[[tissue]]) return(ret) }) names(goterms) <- tissue.names goterms.table <- list() for (tissue in c("epi", "stroma")){ goterms.table[[tissue]] <- lapply(goterms[[tissue]], "[[", "GO.table") goterms.table[[tissue]] <- lapply(goterms.table[[tissue]], function(x){ dplyr::slice(x, 1:20)}) } epi.goterms<-ldply(goterms.table$epi) stroma.goterms<-ldply(goterms.table$stroma)
write.table(epi.goterms, file = "epi.goterms.txt", sep="\t", col.names = T) write.table(stroma.goterms, file = "stroma.goterms.txt", sep="\t", col.names = T)
save(goterms, file = "../../mixtApp/data/goterms.RData")
load(url("https://rawgit.com/vdumeaux/mixtApp/master/data/goterms.rda"))
Using ranksums to capture module expression, we can ask how gene sets/modules in each tissue are differentially expressed according to patient’s clinicopathological variables. The type of the clinicopathological attribute (categorical or continuous) determines the underlying statistical test. Pearson correlation (Student asymptotic p-value) is used to test association between a given module and continuous patient attributes (eg. age). Analysis of Variance (ANOVA) is used to test association between a given module and categorical patient attributes (eg. ER status).
Significance of the associations is computated using permutations. To determine association between modules within a given subtype, we first stratify on subtypes and then rerun the MIxT procedure: (i) computation of ranksums in each module, (ii) calculation of the association test and (iii) permutations to estimate significance of associations.
epi.clinical.assoc <- lapply(names(dat$epi$cohorts), function(cohort.name) { clinical.ranksum(mixt.dat = dat, mixt.ranksum = bresat, tissue="epi", cohort = cohort.name, nRuns = 10000, mc.cores = 80) }) names(epi.clinical.assoc) <- names(dat$epi$cohorts) knitr::kable(epi.clinical.assoc$ern[, 1:6]) stroma.clinical.assoc <- lapply(names(dat$stroma$cohorts), function(cohort.name) { clinical.ranksum(mixt.dat = dat, mixt.ranksum = bresat, tissue="stroma", cohort=cohort.name, nRuns = 10000, mc.cores = 80) }) names(stroma.clinical.assoc) <- names(dat$stroma$cohorts) knitr::kable(stroma.clinical.assoc$all[, 1:6]) ## Adjust for mutliple testing emp.p <- list() emp.p <- NULL for (cohort.name in names(dat$epi$cohorts)){ emp.p[[cohort.name]] <- cbind(epi.clinical.assoc[[cohort.name]], stroma.clinical.assoc[[cohort.name]]) } ### Families of tests group1 <- c("her2", "HER2S") group2 <- c("er", "LUMS", "ERS") singles <- rownames(emp.p$all)[!rownames(emp.p$all) %in% c(group1, group2) ] gp <- list(group1, group2) mod_clinical_fdr <- NULL mod_clinical_fdr <- lapply(emp.p, function(p.dat){ ret <- do.call(rbind, lapply(gp, function(gpx) { matrix(p.adjust(p.dat[rownames(p.dat) %in% gpx, ], method="BH"), nrow=length(which(rownames(p.dat) %in% gpx)), ncol = ncol(p.dat), dimnames = list(rownames(p.dat)[rownames(p.dat) %in% gpx], colnames(p.dat)) ) })) ret <- rbind(ret, t(apply(p.dat[rownames(p.dat) %in% singles, ], 1, p.adjust, method="BH"))) return(ret) }) mod_clinical_fdr <- lapply(mod_clinical_fdr, function(val){ return(list(epi=val[, 1:23], stroma=val[,24:44])) })
save(mod_clinical_fdr, file="../../mixtApp/data/mod_clinical_fdr.RData")
load(url(("https://rawgit.com/vdumeaux/mixtApp/master/data/mod_clinical_fdr.rda")))
str(mod_clinical_fdr) head(mod_clinical_fdr$all)
We first exclude genes that are not present in both tissue and check significance of overlap between modules across tissue using a Fisher exact test.
overlapFun <- function (labels1, labels2) { labels1 <- as.vector(labels1) labels2 <- as.vector(labels2) levels1 <- sort(unique(labels1)) levels2 <- sort(unique(labels2)) n1 <- length(levels1) n2 <- length(levels2) countMat <- matrix(0, n1, n2) pMat <- matrix(0, n1, n2) for (m1 in 1:n1) for (m2 in 1:n2) { m1Members = (labels1 == levels1[m1]) m2Members = (labels2 == levels2[m2]) t <- table(m1Members, m2Members) # make sure all levels are included tab <- matrix(0, 2, 2) tab[match(rownames(t), c(FALSE, TRUE)), match(colnames(t), c(FALSE, TRUE))] <- t pMat[m1, m2] = fisher.test(tab, alternative = "greater")$p.value countMat[m1, m2] = sum(labels1 == levels1[m1] & labels2 == levels2[m2]) } dimnames(pMat) = list(levels1, levels2) dimnames(countMat) = list(levels1, levels2) pMat[is.na(pMat)] = 1 list(countTable = countMat, pTable = pMat) } str(moduleColors) modColors <- list() modColors$epi <- moduleColors$epi[names(moduleColors$epi) %in% names(moduleColors$stroma)] modColors$stroma <- moduleColors$stroma[names(moduleColors$stroma) %in% names(moduleColors$epi)] str(modColors) gene.set.overlap <- overlapFun(modColors[[1]], modColors[[2]]) str(gene.set.overlap)
P-values are adjusted for multiple testing using the false discovery rate of (Benjamini and Hochberg, 1995).
gene.set.overlap$pTable <- matrix( p.adjust(gene.set.overlap$pTable, method = "BH"), nrow = nrow(gene.set.overlap$pTable), ncol = ncol(gene.set.overlap$pTable), dimnames = list(rownames(gene.set.overlap$pTable), colnames(gene.set.overlap$pTable)) ) knitr::kable(gene.set.overlap$pTable[1:10, 1:8])
Interactions between gene sets/modules across tissues are identified using a random permutation approach based on the Pearson correlation between ranksums of gene expression in gene sets/modules across tissues. This is run independently within each patient cohort.
perm_cor_p <- NULL perm_cor_p$epi2 <- stat.ranksum (mixt.ranksum = bresat, tissue1="epi", tissue2="epi", nRuns = 10000, mc.cores = 80) perm_cor_p$epi_stroma <- stat.ranksum (mixt.ranksum = bresat, tissue1="epi", tissue2="stroma", nRuns = 10000, mc.cores = 80) perm_cor_p$stroma2 <- stat.ranksum (mixt.ranksum = bresat, tissue1="stroma", tissue2="stroma", nRuns = 10000, mc.cores = 80)
save(perm_cor_p, file="../../mixtApp/data/perm_cor_p.rda")
load(url(("https://rawgit.com/vdumeaux/mixtApp/master/data/perm_cor_p.rda")))
Gene sets from tissue 1 (epi
) are in row and gene sets from tissue 2
(stroma
) are in column.
str(perm_cor_p) head(perm_cor_p$epi_stroma$all)
You can then examine these associations with different plots
cohort_scatterplot(mixt.ranksum = bresat, x.tissue = "epi", x.module = "darkgrey", y.tissue = "stroma", y.module = "grey60", cohort.name = "ern") cohort_scatterplot(mixt.ranksum = bresat, x.tissue = "epi", x.module = "darkgrey", y.tissue = "epi", y.module = "grey60", cohort.name = "ern") cohort_scatterplot(mixt.ranksum = bresat, x.tissue = "stroma", x.module = "darkgrey", y.tissue = "stroma", y.module = "grey60", cohort.name = "ern")
cohort_heatmap(mixt.dat = dat, mixt.ranksum = bresat, tissue = "stroma", module = "grey60", cohort.name = "ern", cl.height = 3)
cohort_heatmap(mixt.dat = dat, mixt.ranksum = bresat, tissue = "epi", module = "darkgrey", cohort.name = "ern", cl.height = 3, orderByTissue = "stroma", orderByModule = "grey60")
cohort_boxplot(mixt.ranksum = bresat, tissue = "epi", module = "darkgrey", cohort.name = "ern", orderByTissue = "stroma", orderByModule = "grey60")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.