knitr::opts_chunk$set(tidy=FALSE, cache=TRUE, dev="png", message=FALSE, error=FALSE, warning=TRUE)
If you use ANF in published research, please cite:
Tianle Ma, Aidong Zhang, Integrate Multi-omic Data Using Affinity Network Fusion (ANF) for Cancer Patient Clustering, https://arxiv.org/abs/1708.07136
In the following, let's first generate a synthetic dataset and use it for demonstrating the basic usage of ANF.
For complex objects (e.g., patients) with multi-view data, we can use a feature matrix representing each view. For example, gene expression matrix and miRNA expression matrix can represent two views of patients.
In the following, rows of a feature matrix represent objects, and columns represent features. Note each feature matrix contains a number of features from a feature space (corresponding one view). We can concatenate all features together as we will experiment later. However, since features from different feature spaces are usually heterogeneous, it may be a good idea to analyze them in its own feature space first, and then combine the results later. This is basically how ANF works.
For simplicity, let's generate the first view (matrix feature1
) of 200 samples: 100 samples for class 1 (the first 100 rows of matrix feature1
), and 100 samples for class 2 (the last 100 rows of matrix feature1
), using multi-variate Gaussian distribution.
library(MASS) true.class = rep(c(1,2),each=100) feature.mat1 = mvrnorm(100, rep(0, 20), diag(runif(20,0.2,2))) feature.mat2 = mvrnorm(100, rep(0.5, 20), diag(runif(20,0.2,2))) feature1 = rbind(feature.mat1, feature.mat2)
Let's perform KMeans clustering. The Normalized Mutual Information (NMI) is only 0.26.
library(igraph) set.seed(1) km = kmeans(feature1, 2) compare(km$cluster, true.class, method='nmi')
Let's perform spectral clustering using functions in ANF package. The NMI is 0.29, slightly higher than KMeans.
library(ANF) d = dist(feature1) d = as.matrix(d) A1 = affinity_matrix(d, 10) labels = spectral_clustering(A1, 2) compare(labels, true.class, method='nmi')
Similar to the first view, we can generate a second view (matrix feature2
). The rows of feature1
and feature2
have one-to-one correspondence.
feature.mat1 = mvrnorm(100, rep(10, 30), diag(runif(30,0.2,3))) feature.mat2 = mvrnorm(100, rep(9.5, 30), diag(runif(30,0.2,3))) feature2 = rbind(feature.mat1, feature.mat2)
Similarly, the NMI of KMeans clustering and spectral clustering are 0.14 (can be different because of random initialization) and 0.19 respectively.
set.seed(123) km = kmeans(feature2, 2) compare(km$cluster, true.class, method='nmi') d = dist(feature2) d = as.matrix(d) A2 = affinity_matrix(d, 10) labels = spectral_clustering(A2, 2) compare(labels, true.class, method='nmi')
feature.cat = cbind(feature1, feature2) set.seed(1) km = kmeans(feature.cat, 2) compare(km$cluster, true.class, method='nmi')
ANF performs better than KMeans on concatenated features
W = ANF(list(A1, A2), K=30) labels = spectral_clustering(W,2) compare(labels, true.class, method='nmi')
HarmonizedTCGAData
package (https://github.com/BeautyOfWeb/HarmonizedTCGAData) contains three R objects: Wall
, project_ids
and surv.plot
:
Wall
contains a complex list affinity matrices. In fact, Wall
a list (five cancer type) of list (six feature normalization types: raw.all
, raw.sel
, log.all
, log.sel
, vst.sel
, normalized
) of list (three feature spaces or views: fpkm
, mirna
, and methy450
) of matrices. The rownames of each matrix are case IDs (i.e., patient IDs), and the column names of each matrix are aliquot IDs (which contains case IDs as prefix).
project_ids
is a named character vector that maps the case_id (represent a patient) to project_id (one-to-one corresponds to disease type). This is used for evaluating clustering results, such as calculating Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI).
surv.plot
is a data.frame containing patient survival data for survival analysis, providing an "indirect" way to evaluate clustering results.
HarmonizedTCGAData
package contains more details about the above three data objects and simple examples of how to use them. We suggest users to read the vignettes of HarmonizedTCGAData
first since it covers easier examples of using ANF
and HarmonizedTCGAData
packages.
In the following, we are majorly focusing on reproducing the results of the companion paper https://arxiv.org/abs/1708.07136 The code below may be a little harder to follow than simply using ANF
package.
library(ExperimentHub) eh <- ExperimentHub() myfiles <- query(eh, "HarmonizedTCGAData") Wall <- myfiles[[1]] project_ids <- myfiles[[2]] surv.plot <- myfiles[[3]]
We can perform spectral clustering on a patient affinity matrix. Take adrend_gland cancer for example. We can cluseter patients using affinity matrix derived from log2 transformation of raw counts of differentially expressed genes.
affinity.mat <- Wall[["adrenal_gland"]][["log.sel"]][["fpkm"]] labels <- spectral_clustering(affinity.mat, k = 2)
Since we know true disease types, which correspond to project ids in project_ids
, we can calculate NMI and ARI.
true.disease.types <- as.factor(project_ids[rownames(affinity.mat)]) table(labels, true.disease.types) nmi <- igraph::compare(true.disease.types, labels, method = "nmi") adjusted.rand = igraph::compare(true.disease.types, labels, method = "adjusted.rand") # we can also calculate p-value using `surv.plot` data surv.plot <- surv.plot[rownames(affinity.mat), ] f <- survival::Surv(surv.plot$time, !surv.plot$censored) fit <- survival::survdiff(f ~ labels) pval <- stats::pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE) message(paste("NMI =", nmi, ", ARI =", adjusted.rand, ", p-val =", pval))
In this package, We have provided a function eval_clu
that streamlines the above process from spectral clustering to calculating NMI, ARI and p-value. Here is an example of how to use eval_clu
:
res <- eval_clu(project_ids, w = affinity.mat, surv = surv.plot)
For adrenal_gland cancer, we only misclassify one out of 253 patients using this affinity matrix. That is a pretty good result (In fact, this the best result we can achieve. Users can try using other matrices and compare the results). However, for many cases, using a single affinity matrix does a "terrible" job in clustering patients into correct disease types. Take uterus cancer for example (the NMI is near 0).
res <- eval_clu(project_ids, w = Wall$uterus$raw.all$fpkm)
Instead of using one affinity matrix, we can "fuse" multiple affinity matrices using ANF, and then perform spectral clustering on the fused affinity matrix.
Let's take uterus cancer for example.
# fuse three matrices: "fpkm" (gene expression), "mirnas" (miRNA expression) and "methy450" (DNA methylation) fused.mat <- ANF(Wall = Wall$uterus$raw.all) # Spectral clustering on fused patient affinity matrix labels <- spectral_clustering(A = fused.mat, k = 2) # Or we can directly evaluate clustering results using function `eval_clu`, which calls `spectral_clustering` and calculate NMI and ARI (and p-value if patient survival data is available. `surv.plot` does not contain information for uterus cancer patients) res <- eval_clu(true_class = project_ids[rownames(fused.mat)], w = fused.mat)
Now NMI is 0.485. The clusering results is significantly better than using a single pateint affinity matrix. This demonstrates the power of ANF.
We have majorly used ANF to produce results in this paper: https://arxiv.org/abs/1708.07136 To reproduce the results, please refer to https://github.com/BeautyOfWeb/Clustering-TCGAFiveCancerTypes/blob/master/vignettes/ANF%20for%20Cancer%20Patient%20Clustering.Rmd (the last section).
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.