Whole transcriptomic profiles are useful for studying the expression levels
of thousands of genes across samples. Clustering algorithms are used to
identify patterns in these profiles to determine clinically relevant
subgroups. Feature selection is a critical integral part of the process.
Currently, there are many feature selection and clustering methods to identify
the relevant genes and perform clustering of samples. However, choosing the
appropriate methods is difficult as recent work demonstrates that no method is
the clear winner. Hence, we present an R-package called multiClust
that
allows researchers to experiment with the choice of combination of methods
for gene selection and clustering with ease. In addition, using multiClust,
we present the merit of gene selection and clustering methods in the context
of clinical relevance of clustering, specifically clinical outcome.
Our integrative R-package contains:
A function to read in gene expression data and format appropriately for analysis in R.
Four different ways to select the number of genes a. Fixed b. Percent c. Poly d. GMM
Four gene ranking options that order genes based on different statistical criteria a. CV_Rank b. CV_Guided c. SD_Rank d. Poly
Two ways to determine the cluster number a. Fixed b. Gap Statistic
Two clustering algorithms a. Hierarchical clustering b. K-means clustering
A function to calculate average gene expression in each sample cluster
Function Workflow
The seven functions listed below should be used in the following order:
input_file
, a function to read-in a text file containing the matrix
of gene probes and samples to analyze.
number_probes
, a function to determine the number of gene probes to
select for in the feature selection process.
probe_ranking
, a function to rank and select for probes using one of
the available ranking options.
number_clusters
, a function to determine the number of clusters to be
used to cluster gene probes and samples.
cluster_analysis
, a function to perform Kmeans or Hierarchical
clustering analysis of the selected gene probe expression data.
avg_probe_exp
, a function to produce a matrix containing the average
expression of each gene probe within each sample cluster.
surv_analysis
, a function to produce Kaplan-Meier Survival Plots
after the discretization of samples into different clusters.
Other Functions Included
WriteMatrixToFile
, a function to write a data matrix to a text file.
nor.min.max
, a function to that uses feature scaling to normalize
values in a matrix between 0 and 1.
In order to use multiClust
, the user will need two text files.
The first file is a gene probe expression dataset. This file should be a matrix with columns being the samples and the rows being genes or probes.
The second file contains the patient clinical parameters. This file should be a matrix consisting of two columns. The first column contains the patient survival time and the second column contains the patient survival event occurrence. Survival time should be recorded in months. Survival event occurrence should be indicated by a 0 or 1. A 0 indicates censorship and a 1 indicates an event has occurred.
In this example, a gene expression dataset, GSE2034, will be obtained from the Gene Expression Omnibus (GEO) website http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse2034.
The following code can be used to extract the gene expression data and clinical information from the series matrix text zip file. When using this code, be sure that you have a strong internet connection. The GSE datasets are very large!
# Load GEO and biobase libraries library(GEOquery) library(Biobase) library(multiClust) library(preprocessCore) library(ctc) library(gplots) library(dendextend) library(graphics) library(grDevices) library(amap) library(survival)
# Obtain GSE series matrix file from GEO website using getGEO function gse <- getGEO(GEO="GSE2034") # Save the gene expression matrix as an object data.gse <- exprs(gse[[1]]) # Save the patient clinical data to an object pheno <- pData(phenoData(gse[[1]])) # Write the gene expression and clinical data to text files WriteMatrixToFile(tmpMatrix=data.gse, tmpFileName="GSE2034.expression.txt", blnRowNames=TRUE, blnColNames=TRUE) WriteMatrixToFile(tmpMatrix=pheno, tmpFileName="GSE2034.clinical.txt", blnRowNames=TRUE, blnColNames=TRUE)
In the event that an error is produced when trying to obtain GSE datasets from the NCBI GEO FTP site, please use this alternative method:
After downloading the GSE series matrix file and moving it to a directory the following code can be used.
# Obtain GSE series matrix file from GEO website using getGEO function gse <- getGEO(filename="GSE2034_series_matrix.txt.gz") # Save the gene expression matrix as an object data.gse <- exprs(gse[[1]]) # Save the patient clinical data to an object pheno <- pData(phenoData(gse[[1]])) # Write the gene expression and clinical data to text files WriteMatrixToFile(tmpMatrix=data.gse, tmpFileName="GSE2034.expression.txt", blnRowNames=TRUE, blnColNames=TRUE) WriteMatrixToFile(tmpMatrix=pheno, tmpFileName="GSE2034.clinical.txt", blnRowNames=TRUE, blnColNames=TRUE)
Some datasets on GEO are already normalized using Robust Multichip Average (RMA) procedure available in the 'affy' R pacakge or quantile normalization and log2 scaling using the 'preprocessCore' R package.
For datasets that are not normalized, the following code can be used to extract the gene expression matrix from the series matrix text file, quantile normalize the data, log2 scale the data, and lastly write the data to a text file.
# Obtain GSE series matrix file from GEO website using getGEo function gse <- getGEO(GEO="GSE2034") # Save the gene expression matrix as an object data.gse <- exprs(gse) # Quantile normalization of the dataset data.norm <- normalize.quantiles(data.gse, copy=FALSE) # shift data before log scaling to prevent errors from log scaling # negative numbers if (min(data.norm)> 0) { } else { mindata.norm=abs(min(data.norm)) + .001 data.norm=data.norm + mindata.norm } # Log2 scaling of the dataset data.log <- t(apply(data.norm, 1, log2)) # Write the gene expression and clinical data to text files WriteMatrixToFile(tmpMatrix=data.log, tmpFileName="GSE2034.normalized.expression.txt", blnRowNames=TRUE, blnColNames=TRUE)
As mentioned in the beginning of this section, the patient survival information file should be a matrix consisting of two columns. However, when the patient clinical information is written to a text file, it is generally a matrix of more than two columns with all of the clinical parameters.
It is recommended that this text file be loaded in R or Microsoft Excel to obtain the two columns of survival information needed. Below is an example of what the patient survival information file should look like:
# Obtain clinical outcome file clin_file <- system.file("extdata", "GSE2034-RFS-clinical-outcome.txt", package="multiClust") # Load in survival information file clinical <- read.delim2(file=clin_file, header=TRUE) # Display first few rows of the file clinical[1:5, 1:2] # Column one represents the survival time in months # Column two represents the relapse free survival (RFS) event
The first function of multiClust
is used to load a gene expression dataset
into R and format the matrix so that the probe or gene names are the rownames
of the matrix.
In this example, the GSE2034 dataset is loaded into the R console using the
input_file
function.
Function Arguments
Example
# Obtain gene expression matrix exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt", package= "multiClust") # Load the gene expression matrix data.exprs <- input_file(input=exp_file) # View the first few rows and columns of the matrix data.exprs[1:4,1:4]
In this example, the gene expression matrix is stored into the object "data.exprs".
The second function of this package, number_probes
is a function used to
specify the number of desired probes or genes the user would like to select for
in their gene selection analysis.
Function Arguments
This function has multiple arguments. The first argument "input" represents the string name of the expression matrix text file to be loaded into R.
The second argument "data.exp" is the object containing the expression matrix
The last four arguments "Fixed", "Percent", "Poly", and "Adaptive" are the four different methods that determine how the number of genes are selected. The default option in the function is set to "Fixed".
Fixed is an option where the user provides a positive integer to specify desired number of genes or probes to select for. The default value of "Fixed" is set to 1000. If the user writes:
# Example 1: Choosing a fixed gene or probe number # Obtain gene expression matrix exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt", package="multiClust") gene_num <- number_probes(input=exp_file, data.exp=data.exprs, Fixed=300, Percent=NULL, Poly=NULL, Adaptive=NULL)
As a result, 300 genes from the dataset will be selected during the gene or probe selection process.
# Example 2: Choosing 50% of the total selected gene probes in a dataset # Obtain gene expression matrix exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt", package="multiClust") gene_num <- number_probes(input=exp_file, data.exp=data.exprs, Fixed=NULL, Percent=50, Poly=NULL, Adaptive=NULL)
As a result, 50% of the total dataset probes, or 300 probes, will be selected.
# Example 3: Choosing the polynomial selected gene probes in a dataset # Obtain gene expression matrix exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt", package="multiClust") gene_num <- number_probes(input=exp_file, data.exp=data.exprs, Fixed=NULL, Percent=NULL, Poly=TRUE, Adaptive=NULL)
# Example 4: Choosing the Adaptive Gaussian Mixture Modeling method gene_num <- number_probes(input=exp_file, data.exp=data.exprs, Fixed=NULL, Percent=NULL, Poly=NULL, Adaptive=TRUE)
The "Mclust" function from the package mclust https://cran.r-project.org/web/packages/mclust/mclust.pdf is used to apply Gaussian mixture modeling to the inputted gene expression matrix. In addition, files containing information about the dataset's mean, variance, mixing proportion, and gaussian assignment are also outputted.
It should be noted that when choosing one of the three methods of gene or probe number selection, the other methods should be set equal to "NULL". Furthermore, the Adaptive option has a very long computational time.
Function Output
This function returns an object containing the number of genes/probes to select for during the feature selection process.
The third function of this package, probe_ranking
is used to select the most
informative gene probes within a dataset. This function allows the user choose
from one of four different probe ranking algorithms.
Function Arguments
This function has multiple arguments. The first argument "input" represents the string name of the expression matrix text file to be loaded into R.
The second argument "probe_number" is an output of the number probes
function. For this argument, the user specifies the number of genes to
select for.
The third argument "probe_num selection" is a string specifying which
type of probe number selection method was used for the number_probes
function.
The fourth argument "data.exp" is the object containing the expression matrix.
The last argument "method" is a string that specifies which of the four probe ranking methods to use.
List of the Five Probe Ranking Methods
CV_Rank is a gene probe ranking method that selects for probes using the coefficient of variation of the entire dataset (CV).
CV_Guided is a gene probe ranking method that uses the coefficient of variation (CV) for each probe to select for probes.
SD_Rank is a gene probe ranking method that selects for probes with the highest standard deviation within the dataset.
Poly is a ranking method that fits three second degree
polynomial functions of mean and standard deviation to the dataset to select
the most variable probes in the dataset. This is the only method that does
not use a probe number input from the number_probes
function. When using
this ranking method "probe_num" can be set to NULL.
Example
Below is an example of how to use the probe_ranking
function:
# Obtain gene expression matrix exp_file <- system.file("extdata", "GSE2034.normalized.expression.txt", package="multiClust") # Load the gene expression matrix data.exprs <- input_file(input=exp_file) # Call probe_ranking function # Select for 500 probes # Choose genes using the SD_Rank method ranked.exprs <- probe_ranking(input=exp_file, probe_number=300, probe_num_selection="Fixed_Probe_Num", data.exp=data.exprs, method="SD_Rank") # Display the first few columns and rows of the matrix ranked.exprs[1:4,1:4]
Function Outputs
The output of this function is a text file containing a selected gene or probe expression matrix. In addition, the selected expression matrix is stored into the chosen variable "ranked.exprs".
The fourth function is this package, number_clusters
, is a function used to
designate the number of clusters that samples will be assigned to.
Function Arguments
The "data.exp" argument represents an object containing the original gene
expression matrix to be used for clustering of genes and samples.
This object is an output of the input_file
function.
The "Fixed" argument is a positive integer used to represent the number of clusters the samples and probes will be divided into.
The "gap_statistic" argument is a logical indicating whether to calculate
the optimal number of clusters to divide the samples into using a gap
statistic function clus_Gap
from the package cluster
.
Note
The user should only choose either the "Fixed" or "gap_statistic" option, not both. When using the gap_statistic option, change the argument to TRUE and "Fixed" to NULL.
Fixed Cluster Number Example
Below is an example of how to use the number_clusters
function with a fixed
number of clusters:
# Call the number_clusters function # data.exp is the original expression matrix object outputted from # the input_file function # User specifies that samples will be separated into 3 clusters # with the "Fixed" argument cluster_num <- number_clusters(data.exp=data.exprs, Fixed=3, gap_statistic=NULL)
The output from this example will be the object "cluster_num" with a value of 3.
Gap statistic Example
The number_clusters
function can be called in a similar way to use the
gap_statistic option:
# Call the number_clusters function # data.exp is the original expression matrix object ouputted from # the input_file function # User chooses the gap_statistic option by making gap_statistic equal TRUE # The Fixed argument is also set to NULL cluster_num <- number_clusters(data.exp=data.exprs, Fixed=NULL, gap_statistic=TRUE)
The gap_statistic option has a very long computational time and can take up to several hours. The output of this function will be a positive integer indicating the optimal number of clusters to divide samples into.
The fifth function in this package, cluster_analysis
is a function used
to perform Kmeans or Hierarchical clustering analysis of genes/probes and
samples after undergoing a feature selection process in the probe_ranking
function.
Function Arguments:
The first argument "sel.exp", is an object containing the numeric selected
gene expression matrix. This object is an output of the probe_ranking
function.
The second argument "cluster_type" is a string indicating the type of clustering method to use. Kmeans or HClust are the two options.
The third argument "distance" is a string describing the distance metric
to use for Hierarchical clustering via the dist
function. dist
uses
a default distance metric of Euclidean distance. Options include
one of "euclidean", "maximum", manhattan", "canberra", "binary", or
"minkowski". Kmeans does not use a distance metric.
The fourth argument, "linkage_type" is a string describing the linkage
metric to be used for hierarchical clustering in the hclust
function.
The default is "ward.D2", however other options include "average", "complete",
"median", "centroid", "single", and "mcquitty". Kmeans does not use a
linkage metric.
The fifth argument, "gene_distance" is a string describing the distance
measure to be used for the Dist
function when performing hierarchical
clustering of genes. Options include one of "euclidean", "maximum",
"manhattan", "canberra", "binary", "pearson", "abspearson", "correlation",
"abscorrelation", "spearman" or "kendall". The default of gene_distance is
set to "correlation". The argument can be set to NULL when Kmeans
clustering is used.
The sixth argument "num_clusters" is a positive integer to specify the
number of clusters samples will be divided into. This number is
determined by the number_clusters
function.
The seventh argument "data_name" is a string indicating the cancer type and/or name of the dataset being analyzed. This name will be used to label the sample dendrograms and heatmap files.
The eighth argument "probe_rank" is a string indicating the feature
selection method used in the probe_ranking
function. Options include
"CV_Rank", "CV_Guided", "SD_Rank", and "Poly".
The ninth argument "probe_num_selection" is a string indicating the way
in which probes were selected in the number_probes
function. Options
include "Fixed_Probe_Num", "Percent_Probe_Num", "Poly_Probe_Num",
and "Adaptive_Probe_Num".
The tenth argument, "cluster_num_selection" is a string indicating how the number of clusters were determined in the number_clusters function. Options include "Fixed_Clust_Num" and "Gap_Statistic".
Function Outputs
A CSV file containing the sample names and their respective cluster. An object containing a vector of the sample names and their cluster number is returned.
When hierarchical clustering is chosen as the cluster method, a pdf file of the sample dendrogram as well as atr, gtr, and cdt files for viewing in Java TreeView are outputted. In the Java TreeView heatmap, the samples are clustered by the method indicated by the "linkage_type" argument while genes are clustered by the method indicated by the "gene_distance" argument. The default of sample clustering for the Java TreeView heatmap is "ward.D2" and the default for gene clustering is centered pearson "correlation".
Hierarchical Clustering Example
Below is an example of how to use the cluster_analysis
function using the
Hierarchical clustering option:
# Call the cluster_analysis function hclust_analysis <- cluster_analysis(sel.exp=ranked.exprs, cluster_type="HClust", distance="euclidean", linkage_type="ward.D2", gene_distance="correlation", num_clusters=3, data_name="GSE2034 Breast", probe_rank="SD_Rank", probe_num_selection="Fixed_Probe_Num", cluster_num_selection="Fixed_Clust_Num") # Display the first few columns and rows of the object head(hclust_analysis)
Function Outputs
The outputs from this example would be an object "hclust_analysis" containing a vector of the sample names and cluster number. In addition, a sample dendrogram pdf file would be written. Lastly, the atr, gtr, and cdt files are outputted to view a heatmap of the genes and samples in Java TreeView.
Note
The cluster_analysis
function does not display the sample dendrogram
and heatmap in the console. To show what these images would look like,
the following examples are provided below.
Java TreeView Heatmap Example
Below is an example of a heatmap produced in Java TreeView from the atr, gtr, abnd cdt files produced by this function. In this heatmap, rows represent genes and columns represent samples.
# retrieve file path for image fpath <- system.file("extdata", "GSE2034_Breast.SD_Rank.ward.D2.Fixed_Probe_Num.Fixed_Clust_Num.jpg", package="multiClust") knitr::include_graphics(path = fpath)
Sample Dendrogram Example
Below is an example of a sample dendrogram outputted by this function:
# Produce a sample dendrogram using the selected gene/probe expression object # In the function, cluster_analysis the dendrogram is not displayed in # the console. # Make the selected expression object numeric colname <- colnames(ranked.exprs) ranked.exprs <- t(apply(ranked.exprs, 1,as.numeric)) colnames(ranked.exprs) <- colname # Normalization of the selected expression matrix before clustering norm.sel.exp <- t(apply(ranked.exprs, 1, nor.min.max)) hc <- hclust(dist(x=t(norm.sel.exp), method="euclidean"), method="ward.D2") dc <- as.dendrogram(hc) # Portray the samples in each cluster as a different color dc <- set(dc, "branches_k_color", value=1:3, k=3) # Make a vector filled with blank spaces to replace sample names with # blank space vec <- NULL len <- length(labels(dc)) for (i in 1:len) { i <- "" vec <- c(vec, i) } # Remove sample labels from dendrogram dc <- dendextend::set(dc, "labels", vec) # Plot the sample dendrogram plot(dc, main=paste("GSE2034 Breast", "Sample Dendrogram", sep=" ")) # Display orange lines on the plot to clearly separate each cluster rect.dendrogram(dc, k=3, border='orange')
Kmeans Clustering Example
Below is an example of how to use the cluster_analysis
function using the
Kmeans clustering option:
# Call the cluster_analysis function kmeans_analysis <- cluster_analysis(sel.exp=ranked.exprs, cluster_type="Kmeans", distance=NULL, linkage_type=NULL, gene_distance=NULL, num_clusters=3, data_name="GSE2034 Breast", probe_rank="SD_Rank", probe_num_selection="Fixed_Probe_Num", cluster_num_selection="Fixed_Clust_Num") # Display the first few rows and columns of the object head(kmeans_analysis)
Function Outputs
The output from this example would solely be an object "kmeans_analysis" containing the vector of sample names and cluster number. There would be no sample dendrogram file.
The sixth function in this package avg_probe_exp
is used to determine the
average expression of each gene/probe for the samples in a particular
cluster.
Function Arguments
The first argument "sel.exp" is an object containing the numeric selected
gene/prone expression matrix. This object is an output of the probe_ranking
function.
The second argument "samp_cluster" is an object vector containing the
samples and the cluster number they belong to. This is an output of the
cluster_analysis
function.
The third argument "data_name" is a string indicating the cancer type and name of the dataset being analyzed.
The fourth argument "cluster_type" is a string indicating the type of
clustering method used in the cluster_analysis
function. "Kmeans" or
"HClust" are the two options.
The fifth argument "distance" is a string describing the distance metric
used for HClust in the cluster_analysis
function. Options include one of
"euclidean", "maximum", manhattan", "canberra", "binary", or "minkowski".
The sixth argument "linkage_type" is a string describing the linkage
metric used in the HClust method in the cluster_analysis
function.
Options include "ward.D2", "average", "complete", "median", "centroid",
"single", and "mcquitty".
The seventh argument "probe_rank" is a string indicating the feature
selection method used in the probe_ranking
function. Options include
"CV_Rank", "CV_Guided", "SD_Rank", and "Poly".
The eighth argument "probe_num_selection" is a string indicating the way
in which probes were selected in the number_probes
function. Examples
include "Fixed_Probe_Num", "Percent_Probe_Num", "Poly_Probe_Num", and
"Adaptive_Probe_Num".
The ninth argument "cluster_num_selection" is a string indicating how the
number of clusters were determined in the number_clusters
function.
Examples include "Fixed_Clust_Num" and "Gap_Statistic".
Example
Below is an example of how to use the avg_probe_exp
function after
performing Kmeans clustering analysis with the cluster_analysis
function:
# Call the avg_probe_exp function avg_matrix <- avg_probe_exp(sel.exp=ranked.exprs, samp_cluster=kmeans_analysis, data_name="GSE2034 Breast", cluster_type="Kmeans", distance=NULL, linkage_type=NULL, probe_rank="SD_Rank", probe_num_selection="Fixed_Probe_Num", cluster_num_selection="Fixed_Clust_Num") # Display the first few rows and columns of the matrix head(avg_matrix)
Function Outputs
The outputs from this function are a text file containing the matrix of average expression of each probe/gene in each of the different clusters. In addition the object "avg_matrix" would also contain the average expression matrix.
After dividing the selected gene probes and samples into their respective
clusters, the surv_analysis
function can be used to produce Kaplan-Meier
Survival Plots. This function uses a Cox Proportional Hazard Model to portray
the survival probabilities of the patients within each cluster over time.
Function Arguments
The first argument "samp_cluster" is an object vector containing the
samples and the cluster number they belong to. This object is an output
of the cluster_analysis
function.
The second argument "clinical" is a string indicating the name of the text file containing patient clinical information. This file should be a matrix consisting of two columns. The first column contains the patient survival time information in months. The second column indicates the occurrence of a censorship (0) or an event (1).
The third argument "survival_type" is a string specifying the type of survival event being analyzed. Examples include "Disease-free survival (DFS)", "Overall Survival (OS)", "Relapse-free survival (RFS)", etc.
The fourth argument "data_name" is a string indicating the name to be used to label the Kaplan-Meier Survival Plot.
The fifth argument "cluster_type" is a string indicating the type of
clustering method used in the cluster_analysis
function.
"Kmeans" or "HClust" are the two options.
The sixth argument "distance" is a string describing the distance
metric uses for HClust in the cluster_analysis
function. Options
include one of "euclidean", "maximum", manhattan", "canberra", "binary",
or "minkowski".
The seventh argument "linkage_type" is a string describing the linkage
metric used in the cluster_analysis
function. Options include "ward.D2",
"average", "complete", "median", "centroid", "single", and "mcquitty".
The eighth argument "probe_rank" is a string indicating the feature
selection method used in the probe_ranking
function. Options include
"CV_Rank", "CV_Guided", "SD_Rank", and "Poly".
The ninth argument "probe_num_selection" is a string indicating the way in
which probes were selected in the number_probes
function. Options
include "Fixed_Probe_Num", "Percent_Probe_Num", "Poly_Probe_Num",
and "Adaptive_Probee_Num".
The tenth argument "cluster_num_selection" is a string indicating how
the number of clusters were determined in the number_clusters
function.
Options include "Fixed_Clust_Num" and "Gap_Statistic".
Below is an example of how to use the surv_analysis
function:
Example
# Obtain clinical outcome file clin_file <- system.file("extdata", "GSE2034-RFS-clinical-outcome.txt", package="multiClust") # Call the avg_probe_exp function surv <- surv_analysis(samp_cluster=kmeans_analysis, clinical=clin_file, survival_type="RFS", data_name="GSE2034 Breast", cluster_type="Kmeans", distance=NULL, linkage_type=NULL, probe_rank="SD_Rank", probe_num_selection="Fixed_Probe_Num", cluster_num_selection="Fixed_Clust_Num") # Display the survival P value surv
Function Outputs
This function outputs a pdf Kaplan-Meier Survival Plot of the patient survival time in months vs. the survival probability. In this example, the object "surv" will also contain the Cox Survival P value.
Note
The surv_analysis
function also does not display the Kaplan-Meier plot
in the console, but rather outputs the plot as a pdf file. Below is an
example of what the plots would look like.
Kaplan-Meier Survival Plot
# Obtain clinical outcome file clin_file <- system.file("extdata", "GSE2034-RFS-clinical-outcome.txt", package="multiClust") # Produce a Kaplan-Meier plot using the sample cluster information # from the cluster_analysis function # Read in survival data text file sample.anns <- read.delim2(file=clin_file, header=TRUE, stringsAsFactors=FALSE) time <- sample.anns[,1] event <- sample.anns[,2] time <- as.numeric(time) event <- as.numeric(event) event_type <- "RFS" time_type <- 'Months' # Determine max number of clusters in samp_cluster samp_cluster <- kmeans_analysis Number_Clusters <- max(samp_cluster) # Produce Kaplan Meier survival plots lev <- '1' for(i in 2:Number_Clusters) { lev <- c(lev, paste('',i, sep='')) } groupfac <- factor(samp_cluster, levels=lev) cox <- coxph(Surv(time, event==1) ~ groupfac) csumm <- summary(cox) cox.p.value <- c(csumm$logtest[3]) p <- survfit(Surv(time,event==1)~strata(samp_cluster)) plot (p,xlab= time_type ,ylab='Survival Probability',conf.int=FALSE, xlim=c(0,100), col=1:Number_Clusters, main=paste("GSE2034 Breast", "Kmeans", "SD_Rank", "RFS", sep =' ')) legend('bottomleft',legend=levels(strata(samp_cluster)),lty=1, col=1:Number_Clusters, text.col=1:Number_Clusters, title='clusters') legend('topright', paste('Pvalue =', signif(csumm$logtest[3],digits=2), sep =''))
Davis, S. and Meltzer, P. S. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 2007, 14, 1846-1847
Orchestrating high-throughput genomic analysis with Bioconductor. W. Huber, V.J. Carey, R. Gentleman, ..., M. Morgan Nature Methods, 2015:12, 115.
Benjamin Milo Bolstad. preprocessCore: A collection of pre-processing functions. R package version 1.30.0.
Chris Fraley, Adrian E. Raftery, T. Brendan Murphy, and Luca Scrucca (2012) mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation Technical Report No. 597, Department of Statistics, University of Washington
Antoine Lucas and Laurent Gautier (2005). ctc: Cluster and Tree Conversion.. R package version 1.42.0. http://antoinelucas.free.fr/ctc
Gregory R. Warnes, Ben Bolker, Lodewijk Bonebakker, Robert Gentleman, Wolfgang Huber Andy Liaw, Thomas Lumley, Martin Maechler, Arni Magnusson, Steffen Moeller, Marc Schwartz and Bill Venables (2015). gplots: Various R Programming Tools for Plotting Data. R package version 2.17.0. http://CRAN.R-project.org/package=gplots
Therneau T (2015). _A Package for Survival Analysis in S. version 2.38,
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.(2015). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.3.
Tal Galili (2015). dendextend: Extending R's Dendrogram Functionality. R package version 1.0.1. http://CRAN.R-project.org/package=dendextend
R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Saldanha, A. J. "Java Treeview--extensible Visualization of Microarray Data."Bioinformatics 20.17 (2004): 3246-248.
Antoine Lucas (2014). amap: Another Multidimensional Analysis Package. R package version 0.8-14. http://CRAN.R-project.org/package=amap
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.