Machine learning methods have long been central to computational biology. The BiocAIML package aims to illustrate the use of new approaches to machine learning with R in the context of genome research.
An illustrative dataset derived from the Cancer Genome Atlas (TCGA)
is available with the BiocAIML package. This dataset will
be retrieved from the cloud using r BiocStyle::Biocpkg("curatedTCGAData")
,
massaged to include clinical data published in
@Brennan2013
and cached for future use.
suppressPackageStartupMessages({ library(BiocAIML) library(survival) library(rpart) library(SummarizedExperiment) }) gbmse = build_gbm_se() gbmse
As a sanity check, we show that MGMT methylation status is associated with longer survival times in this dataset.
xm = gbmse[, gbmse$mgmt_status !="" & gbmse$vital_status !=""] ss = Surv(xm$os_days, 1*(xm$vital_status=="DECEASED")) plot(survfit(ss~xm$mgmt_status), lty=1:2) legend(900, .95, lty=c(1,2), legend=c("MGMT methylated", "unmethylated")) title("Time-on-study/vital status for 123 GBM patients\nanalyzed in Brennan et al. PMID 24120142")
Let's pick a random sample of 100 genes and classify the
'expression-based' subtype of GBM using r CRANpkg("survival")
's
rpart
.
set.seed(1234) xms = xm[sample(seq_len(nrow(xm)), size=100),] xmsdf = data.frame(cl=xms$expression_subclass, t(assay(xms))) rp1 = rpart(cl~., data=xmsdf) tt = table(predicted=predict(rp1, type="class"), given=na.omit(xmsdf$cl)) tt
There are r sum(tt)-sum(diag(tt))
"errors" in 122 predictions.
The associated tree is:
plot(rp1) text(rp1)
and the cross-validated error profile is
plotcp(rp1)
This indicates that the best tree obtainable with these features (100 randomly sampled genes) has 8 nodes and a relative error (compared to declaring all patients to have the majority class) of around 80%.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.