# If the Umpire package is unavailable, skip evaluating the code chunks.
if (!require(Umpire)) {
  knitr::opts_chunk$set(eval = FALSE)
}
In this vignette, we illustrate how to apply the GenAlgo package to the problem of feature selection in an "omics-scale" data set. We start by loading the packages that we will need.
library(GenAlgo)
library(Umpire)
library(oompaBase)
We will use the Umpire package to simulate a data set complex enough to stress our feature selection algorithm. We begin by setting up a time-to-event model, built on an exponential baseline.
set.seed(391629)
sm <- SurvivalModel(baseHazard=1/5, accrual=5, followUp=1)
Next, we build a "cancer model" with six subtypes.
nBlocks <- 20 # number of possible hits
cm <- CancerModel(name="cansim",
                  nPossible=nBlocks,
                  nPattern=6,
                  OUT = function(n) rnorm(n, 0, 1),
                  SURV= function(n) rnorm(n, 0, 1),
                  survivalModel=sm)
### Include 100 blocks/pathways that are not hit by cancer
nTotalBlocks <- nBlocks + 100
Now define the hyperparameters for the models.
### block size
blockSize <- round(rnorm(nTotalBlocks, 100, 30))
### log normal mean hypers
mu0 <- 6
sigma0 <- 1.5
### log normal sigma hypers
rate <- 28.11
shape <- 44.25
### block corr
p <- 0.6
w <- 5
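As a quick aside (the next two lines are an illustrative addition, not part of the original simulation), these hyperparameters have simple interpretations: the Beta(p*w, (1-p)*w) prior on the within-block correlation has mean p, and the reciprocal of a Gamma(shape, rate) variable has mean rate/(shape - 1).

p*w / (p*w + (1-p)*w)   # mean within-block correlation: p = 0.6
rate / (shape - 1)      # mean per-gene standard deviation: about 0.65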
Now set up the baseline "Engine".
rho <- rbeta(nTotalBlocks, p*w, (1-p)*w)
base <- lapply(1:nTotalBlocks,
               function(i) {
                 bs <- blockSize[i]
                 co <- matrix(rho[i], nrow=bs, ncol=bs)
                 diag(co) <- 1
                 mu <- rnorm(bs, mu0, sigma0)
                 sigma <- matrix(1/rgamma(bs, rate=rate, shape=shape), nrow=1)
                 covo <- co * (t(sigma) %*% sigma)
                 MVN(mu, covo)
               })
eng <- Engine(base)
We alter the block means wherever the cancer model scores a hit; otherwise, we keep the original engine components.
altered <- alterMean(eng, normalOffset, delta=0, sigma=1)
object <- CancerEngine(cm, eng, altered)
summary(object)
rm(altered, base, blockSize, cm, eng, mu0, nBlocks, nTotalBlocks,
p, rate, rho, shape, sigma0, sm, w)
Now we can use this elaborate setup to generate the simulated data.
train <- rand(object, 198)
tdata <- train$data
pid <- paste("PID", sample(1001:9999, 198+93), sep='')
rownames(train$clinical) <- colnames(tdata) <- pid[1:198]
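As a sanity check (an illustrative addition), the data component should be a feature-by-sample matrix and the clinical component should have one row per sample:

dim(train$data)       # features by 198 samples
dim(train$clinical)   # 198 samples by clinical variables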
Of course, to make things harder, we will add noise to the simulated measurements on the raw (unlogged) scale.
noise <- NoiseModel(3, 1, 1e-16)
train$data <- log2(blur(noise, 2^(tdata)))
sum(is.na(train$data))
rm(tdata)
summary(train$clinical)
summary(train$data[, 1:3])
Now we can also simulate a validation data set, adding the same kind of noise and imputing a small constant for any missing values it introduces.
valid <- rand(object, 93)
vdata <- valid$data
vdata <- log2(blur(noise, 2^(vdata))) # add noise
sum(is.na(vdata))
vdata[is.na(vdata)] <- 0.26347
valid$data <- vdata
colnames(valid$data) <- rownames(valid$clinical) <- pid[199:291]
rm(vdata, noise, object, pid)
summary(valid$clinical)
summary(valid$data[, 1:3])
Now we can start using the GenAlgo package. The key step is to define sensible functions that measure the "fitness" of a candidate solution and that introduce "mutations". When these functions are called, they are passed a context argument that gives access to extra information about how to proceed. In this case, the context will be the train object, which includes the clinical information about the samples.
Now we can define the fitness function. The idea is to compute the Mahalanobis distance between the two groups (of "Good" or "Bad" outcome samples) in the space defined by the selected features.
measureFitness <- function(arow, context) {
  predictors <- t(context$data[arow, ]) # space defined by features
  groups <- context$clinical$Outcome    # good or bad outcome
  maha(predictors, groups, method='var')
}
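To build intuition for what maha measures, here is a self-contained toy sketch of the Mahalanobis distance between two group centroids, using a simple pooled covariance estimate. This sketch is our own illustration; the exact pooling used by GenAlgo's maha with method='var' may differ in detail.

toy <- matrix(rnorm(60), ncol=3)      # 20 toy samples, 3 features
grp <- rep(c("Good", "Bad"), each=10)
d <- colMeans(toy[grp == "Good", ]) - colMeans(toy[grp == "Bad", ])
S <- (cov(toy[grp == "Good", ]) + cov(toy[grp == "Bad", ])) / 2  # pooled covariance
sqrt(drop(t(d) %*% solve(S) %*% d))   # Mahalanobis distance between centroids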
The mutation function randomly selects any other feature (that is, any row of the data set) to swap in as a possible predictor of the outcome.
mutator <- function(allele, context) {
  sample(1:nrow(context$data), 1)
}
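For example, once the train object has been built, each call simply draws a uniformly random feature index, ignoring the current allele value:

mutator(allele=1, context=train)  # a random integer between 1 and nrow(train$data)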
We need to decide how many features to include in a potential predictor (here we use ten). We also need to decide how big a population of feature-sets (here we use 200) should be used in each generation of the genetic algorithm.
set.seed(821831)
n.individuals <- 200
n.features <- 10
y <- matrix(0, n.individuals, n.features)
for (i in 1:n.individuals) {
  y[i,] <- sample(1:nrow(train$data), n.features)
}
Having chosen the starting population, we can run the first step of the genetic algorithm.
my.ga <- GenAlg(y, measureFitness, mutator, context=train) # initialize
summary(my.ga)
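Using only the slots that this vignette already relies on (data and fitness), we can also peek at the fittest feature set in the starting generation; this check is an illustrative addition.

my.ga@data[which.max(my.ga@fitness), ]  # row indices of the fittest individual's ten features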
To be able to evaluate things later, we save the starting generation.
recurse <- my.ga
pop0 <- sort(table(as.vector(my.ga@data)))
Realistically, we would probably want to run a couple of hundred or even a couple of thousand iterations of the algorithm. But, in the interests of completing the vignette in a reasonable amount of time, we are only going to iterate through 20 generations.
NGEN <- 20
diversity <- meanfit <- fitter <- rep(NA, NGEN)
for (i in 1:NGEN) {
  recurse <- newGeneration(recurse)
  fitter[i] <- recurse@best.fit
  meanfit[i] <- mean(recurse@fitness)
  diversity[i] <- popDiversity(recurse)
}
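For a longer run, rather than fixing the number of generations in advance, one could stop when the best fitness has not improved for some window of generations. Below is a minimal sketch (not run here) using the same GenAlg accessors as above; the 50-generation patience window is an arbitrary, illustrative choice.

longrun <- my.ga
best <- longrun@best.fit
stale <- 0
while (stale < 50) {
  longrun <- newGeneration(longrun)
  if (longrun@best.fit > best) {
    best <- longrun@best.fit
    stale <- 0
  } else {
    stale <- stale + 1
  }
}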
Plot max and mean fitness by generation. This figure shows that both the mean and the maximum fitness are increasing.
plot(fitter, type='l', ylim=c(0, 1.5), xlab="Generation", ylab="Fitness")
abline(h=max(fitter), col='gray', lty=2)
lines(fitter)
lines(meanfit, col='gray')
points(meanfit, pch=16, col=jetColors(NGEN))
legend("bottomleft", c("Maximum", "Mean"), col=c("black", "gray"), lwd=2)
Plot the diversity of the population, to see that it is decreasing.
plot(diversity, col='gray', type='l', ylim=c(0,10), xlab='', ylab='', yaxt='n')
points(diversity, pch=16, col=jetColors(NGEN))
See which predictors get selected most frequently in the latest generation.
sort(table(as.vector(recurse@data)))
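Because we saved the starting population as pop0, we can also check how often the most frequently selected features of the final generation appeared in generation zero; features that were absent then show up as NA. (This comparison is an illustrative addition.)

final <- sort(table(as.vector(recurse@data)))
pop0[names(tail(final, 5))]  # generation-zero counts of the five most-selected features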