
tune.block.splsda

R Documentation

Tuning function for block.splsda method (N-integration with sparse Discriminant Analysis)

Description

Computes M-fold or Leave-One-Out Cross-Validation scores based on a user-input grid to determine the optimal parameters for method block.splsda.

Usage

tune.block.splsda(
  X,
  Y,
  indY,
  ncomp = 2,
  tol = 1e-06,
  max.iter = 100,
  near.zero.var = FALSE,
  design,
  scale = TRUE,
  test.keepX,
  already.tested.X,
  validation = "Mfold",
  folds = 10,
  nrepeat = 1,
  signif.threshold = 0.01,
  dist = "max.dist",
  measure = "BER",
  weighted = TRUE,
  progressBar = FALSE,
  light.output = TRUE,
  BPPARAM = SerialParam(),
  seed = NULL
)

Arguments

`X`	A named list of data sets (called 'blocks') measured on the same samples. Data in the list should be arranged in matrices, samples x variables, with samples order matching in all data sets.
`Y`	a factor or a class vector for the discrete outcome.
`indY`	To supply if `Y` is missing, indicates the position of the matrix response in the list `X`.
`ncomp`	the number of components to include in the model. Default to 2. Applies to all blocks.
`tol`	Positive numeric used as convergence criteria/tolerance during the iterative process. Default to `1e-06`.
`max.iter`	Integer, the maximum number of iterations. Default to 100.
`near.zero.var`	Logical, see the internal `nearZeroVar` function (should be set to TRUE in particular for data with many zero values). Setting this argument to FALSE (when appropriate) will speed up the computations. Default value is FALSE.
`design`	numeric matrix of size (number of blocks in X) x (number of blocks in X) with values between 0 and 1. Each value indicates the strength of the relationship to be modelled between two blocks; a value of 0 indicates no relationship, 1 is the maximum value. Alternatively, one of c('null', 'full') indicating a disconnected or fully connected design, respectively, or a numeric between 0 and 1 which will designate all off-diagonal elements of a fully connected design (see examples in `block.splsda`). If `Y` is provided instead of `indY`, the `design` matrix is changed to include relationships to `Y`.
`scale`	Logical. If scale = TRUE, each block is standardized to zero means and unit variances (default: TRUE)
`test.keepX`	A named list with the same length and names as X (without the outcome Y, if it is provided in X and designated using `indY`). Each entry of this list is a numeric vector for the different keepX values to test for that specific block. If set to NULL, ncomp is tuned.
`already.tested.X`	Optional, used if `ncomp > 1`. A named list of numeric vectors, each of length `n_tested`, indicating the number of variables to select from each `X` data set on the first `n_tested` components.
`validation`	character. What kind of (internal) validation to use, matching one of `"Mfold"` or `"loo"` (see below). Default is `"Mfold"`.
`folds`	the number of folds in the Mfold cross-validation. See Details.
`nrepeat`	Number of times the Cross-Validation process is repeated.
`signif.threshold`	numeric between 0 and 1 indicating the significance threshold required for improvement in error rate of the components. Default to 0.01.
`dist`	distance metric used to estimate the classification error rate; should be one of "centroids.dist", "mahalanobis.dist" or "max.dist" (see Details). If `test.keepX = NULL`, "all" or more than one distance metric can also be supplied.
`measure`	only used when `test.keepX` is not NULL. Measure used when plotting, should be 'BER' or 'overall'
`weighted`	Logical; tune using the performance of either the Weighted vote (`TRUE`, the default) or the Majority vote (`FALSE`).
`progressBar`	Logical; set to `TRUE` to output the progress bar of the computation. Default is `FALSE`.
`light.output`	if set to FALSE, the prediction/classification of each sample for each of `test.keepX` and each comp is returned.
`BPPARAM`	A BiocParallelParam object indicating the type of parallelisation. See examples.
`seed`	set a number here if you want the function to give reproducible outputs. Not recommended during exploratory analysis. Note if RNGseed is set in 'BPPARAM', this will be overwritten by 'seed'.
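
To illustrate the `design` argument, here is a minimal base R sketch of what a single numeric value corresponds to when written out as a matrix by hand. The block names and the value 0.1 are assumptions chosen for illustration; the diagonal is set to 0, following the convention used in the Examples section.

```r
# Hypothetical block names, for illustration only
block_names <- c("mrna", "mirna", "protein")
off_diag <- 0.1  # assumed strength of the relationship between every pair of blocks

# start from a matrix filled with the off-diagonal value ...
design <- matrix(off_diag,
                 nrow = length(block_names), ncol = length(block_names),
                 dimnames = list(block_names, block_names))
# ... then remove the relationship of each block with itself
diag(design) <- 0
design
```

Lower off-diagonal values put less emphasis on the agreement between blocks, and a value of 0 everywhere corresponds to the 'null' design.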

Details

This tuning function should be used to tune the number of components and the keepX parameters in the block.splsda function (N-integration with sparse Discriminant Analysis).

M-fold or LOO cross-validation is performed with stratified subsampling where all classes are represented in each fold.

If validation = "Mfold", M-fold cross-validation is performed. The number of folds to generate is to be specified in the argument folds.

If validation = "loo", leave-one-out cross-validation is performed. By default folds is set to the number of unique individuals.
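
The idea of stratified fold assignment can be sketched in base R as follows. This is not the package's internal implementation, only an illustration under assumed toy data: samples are shuffled within each class and dealt out across the folds so that every class is represented in every fold.

```r
# Toy outcome with unbalanced classes (illustrative data, not from the package)
Y <- factor(rep(c("A", "B", "C"), times = c(12, 6, 9)))
folds <- 3

set.seed(1)
fold_id <- integer(length(Y))
for (cl in levels(Y)) {
  idx <- which(Y == cl)
  # shuffle the samples of this class, then deal them out across the folds in turn
  idx <- idx[sample.int(length(idx))]
  fold_id[idx] <- rep_len(seq_len(folds), length(idx))
}

# rows are classes, columns are folds: no cell should be empty
table(Y, fold_id)
```

For leave-one-out cross-validation each sample forms its own fold, so no stratification is possible in this sense.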

All combinations of test.keepX values are tested. A message reports how many models will be fitted on each component for a given test.keepX.
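
As a rough guide to the size of the grid, the combinations can be enumerated with base R's `expand.grid`. This only counts the combinations for an assumed `test.keepX`; how the function enumerates them internally may differ.

```r
# Assumed grid: two candidate keepX values for each of three blocks
test.keepX <- list(mrna = c(10, 30), mirna = c(15, 25), protein = c(4, 8))

# one row per combination of keepX values across the blocks
grid <- expand.grid(test.keepX)
nrow(grid)  # 2 * 2 * 2 = 8 combinations per component
```

The grid grows multiplicatively with the number of blocks and candidate values, which is worth keeping in mind when choosing `test.keepX`, `folds` and `nrepeat`.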

More details about the prediction distances can be found in ?predict and in the supplemental material of the mixOmics article (Rohart et al. 2017). Details about the PLS modes are in ?pls.

BER (balanced error rate) is appropriate in case of an unbalanced number of samples per class, as it averages the proportion of wrongly classified samples within each class, so that each class contributes equally regardless of its size. BER is therefore less biased towards majority classes during the performance assessment.
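
The difference between the two measures can be seen on a small made-up example. The sketch below uses the common definition of BER as the mean of the per-class error rates, which is assumed here to match the description above; the class labels and predictions are invented.

```r
# Toy data: 8 samples of class A, 2 samples of class B (illustrative only)
truth <- factor(c(rep("A", 8), rep("B", 2)))
pred  <- factor(c(rep("A", 8), "A", "B"), levels = levels(truth))

# rows are true classes, columns are predicted classes
confusion <- table(truth, pred)

# proportion of wrongly classified samples within each class
per_class_error <- 1 - diag(confusion) / rowSums(confusion)

overall <- mean(pred != truth)   # 1 error out of 10 samples = 0.1
ber     <- mean(per_class_error) # (0 + 0.5) / 2 = 0.25
```

The overall error rate looks low because the majority class is predicted perfectly, while BER exposes that half of the minority class is misclassified.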

Value

A list that contains:

`error.rate`	returns the prediction error for each `test.keepX` on each component, averaged across all repeats and subsampling folds. Standard deviation is also output. All error rates are also available as a list.
`choice.keepX`	returns the number of variables selected (optimal keepX) on each component, for each block.
`choice.ncomp`	returns the optimal number of components for the model fitted with `$choice.keepX`.
`error.rate.class`	returns the error rate for each level of `Y` and for each component computed with the optimal keepX
`predict`	Prediction values for each sample, each `test.keepX`, each comp and each repeat. Only if light.output=FALSE
`class`	Predicted class for each sample, each `test.keepX`, each comp and each repeat. Only if light.output=FALSE
`cor.value`	compute the correlation between latent variables for two-factor sPLS-DA analysis.

If test.keepX = NULL, returns:

`error.rate`	Prediction error rate for each block of `object$X` and each `dist`
`error.rate.per.class`	Prediction error rate for each block of `object$X`, each `dist` and each class
`predict`	Predicted values of each sample for each class, each block and each component
`class`	Predicted class of each sample for each block, each `dist`, each component and each nrepeat
`features`	a list of features selected across the folds (`$stable.X` and `$stable.Y`) for the `keepX` and `keepY` parameters from the input object.
`AveragedPredict.class`	if more than one block, returns the average predicted class over the blocks (average of the `Predict` output and prediction using the `max.dist` distance)
`AveragedPredict.error.rate`	if more than one block, returns the average predicted error rate over the blocks (using the `AveragedPredict.class` output)
`WeightedPredict.class`	if more than one block, returns the weighted predicted class over the blocks (weighted average of the `Predict` output and prediction using the `max.dist` distance). See details for more info on weights.
`WeightedPredict.error.rate`	if more than one block, returns the weighted average predicted error rate over the blocks (using the `WeightedPredict.class` output)
`MajorityVote`	if more than one block, returns the majority class over the blocks. NA for a sample means that there is no consensus on the predicted class for this particular sample over the blocks.
`MajorityVote.error.rate`	if more than one block, returns the error rate of the `MajorityVote` output
`WeightedVote`	if more than one block, returns the weighted majority class over the blocks. NA for a sample means that there is no consensus on the predicted class for this particular sample over the blocks.
`WeightedVote.error.rate`	if more than one block, returns the error rate of the `WeightedVote` output
`weights`	Returns the weights of each block used for the weighted predictions, for each nrepeat and each fold
`choice.ncomp`	For supervised models; returns the optimal number of components for the model for each prediction distance using one-sided t-tests that test for a significant difference in the mean error rate (gain in prediction) when components are added to the model. See more details in Rohart et al 2017 Suppl. For more than one block, an optimal ncomp is returned for each prediction framework.
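
The `MajorityVote` output described above can be illustrated with a small base R sketch. The per-block predictions and the helper `majority_vote` are hypothetical and written for illustration only; they are not part of the package.

```r
# Hypothetical per-block class predictions for four samples (illustrative only)
pred <- data.frame(
  mrna    = c("Basal", "LumA", "Her2", "Basal"),
  mirna   = c("Basal", "LumA", "LumA", "Her2"),
  protein = c("Basal", "Her2", "Basal", "Her2"),
  stringsAsFactors = FALSE
)

# hypothetical helper: most frequent class among the blocks, NA if tied
majority_vote <- function(votes) {
  counts <- table(votes)
  top <- names(counts)[counts == max(counts)]
  # a tie between classes means no consensus, reported as NA
  if (length(top) == 1) top else NA_character_
}

# one vote per sample (row), taken across the blocks (columns)
vote <- apply(pred, 1, majority_vote)
vote
```

The third sample receives a different class from each block, so there is no consensus and its vote is `NA`. A weighted vote would additionally give each block's prediction a block-specific weight before counting.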

Author(s)

Florian Rohart, Amrit Singh, Kim-Anh Lê Cao, AL J Abadi

References

Method:

Singh A., Gautier B., Shannon C., Vacher M., Rohart F., Tebbutt S. and Lê Cao K.A. (2016). DIABLO: multi omics integration for biomarker discovery.

mixOmics article:

Rohart F., Gautier B., Singh A. and Lê Cao K.A. (2017). mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol 13(11): e1005752

Examples

## Set up data

# load the package and the data
library(mixOmics)
data("breast.TCGA")

# X data - list of mRNA, miRNA and protein blocks
X <- list(mrna = breast.TCGA$data.train$mrna, mirna = breast.TCGA$data.train$mirna,
          protein = breast.TCGA$data.train$protein)

# Y data - outcome: subtype of each sample
Y <- breast.TCGA$data.train$subtype

# subset the X and Y data to speed up computation in this example
set.seed(100)
subset <- mixOmics:::stratified.subsampling(breast.TCGA$data.train$subtype, folds = 3)[[1]][[1]]
X <- lapply(X, function(omic) omic[subset,])
Y <- Y[subset]

# set up a full design where every block is connected
# could also consider other weights, see our mixOmics manuscript
design = matrix(1, ncol = length(X), nrow = length(X),
                dimnames = list(names(X), names(X)))
diag(design) =  0
design

## Tune number of components to keep
tune_res <- tune.block.splsda(X, Y, design = design,
                              ncomp = 5,
                              test.keepX = NULL,
                              validation = "Mfold", nrepeat = 3,
                              dist = "all", measure = "BER",
                              seed = 13)

plot(tune_res)

tune_res$choice.ncomp # 3 components best

## Tune number of variables to keep

# definition of the keepX value to be tested for each block mRNA miRNA and protein
# names of test.keepX must match the names of X
test.keepX = list(mrna = c(10, 30), mirna = c(15, 25), protein = c(4, 8))

# load parallel package
library(BiocParallel)

# run tuning in parallel on 2 cores, output plot on overall error
tune_res <- tune.block.splsda(X, Y, design = design,
                              ncomp = 2,
                              test.keepX = test.keepX,
                              validation = "Mfold", nrepeat = 3,
                              measure = "overall",
                              seed = 13, BPPARAM = SnowParam(workers = 2))

plot(tune_res)
tune_res$choice.keepX

# Now tuning a new component given previous tuned keepX
already.tested.X <- tune_res$choice.keepX
tune_res <- tune.block.splsda(X, Y, design = design,
                              ncomp = 3,
                              test.keepX = test.keepX,
                              validation = "Mfold", nrepeat = 3,
                              measure = "overall",
                              seed = 13, BPPARAM = SnowParam(workers = 2),
                              already.tested.X = already.tested.X)
tune_res$choice.keepX

mixOmicsTeam/mixOmics documentation built on Feb. 13, 2025, 4:39 a.m.