pca | R Documentation |
Performs a principal components analysis on the given data matrix, which can contain missing values. If the data are complete, 'pca' uses Singular Value Decomposition; if there are missing values, it uses the NIPALS algorithm.
pca(
X,
ncomp = 2,
center = TRUE,
scale = FALSE,
max.iter = 500,
tol = 1e-09,
logratio = c("none", "CLR", "ILR"),
ilr.offset = 0.001,
V = NULL,
multilevel = NULL,
verbose.call = FALSE
)
X |
a numeric matrix (or data frame) which provides the data for the
principal components analysis. It can contain missing values in which case
|
ncomp |
Integer, if data is complete |
center |
(Default=TRUE) Logical, whether the variables should be shifted
to be zero centered. Only set to FALSE if data have already been centered.
Alternatively, a vector of length equal to the number of columns of |
scale |
(Default=FALSE) Logical indicating whether the variables should be
scaled to have unit variance before the analysis takes place. The default is
|
max.iter |
Integer, the maximum number of iterations in the NIPALS algorithm. |
tol |
Positive real, the tolerance used in the NIPALS algorithm. |
logratio |
(Default='none') One of ('none','CLR','ILR'). Specifies the log ratio transformation used to deal with compositional values that may arise from specific normalisation in sequencing data. |
ilr.offset |
(Default=0.001) When logratio is set to 'ILR', an offset must be supplied to avoid infinite values after the logratio transform. |
V |
Matrix used in the logratio transformation if provided. |
multilevel |
Sample information for multilevel decomposition of repeated measurements. |
verbose.call |
Logical (Default=FALSE), if set to TRUE then the |
The calculation is done either by a singular value decomposition of the
(possibly centered and scaled) data matrix if the data are complete, or by
using the NIPALS algorithm if there are missing data. Unlike princomp, the
print method for these objects prints the results in a nice format and the
plot method produces a bar plot of the percentage of variance explained by
the principal components (PCs).
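The complete-data path described above can be illustrated in base R. This is a minimal sketch under stated assumptions, not the mixOmics code: the simulated matrix and the variable names are hypothetical, and stats::prcomp is used only as an independent cross-check.

```r
# Minimal sketch in base R (not the mixOmics implementation).
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)  # hypothetical complete data

# Centre the columns, then decompose the centred matrix.
Xc <- scale(X, center = TRUE, scale = FALSE)
sv <- svd(Xc)

ncomp <- 2
loadings <- sv$v[, 1:ncomp]         # right singular vectors (variable loadings)
variates <- Xc %*% loadings         # sample coordinates on the first PCs
sdev <- sv$d / sqrt(nrow(X) - 1)    # standard deviations of the PCs

# Cross-check against stats::prcomp (signs of components may differ in general).
ref <- prcomp(X, center = TRUE, scale. = FALSE)
all.equal(sdev, ref$sdev)
all.equal(abs(variates), abs(ref$x[, 1:ncomp]), check.attributes = FALSE)
```

The loadings are the right singular vectors of the centred matrix, and the variates are the projections of the samples onto them.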
When using NIPALS (missing values), we make the assumption that the first
min(ncol(X), nrow(X)) principal components will account for 100% of the
explained variance.
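To make the missing-value path more concrete, here is a hedged sketch of a NIPALS-style iteration in base R. The helper name nipals_sketch, its convergence test, and its starting vector are assumptions made for illustration; the nipals function in the package may differ in all of these details. The sketch assumes no row or column is entirely missing.

```r
# Hypothetical helper for illustration; not the mixOmics nipals() function.
nipals_sketch <- function(X, ncomp = 2, max.iter = 500, tol = 1e-09) {
  X <- as.matrix(X)
  X <- sweep(X, 2, colMeans(X, na.rm = TRUE))  # centre using observed values
  W <- ifelse(is.na(X), 0, 1)                  # 1 where observed, 0 where missing
  R <- X
  R[is.na(X)] <- 0                             # zeros drop out of the sums below
  scores <- matrix(0, nrow(X), ncomp)
  loadings <- matrix(0, ncol(X), ncomp)
  for (h in seq_len(ncomp)) {
    t_h <- R[, which.max(colSums(R^2))]        # starting score vector
    for (iter in seq_len(max.iter)) {
      # Regress each column on the scores, using observed entries only.
      p_h <- drop(crossprod(R, t_h) / crossprod(W, t_h^2))
      p_h <- p_h / sqrt(sum(p_h^2))            # unit-length loading vector
      # Regress each row on the loadings, using observed entries only.
      t_new <- drop((R %*% p_h) / (W %*% p_h^2))
      converged <- sum((t_new - t_h)^2) < tol
      t_h <- t_new
      if (converged) break
    }
    scores[, h] <- t_h
    loadings[, h] <- p_h
    R <- R - W * outer(t_h, p_h)               # deflate; missing cells stay at zero
  }
  list(scores = scores, loadings = loadings)
}

set.seed(3)
X <- matrix(rnorm(30 * 4), nrow = 30)          # hypothetical data
X[cbind(c(1, 5, 9, 12), c(1, 2, 3, 4))] <- NA  # four missing cells
res <- nipals_sketch(X, ncomp = 2)
colSums(res$loadings^2)                        # each loading vector has unit length
```

Because missing cells are skipped in every sum, no imputation is needed before the components are extracted.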
Note that scale = TRUE will throw an error if there are constant
variables in the data, in which case it's best to filter these variables
in advance.
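One possible way to filter constant variables before calling pca with scale = TRUE is sketched below in base R. The toy matrix and the object names are hypothetical, and a zero standard deviation is only one reasonable definition of "constant".

```r
# Sketch: drop columns with zero standard deviation (toy data, base R only).
set.seed(2)
X <- cbind(a = rnorm(10), b = rep(3, 10), c = rnorm(10))  # column b is constant

col_sd <- apply(X, 2, sd, na.rm = TRUE)
keep <- !is.na(col_sd) & col_sd > 0
X_filtered <- X[, keep, drop = FALSE]

colnames(X_filtered)  # columns that remain after filtering
```

The filtered matrix can then be passed on for scaling without dividing by a zero standard deviation.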
According to Filzmoser et al., an ILR log ratio transformation is more appropriate for PCA with compositional data. Both CLR and ILR are valid.
Logratio transform and multilevel analysis are performed sequentially as
internal pre-processing steps, through logratio.transfo and withinVariation
respectively.
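For intuition about the multilevel step, the sketch below shows one common form of within-subject decomposition for a single repeated-measures factor: each subject's mean is subtracted from that subject's rows. Whether withinVariation computes exactly this is an assumption here; the data and names are hypothetical.

```r
# Sketch of a one-factor within-subject decomposition (base R, toy data).
X <- matrix(c( 1,  2,
               3,  6,
              10, 20,
              14, 24), ncol = 2, byrow = TRUE)
subject <- factor(c("s1", "s1", "s2", "s2"))  # two measurements per subject

# Per-subject column means, repeated for each row of that subject.
subject_means <- apply(X, 2, function(col) ave(col, subject))

# Within-subject part: what remains after removing each subject's mean.
Xw <- X - subject_means
rowsum(Xw, subject)  # within-subject deviations sum to zero per subject
```

Removing the subject means leaves only the variation between repeated measurements of the same subject.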
Logratio can only be applied if the data do not contain any 0 value (for count data, we thus advise normalising raw data with an offset of 1). For the ILR transformation, an additional offset might be needed.
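The zero-value restriction can be seen in a small centred log ratio (CLR) sketch. The helper clr_sketch is hypothetical and written in base R; it is not logratio.transfo, and the offset of 1 on the raw counts simply follows the advice above.

```r
# Hypothetical CLR helper for illustration; not the package's logratio.transfo().
clr_sketch <- function(X) {
  L <- log(as.matrix(X))
  sweep(L, 1, rowMeans(L))  # subtract each sample's mean log value
}

counts <- matrix(c(10, 0, 5,
                    3, 7, 2), nrow = 2, byrow = TRUE)  # toy counts with a zero

# An offset of 1 on the raw counts avoids log(0) = -Inf.
props <- (counts + 1) / rowSums(counts + 1)
Z <- clr_sketch(props)

rowSums(Z)  # CLR-transformed rows sum to zero (up to rounding)
```

Without the offset, the zero count would produce an infinite value and the transformed row could not be used in the PCA.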
pca returns a list with class "pca" and "prcomp" containing the following components:
call |
if |
X |
The input data matrix, possibly scaled and centered. |
ncomp |
The number of principal components used. |
center |
The centering used. |
scale |
The scaling used. |
names |
List of row and column names of data. |
sdev |
The eigenvalues of the covariance/correlation matrix, though the calculation is actually done with the singular values of the data matrix or by using NIPALS. |
loadings |
A list of length one containing the matrix of variable loadings for X (i.e., a matrix whose columns contain the eigenvectors). |
variates |
Matrix containing the coordinate values corresponding to the projection of the samples in the space spanned by the principal components. These are the dimension-reduced representation of observations/samples. |
var.tot |
Total variance in the data. |
prop_expl_var |
Proportion of variance explained per component after setting possible missing values in the data to zero. |
cum.var |
The cumulative explained variance for components. |
Xw |
If multilevel, the data matrix with within-group-variation removed. |
design |
If multilevel, the provided design. |
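The relationship between total variance, proportion of explained variance, and cumulative explained variance can be sketched with stats::prcomp on a built-in dataset. This uses prcomp's convention, where sdev holds standard deviations and must be squared; the object names below are illustrative and are not the components returned by pca.

```r
# Sketch with stats::prcomp on a built-in dataset (not a mixOmics "pca" object).
fit <- prcomp(USArrests, center = TRUE, scale. = TRUE)

eig <- fit$sdev^2                   # variances of the components
var_tot <- sum(eig)                 # total variance in the scaled data
prop_expl_var <- eig / var_tot      # proportion explained per component
cum_var <- cumsum(prop_expl_var)    # cumulative explained variance

round(cbind(prop_expl_var, cum_var), 3)
```

With four scaled variables the total variance equals the number of variables, and the cumulative proportion reaches 1 at the last component.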
Florian Rohart, Kim-Anh Lê Cao, Ignacio González, Al J Abadi
On log ratio transformations:
Filzmoser, P., Hron, K., Reimann, C.: Principal component analysis for compositional data with outliers. Environmetrics 20(6), 621-632 (2009)
Lê Cao, K.-A., Costello, M.E., Lakis, V.A., Bartolo, F., Chua, X.Y., Brazeilles, R., Rondeau, P.: MixMC: Multivariate insights into Microbial Communities. PLoS ONE 11(8), e0160169 (2016)
On multilevel decomposition:
Westerhuis, J.A., van Velzen, E.J., Hoefsloot, H.C., Smilde, A.K.: Multivariate paired data analysis: multilevel plsda versus oplsda. Metabolomics 6(1), 119-128 (2010)
Liquet, B., Lê Cao, K.-A., Hocini, H., Thiebaut, R.: A novel approach for biomarker selection and the integration of repeated measures experiments from two assays. BMC Bioinformatics 13(1), 325 (2012)
nipals, prcomp, biplot, plotIndiv, plotVar and http://www.mixOmics.org for more details.
library(mixOmics)

# example with missing values where NIPALS is applied
# --------------------------------
data(multidrug)
X <- multidrug$ABC.trans
pca.res <- pca(X, ncomp = 4, scale = TRUE)
plot(pca.res)
print(pca.res)
biplot(pca.res, group = multidrug$cell.line$Class, legend.title = 'Class')
# samples representation
plotIndiv(pca.res, ind.names = multidrug$cell.line$Class,
          group = as.numeric(as.factor(multidrug$cell.line$Class)))
# variable representation
plotVar(pca.res, var.names = TRUE, cutoff = 0.4, pch = 16)
## Not run:
plotIndiv(pca.res, cex = 0.2,
          col = as.numeric(as.factor(multidrug$cell.line$Class)), style = "3d")
plotVar(pca.res, rad.in = 0.5, cex = 0.5, style = "3d")
## End(Not run)
# example with imputing the missing values using impute.nipals()
# --------------------------------
data("nutrimouse")
X <- data.matrix(nutrimouse$lipid)
X <- scale(X, center = TRUE, scale = TRUE)
## add missing values to X to impute and compare to actual values
set.seed(42)
na.ind <- sample(seq_along(X), size = 20)
true.values <- X[na.ind]
X[na.ind] <- NA
pca.no.impute <- pca(X, ncomp = 2)
plotIndiv(pca.no.impute, group = nutrimouse$diet, pch = 16)
X.impute <- impute.nipals(X, ncomp = 10)
## compare
cbind('imputed' = round(X.impute[na.ind], 2),
      'actual' = round(true.values, 2))
## run pca using imputed matrix
pca.impute <- pca(X.impute, ncomp = 2)
plotIndiv(pca.impute, group = nutrimouse$diet, pch = 16)
# example with multilevel decomposition and CLR log ratio transformation
# (ILR takes longer to run)
# ----------------
data("diverse.16S")
pca.res <- pca(X = diverse.16S$data.TSS, ncomp = 3,
               logratio = 'CLR', multilevel = diverse.16S$sample)
plot(pca.res)
plotIndiv(pca.res, ind.names = FALSE,
          group = diverse.16S$bodysite,
          title = '16S diverse data',
          legend = TRUE,
          legend.title = 'Bodysite')