knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, error = FALSE, message = FALSE )
The poplin
package aims to provide data processing utilities (imputation,
normalization, and dimension reduction) for LC/MS metabolomics data and a S4
class container for storing and retrieving the resulting outputs, motivated by
the r Biocpkg("SingleCellExperiment")
.
The poplin
class, an extension of the SummarizedExperiment
class, provides
additional containers for data processing results and dimension-reduced
presentations of metabolomics data. poplin
objects can be created via the
constructor of the same name. Posit that we have a feature intensity matrix
generated from spectral data processing packages, such as r Biocpkg("xcms")
.
Note that rows represent features and columns represent samples like
SummarizedExperiment
objects.
library(poplin) nsample <- 20 nfeature <- 100 features <- rlnorm(nsample * nfeature, 10, 1) fmat <- matrix(features, ncol = nsample) cn <- paste0("S", seq_len(nsample)) rn <- paste0("F", seq_len(nfeature)) colnames(fmat) <- cn rownames(fmat) <- rn pd <- poplin( assays = list(raw = fmat), colData = DataFrame(sample_id = cn), rowData = DataFrame(feature_id = rn) ) pd
Alternatively, a poplin
object can be constructed by coercing an existing
SummarizedExperiment
object.
se <- SummarizedExperiment( assays = list(raw = fmat), colData = DataFrame(sample_id = cn), rowData = DataFrame(feature_id = rn) ) pd <- as(se, "poplin") # any opration applied to "se" also works on "pd" assays(pd) dim(pd) head(rowData(pd), 3) head(colData(pd), 3)
To illustrate the methods of the class, we will use the faahko_poplin
data
included in the poplin
package. This data set was generated from the faahko3
object in the r Biocpkg("faahKO")
package, which consists of 12 samples (6
wild-type and 6 FAAH knockout mice) and 206 LC/MS peaks.
data(faahko_poplin)
faahko_poplin
The poplin
class have three data containers: poplinRaw
, poplinData
,
poplinReduced
.
poplinRaw
corresponds to assays
in the SummarizedExperiment
class and is
intended to store raw intensity data. To retrieve the data in this container,
one can use poplin_raw_list()
accessor. This is an alias of assays()
methods from the SummarizedExperiment
class.
## Get a list of raw intensity data sets. poplin_raw_list(faahko_poplin) # alias of assays() ## Get the names of data sets poplin_raw_names(faahko_poplin) # alias of assayNames() ## Get indvidual entries head(poplin_raw(faahko_poplin, 1), 3) # alias of assay()
poplinData
is intended to store processed data that are typically returned
by utility functions in the poplin
package. To retrieve the data in this
container, one can use poplin_data_list()
accessor. Note that each entry
must have the same dimension as returned by dim()
.
poplin_data_list(faahko_poplin) poplin_data_names(faahko_poplin) head(poplin_data(faahko_poplin, "knn"), 3)
poplinReduced
is intended to store dimension-reduced data. To retrieve the
data in this container, one can use poplin_reduced_list()
accessor. Note that
each entry must have the same number of rows as returned by ncol()
.
poplin_reduced_list(faahko_poplin) poplin_reduced_names(faahko_poplin) head(poplin_reduced(faahko_poplin, "pca"), 3)
In the poplin
class, each of these accessors has setter methods so that users
can assign data to individual containers.
## Operations also work on poplinRaw and poplinReduced containers. knn <- poplin_data(faahko_poplin, "knn") empty <- faahko_poplin poplin_data_list(empty) <- list() # replace with empty data poplin_data_list(empty) <- list(knn1 = knn, knn2 = knn) # add data list poplin_data(empty, "knn3") <- knn # add data poplin_data_names(empty) poplin_data_names(empty) <- c("imp1", "imp2", "imp3") # change names poplin_data_names(empty)
In the poplin package, commonly used missing value imputation algorithms are
available via the poplin_impute()
function, which included k-nearest neighbor
(using Gower's distance), random forest, PCA-based methods (e.g., NIPALS PCA,
Bayesian PCA, Probabilistic PCA), and univariate replacement (e.g.,
half-minimum, median, mean). poplin_impute()
can be applied either to a
poplin
object or matrix
. Please refer to the [Visualization] section to
visualize the missingness of the data.
## missing % of raw intensity matrix m <- poplin_raw(faahko_poplin, "raw") 100 * sum(is.na(m)) / prod(dim(faahko_poplin)) ## apply half-mininum imputation to a poplin object res <- poplin_impute(x = faahko_poplin, xin = "raw", xout = "halfmin", method = "univariate", type = "halfmin") ## apply random forest imputation to a matrix poplin_data(res, "rf") <- poplin_impute(x = m, method = "randomforest") poplin_data_list(res)
In metabolomics analysis, the data may need to be normalized to reduce unwanted
sample-to-sample variability. The poplin_normalize()
function provides several
data-driven normalization approaches, such as probabilistic quotient
normalization (PQN), cyclic LOESS normalization, variance stabilizing
normalization (generalized log transformation), sum normalization, median
normalization, feature-based scaling (e.g., auto scaling, pareto scaling, level
scaling, and etc.).
## Apply sum normalization to a poplin object res <- poplin_normalize(x = faahko_poplin, method = "sum", xin = "knn", xout = "knn_pqn") ## Apply VSN normalization to a matrix m <- poplin_data(faahko_poplin, "knn") poplin_data(res, "knn_vsn") <- poplin_normalize(x = m, method = "vsn") poplin_data_list(res)
In metabolomics, dimension reduction methods are often used for modeling and
visualization. Currently, the poplin package supports three dimension-reduction
methods: principal component analysis (PCA), t-distributed stochastic neighbor
embedding (t-SNE), and partial least squares-discriminant analysis (PLS-DA). The
poplin_reduce
function perform dimension reduction of the data and store
the result to the poplinReduced
container.
empty <- faahko_poplin poplin_reduced_list(empty) <- list() poplin_reduced_names(empty) ## Apply PCA to a poplin object res <- poplin_reduce(x = empty, xin = "knn_cyclic", xout = "pca", method = "pca", ncomp = 3) ## Apply t-SNE to a matrix poplin_reduced(res, "tsne") <- poplin_reduce(m, method = "tsne", ncomp = 3, perplexity = 3) ## Apply PLS-DA to a poplin object y <- factor(colData(res)$sample_group, levels = c("WT", "KO")) res <- poplin_reduce(x = res, xin = "knn_cyclic", xout = "plsda", method = "plsda", y = y, ncomp = 3)
The poplin_reduce()
function returns the result containing custom attributes
that are used for summary and visualization. Please refer to [Visualization] for
details.
summary(poplin_reduced(res, "pca")) summary(poplin_reduced(res, "tsne")) summary(poplin_reduced(res, "plsda"))
The poplin package provide common visualization for metabolomics data based on
r CRANpkg("ggplot2")
. The plot functions in poplin package can be applied
either to a poplin
object or a matrix
.
The poplin_naplot()
helps to visually inspect missingness of the data.
poplin_naplot(x = faahko_poplin, xin = "raw")
The poplin_corplot()
visualizes correlations between samples (or features) to
quickly check the grouping structure in the data.
poplin_corplot(x = faahko_poplin, xin = "knn_cyclic")
The poplin_boxplot()
produces a box-and-whisker plot of the feature intensity
values across the samples.
group <- colData(faahko_poplin)$sample_group ## distribution of intensities before normalization poplin_boxplot(faahko_poplin, xin = "knn", group = group, pre_log2 = TRUE) ## distribution of intensities after cyclic LOESS normalization poplin_boxplot(faahko_poplin, xin = "knn_cyclic", group = group)
The poplin_scoreplot()
function visualizes the data onto a lower-dimensional
space using the poplin_reduce()
output.
group <- colData(faahko_poplin)$sample_group ## PCA output poplin_scoreplot(faahko_poplin, xin = "pca", group = group, ellipse = TRUE) ## PLS-DA output poplin_scoreplot(faahko_poplin, xin = "plsda", ellipse = TRUE, label = TRUE)
The poplin_biplot()
function Visualize an overlay of a score plot and a
loading plot using the poplin_reduce()
output.
group <- colData(faahko_poplin)$sample_group poplin_biplot(faahko_poplin, xin = "pca", group = group, ellipse = TRUE, arrow_col = "orange", arrow_label_subset = c("FT039", "FT042", "FT046", "FT098"), arrow_label_col = "gray20")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.