knitr::opts_chunk$set(fig.width=6, fig.height=6, fig.path='figures/')
The objective of the w4m2bioc package is to facilitate the transfer of preprocessed data and metadata (i.e., after the XCMS and CAMERA steps in metabolomics) between the Galaxy-based Workflow4Metabolomics infrastructure [@Giacomoni2015] and the R environment [@RCoreTeam2016]. The Galaxy modules of the Workflow4Metabolomics infrastructure handle preprocessed data and metadata as three tabulated .tsv files. Within R, such data and metadata can be conveniently stored in an ExpressionSet object from the Biobase Bioconductor package [@Hubert2015]. The w4m2bioc package thus provides functions and methods to import the three .tsv files into an ExpressionSet object and to export them from it.
Preprocessed data and metadata within the Workflow4Metabolomics infrastructure consist of three tabulated .tsv files:
a numerical matrix of intensities (dataMatrix; variables x samples),
a data frame containing the sample metadata (sampleMetadata; samples x sample metadata),
a data frame containing the variable metadata (variableMetadata; variables x variable metadata).
There is no constraint regarding the content of the sampleMetadata and variableMetadata columns, to allow maximum flexibility with different types of omic data sets. The only constraints are that all three tables have row and column names (without duplicated or missing values) and that there is an exact match between the column names of the dataMatrix and the row names of the sampleMetadata (sample names) on one hand, and between the row names of the dataMatrix and the row names of the variableMetadata (variable names) on the other hand.
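The constraints above can be sketched in base R on toy tables. This is only an illustration, assuming the variables x samples orientation of the dataMatrix listed above; the helpers `namesAreValid` and `tablesAreConsistent` and the toy values are hypothetical and are not part of w4m2bioc.

```r
# Toy tables mimicking the W4M layout (all names and values are made up)
dataMatrix <- matrix(c(5.1, 4.8, 6.0, 5.5, 3.9, 4.2),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("M1", "M2", "M3"), c("s1", "s2")))
sampleMetadata <- data.frame(gender = c("F", "M"),
                             row.names = c("s1", "s2"))
variableMetadata <- data.frame(mz = c(101.02, 150.05, 203.11),
                               row.names = c("M1", "M2", "M3"))

# Hypothetical helper: TRUE when names exist, are non-missing,
# non-empty, and contain no duplicates
namesAreValid <- function(namesVc) {
  !is.null(namesVc) && !anyNA(namesVc) &&
    all(nzchar(namesVc)) && !anyDuplicated(namesVc)
}

# Hypothetical helper: checks the naming constraints across the three tables
tablesAreConsistent <- function(datMN, samDF, varDF) {
  allNamesLs <- list(rownames(datMN), colnames(datMN),
                     rownames(samDF), colnames(samDF),
                     rownames(varDF), colnames(varDF))
  all(vapply(allNamesLs, namesAreValid, logical(1))) &&
    identical(colnames(datMN), rownames(samDF)) &&  # sample names
    identical(rownames(datMN), rownames(varDF))     # variable names
}

tablesAreConsistent(dataMatrix, sampleMetadata, variableMetadata)
```

Because the match is tested with `identical`, the sample and variable names must also appear in the same order in the matching tables.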
The ExpressionSet class from Bioconductor includes three slots which can be used to store these tables: the assayData, the phenoData, and the featureData.
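To make the mapping between the three tables and the three slots concrete, here is a minimal sketch that builds an ExpressionSet by hand with the Biobase constructors. The toy tables are made up, and this is not necessarily how w4m2bioc builds the object internally.

```r
library(Biobase)

# Toy tables (made-up names and values); the matrix is variables x samples
datMN <- matrix(c(5.1, 4.8, 6.0, 5.5, 3.9, 4.2), nrow = 3, byrow = TRUE,
                dimnames = list(c("M1", "M2", "M3"), c("s1", "s2")))
samDF <- data.frame(gender = c("F", "M"), row.names = c("s1", "s2"))
varDF <- data.frame(mz = c(101.02, 150.05, 203.11),
                    row.names = c("M1", "M2", "M3"))

# dataMatrix -> assayData, sampleMetadata -> phenoData,
# variableMetadata -> featureData
toySet <- ExpressionSet(assayData = datMN,
                        phenoData = AnnotatedDataFrame(samDF),
                        featureData = AnnotatedDataFrame(varDF))

dim(exprs(toySet))  # intensities
pData(toySet)       # sample metadata
fData(toySet)       # variable metadata
```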
Let us first load the package:
library(w4m2bioc)
We can then build the sacSet object by reading the 3 tables containing the data intensities (dataMatrix), and the sample and variable metadata (sampleMetadata and variableMetadata, respectively):
'sacurine-dataMatrix.tsv'
'sacurine-sampleMetadata.tsv'
'sacurine-variableMetadata.tsv'
These tabular files are located in the extdata folder of the installed w4m2bioc package and can be opened with a spreadsheet program such as Excel.
We use the readw4m function to build the object which will contain the 3 tables (one matrix of numerics and two data frames):
sacSet <- readw4m(file.path(path.package("w4m2bioc"), "extdata"))
sacSet
Notes:
A warning message is printed when some variable (or sample) names in the initial tables are not syntactically valid in R (here the variable names in the dataMatrix.tsv and variableMetadata.tsv files in the package have already been formatted with the make.names function). The warning message can be hidden with the verboseL = FALSE argument. If duplicates are present, the call to readw4m generates an error.
The sample and variable names can be accessed and modified by using the sampleNames and featureNames accessors of the ExpressionSet object:
library(Biobase)
varNamesVc <- featureNames(sacSet)
featureNames(sacSet) <- make.names(varNamesVc)
checkw4m(sacSet)
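As a side illustration of the two notes above, the following base R sketch shows what make.names does to names that are not syntactically valid, and how distinct raw names can collide after conversion. The compound names are made up for the example.

```r
# Made-up variable names; the first, second and fourth are not valid R names
rawNamesVc <- c("1-methylhistidine", "citric acid", "taurine", "citric-acid")

cleanNamesVc <- make.names(rawNamesVc)
cleanNamesVc

# Names that were modified by the conversion
rawNamesVc[rawNamesVc != cleanNamesVc]

# Conversion can create duplicates, so it is worth checking afterwards
anyDuplicated(cleanNamesVc) > 0
cleanNamesVc[duplicated(cleanNamesVc)]
```

Here the two spellings of citric acid collapse to the same valid name, which is the kind of duplicate that would need to be resolved before the tables are used.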
We can access the dataMatrix, sampleMetadata and variableMetadata from the ExpressionSet by using the exprs, pData, and fData methods, respectively. Suppose for instance that we want to transform the intensities back to the arithmetic scale (they have been log10 transformed in the 'dataMatrix.tsv' file):
sacDataMN <- exprs(sacSet)
sacArithDataMN <- 10^sacDataMN
sacArithSet <- sacSet
exprs(sacArithSet) <- sacArithDataMN
checkw4m(sacArithSet)
Notes:
In the data matrix exported from the ExpressionSet, the samples are stored as columns.
The compatibility of the dimensions and sample/variable names of the new data matrix is not automatically checked during the replacement: hence we check the integrity of the object afterwards.
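The back-transformation and the kind of mismatch that an integrity check is meant to catch can be sketched in base R on a toy matrix (made-up values; this does not reproduce what checkw4m does internally):

```r
# Toy log10 intensities, variables x samples (made-up values)
logMN <- matrix(log10(c(100, 1000, 10, 10000)), nrow = 2,
                dimnames = list(c("M1", "M2"), c("s1", "s2")))

# Back to the arithmetic scale: element-wise, so dimnames are preserved
arithMN <- 10^logMN
identical(dimnames(arithMN), dimnames(logMN))

# The transformation is reversible up to floating-point error
all.equal(log10(arithMN), logMN)

# A replacement matrix with reordered samples has the same dimensions
# but no longer matches the original sample names position by position
shuffledMN <- arithMN[, c("s2", "s1")]
identical(dim(shuffledMN), dim(logMN))
identical(dimnames(shuffledMN), dimnames(logMN))
```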
We can also subset the samples and/or variables. Suppose that we would like to restrict the dataset to the female volunteers:
sacSamDF <- pData(sacSet)
sacGenderVc <- sacSamDF[, "gender"]
table(sacGenderVc)
femaleVl <- sacGenderVc == "F"
sacFemaleSet <- sacSet[, femaleVl]
sacFemaleSet
Multivariate analysis (e.g., PLS-DA) can be performed on our ExpressionSet object by using the ropls Bioconductor package [@Thevenot2015] and indicating the name of the sample metadata column to be used as the response:
library(ropls)
sacGenderPlsda <- opls(sacSet, "gender")
# Same model built without the default plot, then the four summary
# plots drawn on a single 2 x 2 layout
library(ropls)
sacGenderPlsda <- opls(sacSet, "gender", plotL = FALSE)
layout(matrix(1:4, nrow = 2, byrow = TRUE))
for(typeC in c("overview", "outlier", "x-score", "x-loading"))
  plot(sacGenderPlsda, typeVc = typeC, parDevNewL = FALSE)
To export our ExpressionSet object back to the three W4M tabulated files, we use the writew4m method:
writew4m(sacFemaleSet, filePrefixC = file.path(getwd(), "sacFemale_"))
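For readers who want to inspect a tab-separated table from R, here is a generic base R round trip on a toy matrix. The file name and the header layout are assumptions made for the example: the exact layout written by writew4m (file suffixes, header of the first column) is not described here and may differ.

```r
# Toy intensity matrix, variables x samples (made-up values)
datMN <- matrix(c(5.1, 4.8, 6.0, 5.5), nrow = 2,
                dimnames = list(c("M1", "M2"), c("s1", "s2")))

# Hypothetical file name in a temporary directory
tsvFileC <- file.path(tempdir(), "toy_dataMatrix.tsv")

# col.names = NA writes an empty header cell above the row names column
write.table(datMN, file = tsvFileC, sep = "\t", quote = FALSE,
            col.names = NA)

# Read the table back, using the first column as row names and
# keeping the column names unchanged
backMN <- as.matrix(read.table(tsvFileC, header = TRUE, sep = "\t",
                               row.names = 1, check.names = FALSE))

all.equal(backMN, datMN)
```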
Here is the output of sessionInfo() on the system on which this document was compiled:
sessionInfo()