BiocStyle::markdown()
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Package: r Biocpkg("peakPantheR")
Authors: Arnaud Wolfer, Goncalo Correia
## Silently loading all packages library(BiocStyle) library(peakPantheR) library(faahKO) library(pander) library(doParallel) library(foreach)
The peakPantheR
package is designed for the detection, integration and
reporting of pre-defined features in MS files (e.g. compounds, fragments,
adducts, ...).
The Parallel Annotation is set to detect and integrate multiple compounds in multiple files in parallel and store results in a single object. It can be employed to integrate a large number of expected features across a dataset.
Using the r Biocpkg("faahKO")
raw MS dataset as an example, this vignette
will:
r Biocpkg("faahKO")
datasetParallel compound integration is set to process multiple compounds in multiple files in parallel, and store results in a single object.
knitr::include_graphics("../man/figures/parallelAnnotation.png")
To achieve this, peakPantheR
will:
knitr::include_graphics("../man/figures/parallelAnnotation_procedure.png")
Diagram of the workflow and functions used for parallel annotation.
We can target 2 pre-defined features in 6 raw MS spectra file from the
r Biocpkg("faahKO")
package using peakPantheR_parallelAnnotation()
. For more
details on the installation and input data employed, please consult the
Getting Started with peakPantheR vignette.
First the paths to 3 MS file from the r Biocpkg("faahKO")
are located and used
as input spectras. In this example these 3 samples are considered as
representative of the whole run (e.g. Quality Control samples):
library(faahKO) ## file paths input_spectraPaths <- c(system.file('cdf/KO/ko15.CDF', package = "faahKO"), system.file('cdf/KO/ko16.CDF', package = "faahKO"), system.file('cdf/KO/ko18.CDF', package = "faahKO")) input_spectraPaths
Two targeted features (e.g. compounds, fragments, adducts, ...) are defined and stored in a table with as columns:
cpdID
(numeric)cpdName
(character)rtMin
(sec)rtMax
(sec)rt
(sec, optional / NA
)mzMin
(m/z)mzMax
(m/z)mz
(m/z, optional / NA
)# targetFeatTable input_targetFeatTable <- data.frame(matrix(vector(), 2, 8, dimnames=list(c(), c("cpdID", "cpdName", "rtMin", "rt", "rtMax", "mzMin", "mz", "mzMax"))), stringsAsFactors=FALSE) input_targetFeatTable[1,] <- c("ID-1", "Cpd 1", 3310., 3344.888, 3390., 522.194778, 522.2, 522.205222) input_targetFeatTable[2,] <- c("ID-2", "Cpd 2", 3280., 3385.577, 3440., 496.195038, 496.2, 496.204962) input_targetFeatTable[,c(3:8)] <- sapply(input_targetFeatTable[,c(3:8)], as.numeric)
# use pandoc for improved readability input_targetFeatTable <- data.frame(matrix(vector(), 2, 8, dimnames=list(c(), c("cpdID", "cpdName", "rtMin", "rt", "rtMax", "mzMin", "mz", "mzMax"))), stringsAsFactors=FALSE) input_targetFeatTable[1,] <- c("ID-1", "Cpd 1", 3310., 3344.888, 3390., 522.194778, 522.2, 522.205222) input_targetFeatTable[2,] <- c("ID-2", "Cpd 2", 3280., 3385.577, 3440., 496.195038, 496.2, 496.204962) input_targetFeatTable[,c(3:8)] <- sapply(input_targetFeatTable[,c(3:8)], as.numeric) rownames(input_targetFeatTable) <- NULL pander::pandoc.table(input_targetFeatTable, digits = 9)
Additional compound and spectra metadata can be provided but isn't employed during the fitting procedure:
# spectra Metadata input_spectraMetadata <- data.frame(matrix(c("sample type 1", "sample type 2", "sample type 1"), 3, 1, dimnames=list(c(),c("sampleType"))), stringsAsFactors=FALSE)
# use pandoc for improved readability input_spectraMetadata <- data.frame(matrix(c("sample type 1", "sample type 2", "sample type 1"), 3, 1, dimnames=list(c(),c("sampleType"))), stringsAsFactors=FALSE) pander::pandoc.table(input_spectraMetadata)
A peakPantheRAnnotation
object is first initialised with the path to the files
to process (spectraPaths
), features to integrate (targetFeatTable
) and
additional information and parameters such as spectraMetadata
, uROI
, FIR
and if they should be used (useUROI=TRUE
, useFIR=TRUE
):
library(peakPantheR) init_annotation <- peakPantheRAnnotation(spectraPaths = input_spectraPaths, targetFeatTable = input_targetFeatTable, spectraMetadata = input_spectraMetadata)
The resulting peakPantheRAnnotation
object is not annotated, does not contain
and use uROI
and FIR
init_annotation
peakPantheR_parallelAnnotation()
will run the annotation across files in
parallel (if ncores
>0) and return the successful annotations
(result$annotation
) and failures (result$failures
):
# annotate files serially annotation_result <- peakPantheR_parallelAnnotation(init_annotation, ncores=0, curveModel='skewedGaussian', verbose=TRUE) # successful fit nbSamples(annotation_result$annotation) data_annotation <- annotation_result$annotation data_annotation # list failed fit annotation_result$failures
Based on the fit results, updated ROI (uROI
) and fallback integration region
(FIR
) can be automatically determined using annotationParamsDiagnostic()
:
uROI
are established as the min/max (rt
and m/z
) of the found peaks
(+/- 5% in RT)FIR
are established as the median of found rtMin
, rtMax
, mzMin
,
mzMax
updated_annotation <- annotationParamsDiagnostic(data_annotation, verbose=TRUE) # uROI now exist updated_annotation
outputAnnotationDiagnostic()
will save to disk
annotationParameters_summary.csv
containing the original ROI
and newly
determined uROI
and FIR
for manual validation. Additionnaly a diagnostic
plot for each compound is saved for reference and can be generated in parallel
with the argument ncores
:
# create a colourScale based on the sampleType uniq_sType <- sort(unique(spectraMetadata(updated_annotation)$sampleType), na.last=TRUE) col_sType <- unname( setNames(c('blue', 'red'), c(uniq_sType))[spectraMetadata(updated_annotation)$sampleType] ) # create a temporary location to save the diagnotic (otherwise provide the path # to the selected location) output_folder <- tempdir() # output fit diagnostic to disk outputAnnotationDiagnostic(updated_annotation, saveFolder=output_folder, savePlots=TRUE, sampleColour=col_sType, verbose=TRUE, ncores=2)
The data saved in annotationParameters_summary.csv
is as follow:
# use pandoc for improved readability, display the diagnostic results tmp_csv <- data.frame(matrix(nrow=2,ncol=21,dimnames=list(c(), c('cpdID', 'cpdName', 'X', 'ROI_rt', 'ROI_mz','ROI_rtMin', 'ROI_rtMax', 'ROI_mzMin', 'ROI_mzMax', 'X', 'uROI_rtMin', 'uROI_rtMax', 'uROI_mzMin', 'uROI_mzMax', 'uROI_rt', 'uROI_mz', 'X', 'FIR_rtMin', 'FIR_rtMax', 'FIR_mzMin', 'FIR_mzMax'))), stringsAsFactors=FALSE) tmp_csv[1,] <- c('ID-1','Cpd 1', '|', 3344.888, 522.2, 3310., 3390., 522.194778, 522.205222,'|', 3305.75893, 3411.436284, 522.194778, 522.205222, 3344.888, 522.2, '|', 3326.10635, 3407.272648, 522.194778, 522.205222) tmp_csv[2,] <- c('ID-2','Cpd 2', '|', 3385.577, 496.2, 3280., 3440., 496.195038, 496.204962,'|',3337.376665, 3462.449033, 496.195038, 496.204962, 3385.577, 496.2, '|', 3365.023857, 3453.404957, 496.195038, 496.204962) tmp_csv[,-c(1,2,3,10,17)] <- sapply(tmp_csv[,-c(1,2,3,10,17)], as.numeric) colnames(tmp_csv) <- c('cpdID', 'cpdName', 'X', 'ROI_rt', 'ROI_mz','ROI_rtMin', 'ROI_rtMax', 'ROI_mzMin', 'ROI_mzMax', 'X', 'uROI_rtMin', 'uROI_rtMax', 'uROI_mzMin', 'uROI_mzMax', 'uROI_rt', 'uROI_mz', 'X', 'FIR_rtMin', 'FIR_rtMax', 'FIR_mzMin', 'FIR_mzMax') pander::pandoc.table(tmp_csv, digits=9)
knitr::include_graphics( "../man/figures/parallel_annotation_diagnostic_cpd1.png")
Diagnostic plot for compound 1: The top panel is an overlay of the extracted EIC across all samples with the fitted curve as dotted line. The panel under the EIC represent each found peak RT peakwidth (
rtMin
,rtMax
and apex marked as dot), ordered with the first sample at the top. The bottom 3 panels represent foundRT
(peakwidth),m/z
(peakwidth) andpeak area
by run order, with the corresponding histograms to the right
ROI
exported to .csv
can be updated based on the diagnostic plots; uROI
(updated ROI potentially used for all samples) and FIR
(fallback integration
regions for when no peak is found) can also be tweaked to better fit the peaks.
The optional retentionTimeCorrection()
method provides an interface to adjust
the expected ROI rt values and account for chromatographic batch effects. By
comparing expected and found rt values for a set of reference compounds, a model
of the chromatographic shift for the present batch can be established. This
model can be in turned used to correct the expected retention time of all
targeted compounds.
In order to apply this method, the peakPantheRAnnotation
must be previously
annotated (isAnnotated=TRUE
).
The retention time correction algorithm to use can be selected using the
method
argument (currently polynomial
and
constant
methods are available).
retentionTimeCorrection()
fits a correction function to model the dependency
of the mean rt_dev_sec
per reference feature with the expected databased
retention time.
If useUROI=TRUE
, the expected retention time value is taken from the UROI_rt
field, otherwise ROI_rt
is used.
If robust=TRUE
, the RANSAC algorithm is used to automatically detect outliers
and exclude them from the fit (this should only be used with a large number of
reference features).
retentionTimeCorrection()
returns a list with 2 elements:
a modified peakPantheRAnnotation
object
a ggplot2
diagnostic plot (optional, depending on whether TRUE
or FALSE
is passed to the diagnostic
argument).
The returned peakPantheRAnnotation
object contains the same uROI
and FIR
mz
values as the original annotation, but the retention time related
parameters (rt
, rtMin
and rtMax
) are replaced by the adjusted values.
The rtMax
and rtMin
are set as the corrected rt
value plus or minus half
the value passed to the rtWindowWidth
argument, respectively.
useUROI
is also set to TRUE.
To continue with the workflow, simply set a new annotation object with the
fit parameters established by retentionTimeCorrection()
and call
peakPantheR_parallelAnnotation()
for the final annotation.
# Example with constant correction. rtCorrectionOutput <- retentionTimeCorrection(updated_annotation, rtCorrectionReferences=c('ID-1'), method='constant', robust=FALSE, rtWindowWidth=15, diagnostic=TRUE) updated_annotation <- rtCorrectionOutput$annotation # The ggplot2 plot object rtCorrectionOutput$plot # Example with second degree polynomial, without using RANSAC # # to obtain a robust fit rtCorrectionOutput <- retentionTimeCorrection(updated_annotation, rtCorrectionReferences=NULL, method='polynomial', params=list(polynomialOrder=2), robust=FALSE, rtWindowWidth=15, diagnostic=TRUE)
Following this manual validation of the fit on reference samples, the modified
parameters in the .csv
file can be reloaded and applied to all study samples.
peakPantheR_loadAnnotationParamsCSV()
will load the new .csv
parameters (as
generated by outputAnnotationDiagnostic()
) and initialise a
peakPantheRAnnotation
object without spectraPaths
, spectraMetadata
or
cpdMetadata
which will need to be added before annotation. useUROI
and
useFIR
are set to FALSE
by default and will need to be modified according to
the analysis to run. uROIExist
is established depending on the .csv
uROI
column, and will only be set to TRUE if no NA
are present. It is possible to
reset the FIR
values with the uROI
windows using resetFIR()
.
update_csv_path <- '/path_to_new_csv/' # load csv new_annotation <- peakPantheR_loadAnnotationParamsCSV(update_csv_path) #> uROIExist set to TRUE #> New peakPantheRAnnotation object initialised for 2 compounds new_annotation #> An object of class peakPantheRAnnotation #> 2 compounds in 0 samples. #> updated ROI exist (uROI) #> does not use updated ROI (uROI) #> does not use fallback integration regions (FIR) #> is not annotated new_annotation <- resetFIR(new_annotation) #> FIR will be reset with uROI values
Now that the fit parameters were set on 3 representative samples (e.g. QC), the
same processing can be applied to all study samples. resetAnnotation()
will
reinitialise all the results and modify the samples or compounds targeted as
required:
## new files new_spectraPaths <- c(system.file('cdf/KO/ko15.CDF', package = "faahKO"), system.file('cdf/WT/wt15.CDF', package = "faahKO"), system.file('cdf/KO/ko16.CDF', package = "faahKO"), system.file('cdf/WT/wt16.CDF', package = "faahKO"), system.file('cdf/KO/ko18.CDF', package = "faahKO"), system.file('cdf/WT/wt18.CDF', package = "faahKO")) new_spectraPaths
Below we define the metadata of these new samples:
## new spectra metadata new_spectraMetadata <- data.frame(matrix(c("KO", "WT", "KO", "WT", "KO", "WT"), 6, 1, dimnames=list(c(), c("Group"))), stringsAsFactors=FALSE)
# use pandoc for improved readability new_spectraMetadata <- data.frame(matrix(c("KO", "WT", "KO", "WT", "KO", "WT"), 6, 1, dimnames=list(c(), c("Group"))), stringsAsFactors=FALSE) pander::pandoc.table(new_spectraMetadata)
new_annotation <- resetAnnotation(updated_annotation, spectraPaths=new_spectraPaths, spectraMetadata=new_spectraMetadata, useUROI=TRUE, useFIR=TRUE, verbose=FALSE)
## add new samples to the annotation loaded from csv, useUROI, useFIR new_annotation <- resetAnnotation(new_annotation, spectraPaths=new_spectraPaths, spectraMetadata=new_spectraMetadata, useUROI=TRUE, useFIR=TRUE) #> peakPantheRAnnotation object being reset: #> Previous "ROI", "cpdID" and "cpdName" value kept #> Previous "uROI" value kept #> Previous "FIR" value kept #> Previous "cpdMetadata" value kept #> New "spectraPaths" value set #> New "spectraMetadata" value set #> Previous "uROIExist" value kept #> New "useUROI" value set #> New "useFIR" value set
new_annotation
We can now run the final annotation on all samples with the optimised targeted features:
# annotate files serially new_annotation_result <- peakPantheR_parallelAnnotation(new_annotation, ncores=0, verbose=FALSE) # successful fit nbSamples(new_annotation_result$annotation) final_annotation <- new_annotation_result$annotation final_annotation # list failed fit new_annotation_result$failures
The final fits can be saved to disk with outputAnnotationDiagnostic()
:
# create a colourScale based on the sampleType uniq_group <- sort(unique(spectraMetadata(final_annotation)$Group),na.last=TRUE) col_group <- unname( setNames(c('blue', 'red'), c(uniq_sType))[spectraMetadata(final_annotation)$Group] ) # create a temporary location to save the diagnotic (otherwise provide the path # to the selected location) final_output_folder <- tempdir() # output fit diagnostic to disk outputAnnotationDiagnostic(final_annotation, saveFolder=final_output_folder, savePlots=TRUE, sampleColour=col_group, verbose=TRUE)
For each processed sample, a peakTables
contains all the fit information for
all compounds targeted. annotationTable( , column)
will group the values
across all samples and compounds for any peakTables
column:
# peakTables for the first sample peakTables(final_annotation)[[1]]
# use pandoc for improved readability pander::pandoc.table(peakTables(final_annotation)[[1]])
# Extract the found peak area for all compounds and all samples annotationTable(final_annotation, column='peakArea')
# use pandoc for improved readability pander::pandoc.table(annotationTable(final_annotation, column='peakArea'))
Finally all annotation results can be saved to disk as .csv
with
outputAnnotationResult()
. These .csv
will contain the compound metadata,
spectra metadata and a file for each column of peakTables (with samples as rows
and compounds as columns):
# create a temporary location to save the diagnotic (otherwise provide the path # to the selected location) final_output_folder <- tempdir() # save outputAnnotationResult(final_annotation, saveFolder=final_output_folder, annotationName='ProjectName', verbose=TRUE) #> Compound metadata saved at /final_output_folder/ProjectName_cpdMetadata.csv #> Spectra metadata saved at #> /final_output_folder/ProjectName_spectraMetadata.csv #> Peak measurement "found" saved at /final_output_folder/ProjectName_found.csv #> Peak measurement "rtMin" saved at /final_output_folder/ProjectName_rtMin.csv #> Peak measurement "rt" saved at /final_output_folder/ProjectName_rt.csv #> Peak measurement "rtMax" saved at /final_output_folder/ProjectName_rtMax.csv #> Peak measurement "mzMin" saved at /final_output_folder/ProjectName_mzMin.csv #> Peak measurement "mz" saved at /final_output_folder/ProjectName_mz.csv #> Peak measurement "mzMax" saved at /final_output_folder/ProjectName_mzMax.csv #> Peak measurement "peakArea" saved at #> /final_output_folder/ProjectName_peakArea.csv #> Peak measurement "maxIntMeasured" saved at #> /final_output_folder/ProjectName_maxIntMeasured.csv #> Peak measurement "maxIntPredicted" saved at #> /final_output_folder/ProjectName_maxIntPredicted.csv #> Peak measurement "is_filled" saved at #> /final_output_folder/ProjectName_is_filled.csv #> Peak measurement "ppm_error" saved at #> /final_output_folder/ProjectName_ppm_error.csv #> Peak measurement "rt_dev_sec" saved at #> /final_output_folder/ProjectName_rt_dev_sec.csv #> Peak measurement "tailingFactor" saved at #> /final_output_folder/ProjectName_tailingFactor.csv #> Peak measurement "asymmetryFactor" saved at #> /final_output_folder/ProjectName_asymmetryFactor.csv #> Summary saved at /final_output_folder/ProjectName_summary.csv
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.