knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(MSstatsSampleSize)
This vignette introduces all the functionalities and summarizes their options in MSstatsSampleSize. MSstatsSampleSize requires protein abundances quantified in mass spectrometry runs as a matrix (columns for biological replicates (samples) and rows for proteins) and annotation including Run (Ms runs), biological replicates (samples) and their condition (such as a disease and time points). MSstatsSampleSize includes the following functionalities:
estimateVar
: estimates the variance across biological replicates and MS run
for each protein.simulateDataset
: simulates data with the given number(s) of biological
replicates based on the variance estimation. designSampleSizeClassification
: fit the classification model (five
classifier are provided) on each simulation and reports the mean predictive
accuracy of the classifier and mean protein importance over multiple iterations
of the simulation. The sample size per condition, which generates the largest
predictive accuracy, is estimated, while varying the number of biological
replicates to simulate. Also, the proteins, which can classify the conditions
best, are reported. The reported sample size per condition can be used to design
future experiments.In addition, MSstatsTMT includes the following visualization plots for sample size estimation:
meanSDplot
: draw the plot for the mean protein abundance vs standard
deviation in each condition for the input preliminary dataset. It can exhibit
the quality of input data.designSampleSizePCAplot
: make PCA plots for the simulated data with certain
sample size.designSampleSizeHypothesisTestingPlot
: visualize sample size calculation in
hypothesis testing, which estimates the minimal required number of replicates
under different expected fold changes.designSampleSizeClassificationPlots
: visualize sample size calculation in
classification, including predictive accuracy plot and protein importance plot
for different sample sizes. The function fits intensity-based linear model on the preliminary data. This function outputs variance components and mean abundance for each protein.
data
: Data matrix with protein abundance. Rows are proteins and columns are
Biological replicates or samples.annotation
: Group information for samples in data. Run
for MS run,
BioReplicate
for biological subject ID and Condition
for group information
are required. Run
information should be the same with the column of data
.
Multiple Run
may come from same BioReplicate
.
log2Trans
: Logical value, if TRUE
input data
is log-transformed with baseFALSE
data("OV_SRM_train") data("OV_SRM_train_annotation") head(OV_SRM_train)[,1:5] head(OV_SRM_train_annotation) # estimate the mean protein abundance and variance in each condition variance_estimation <- estimateVar(data = OV_SRM_train, annotation = OV_SRM_train_annotation, log2Trans = FALSE) # the mean protein abundance in each condition head(variance_estimation$mu) # the standard deviation in each condition head(variance_estimation$sigma) # the mean protein abundance across all the conditions head(variance_estimation$promean) # the standard deviation across all the conditions head(variance_estimation$prosd)
This function draws the plot for the mean protein abundance (X-axis) vs standard
deviation (Y-axis) in each condition. The lowess
function is used to fit the
LOWESS smoother between mean protein abundance and standard deviation (square
root of variance). This function generates the mean-SD plot or a pdf file based
on the selected inputs.
data
: A list with mean protein abundance matrix and standard deviation
matrix. It should be the output of estimateVar
function.x.axis.size
: Size of x-axis labeling in Mean-SD Plot. Default is 10.y.axis.size
: Size of y-axis labels. Default is 10.smoother_size
: Size of lowess smoother. Default is 1.width
: Width of the saved pdf file. Default is 4.height
: Height of the saved pdf file. Default is 4.xlimUp
: The upper limit of x-axis for mean-SD plot. Default is 30.ylimUp
: The upper limit of y-axis for mean-SD plot. Default is 3.address
: The name of folder that will store the results. Default folder is
the current working directory. The other assigned folder has to be existed under
the current working directory. An output pdf file is automatically created with
the default name of MeanSDPlot.pdf
. The command address can help to specify
where to store the file as well as how to modify the beginning of the file name.
If address=FALSE, plot will be not saved as pdf file but showed in window.# output a pdf file with mean-SD plot meanSDplot(variance_estimation)
based on the variance estimation
This function simulate datasets with the given numbers of biological replicates
and proteins based on the preliminary dataset (input for this function). The
function fits intensity-based linear model on the input data data
in order to
get variance and mean abundance, using estimateVar
function. Then it uses
variance components and mean abundance to simulate new training data with the
given sample size and protein number. It outputs the number of simulated proteins,
a vector with the number of simulated samples in each condition, the list of
simulated training datasets, the input dataset and the (simulated) validation
dataset.
data
: Protein abundance data matrix. Rows are proteins and columns are biological replicates (samples).annotation
: Group information for samples in data.mRun' for MS run,
BioReplicate' for biological subject ID and Condition' for group information are required.
Run' information should be the same with the column of data'. Multiple
Run' may come from same `BioReplicate'.log2Trans
: Default is FALSE. If TRUE, the input `data' is log-transformed with base 2.num_simulations
: Number of times to repeat simulation experiments (Number of simulated datasets). Default is 10.samples_per_group
: Number of samples per group to simulate. Default is 50.protein_rank
: The standard to rank the proteins in the input data'. It can be 1) "mean" of protein abundances over all the samples or 2) "sd" (standard deviation) of protein abundances over all the samples or 3) the "combined" of mean abundance and standard deviation.
The proteins in the input
data' are ranked based on `protein_rank' and the user can select a subset of proteins to simulate.protein_select
: select proteins with "low" or "high" mean abundance or standard deviation (variance) or their combination to simulate. If protein_order = "mean"' or protein_order = "sd"',
protein_select' should be "low" or "high". Default is "high", indicating high abundance or standard deviation proteins are selected to simulate.
If protein_order = "combined"',
protein_select' has two elements. The first element corrresponds to the mean abundance. The second element corrresponds to the standard deviation (variance). Default is c("high", "low") (select proteins with high abundance and low variance).protein_quantile_cutoff
: Quantile cutoff(s) for selecting protiens to simulate. For example, when protein_rank="mean"', and
protein_select="high"', protein_quantile_cutoff=0.1' Proteins are ranked based on their mean abundance across all the samples. Then, the top 10% highest abundant proteins are selected to simulate. Default is 0.0, which means that all the proteins are used. If
protein_rank = "combined"', protein_quantile_cutoff'
has two cutoffs. The first element corrresponds to the cutoff for mean abundance. The second element corrresponds to the cutoff for the standard deviation (variance). Default is c(0.0, 1.0), which means that all the proteins will be used.expected_FC
: Expected fold change of proteins. The first option (Default) is "data", indicating the fold changes are directly estimated from the input data'. The second option is a vector with predefined fold changes of listed proteins. The vector names must match with the unique information of Condition in
annotation'. One group must be selected as a baseline and has fold change 1 in the vector. The user should provide list_diff_proteins, which users expect to have the fold changes greater than 1. Other proteins that are not available in `list_diff_proteins' will be expected to have fold change = 1list_diff_proteins
: Vector of proteins names which are set to have fold changes greater than 1 between conditions. If user selected `expected_FC= "data" ', this should be NULL.simulate_validation
: Default is FALSE. If TRUE, simulate the validation set; otherwise, the input `data' will be used as the validation set.valid_samples_per_group
: Number of validation samples per group to simulate. This option works only when user selects `simulate_validation=TRUE'. Default is 50.# expected_FC = "data": fold change estimated from OV_SRM_train # select_simulated_proteins = "proportion": select the simulated proteins based #on the proportion of total proteins # simulate_valid = FALSE: use input OV_SRM_train as validation set simulated_datasets <- simulateDataset(data = OV_SRM_train, annotation = OV_SRM_train_annotation, log2Trans = FALSE, num_simulations = 10, # simulate 10 times samples_per_group = 50, # 50 samples per condition protein_rank = "mean", protein_select = "high", protein_quantile_cutoff = 0.0, expected_FC = "data", list_diff_proteins = NULL, simulate_validation = FALSE, valid_samples_per_group = 50)
Explore the output from simulateDataset
function
# the number of simulated proteins simulated_datasets$num_proteins # a vector with the number of simulated samples in each condition simulated_datasets$num_samples # the list of simulated protein abundance matrices # Each element of the list represents one simulation head(simulated_datasets$simulation_train_Xs[[1]]) # first simulation # the list of simulated condition vectors # Each element of the list represents one simulation head(simulated_datasets$simulation_train_Ys[[1]]) # first simulation
User can also specify the expected fold change of proteins they consider to be deferentially abundant between conditions.
# expected_FC = expected_FC: user defined fold change unique(OV_SRM_train_annotation$Condition) expected_FC <- c(1, 1.5) names(expected_FC) <- c("control", "ovarian cancer") set.seed(1212) # Here I randomly select some proteins as differential to show how the function works # The user should prepare this list of differential proteins by themselves diff_proteins <- sample(rownames(OV_SRM_train), 20) simualted_datasets_predefined_FC <- simulateDataset(data = OV_SRM_train, annotation = OV_SRM_train_annotation, log2Trans = FALSE, num_simulations = 10, # simulate 10 times samples_per_group = 50, # 50 samples per condition protein_rank = "mean", protein_select = "high", protein_quantile_cutoff = 0.0, expected_FC = expected_FC, list_diff_proteins = diff_proteins, simulate_validation = FALSE, valid_samples_per_group = 50)
This function fits the classification model, in order to classify the subjects
in the simulated training datasets (in the output of simulatedDataset
). Then
the fitted model is validated by the (simulated) validation set. Two performance
are reported :
(1) the mean predictive accuracy : The function trains classifier on each
simulated training dataset and reports the predictive accuracy of the trained
classifier on the validation data (output of SimulateDataset
function). Then
these predictive accuracies are averaged over all the simulation.
(2) the mean protein importance : It represents the importance of a protein in
separating different groups. It is estimated on each simulated training dataset
using function varImp
from package caret. Please refer to the help file of
varImp
about
how each classifier calculates the protein importance. Then these importance
values for each protein are averaged over all the simulation.
The list of classification models trained on each simulated dataset, the predictive accuracy on the validation set predicted by the corresponding classification model and the importance value for all the proteins estimated by the corresponding classification model are also reported.
simulations
: A list of simulated datasets It should be the name of the
output of SimulateDataset
function.classifier
: A string specifying which classfier to use. This function uses
function train
from package caret. The options are 1) rf (random forest
classifier, default option). 2) nnet (neural network), 3) svmLinear (support
vector machines with linear kernel), 4) logreg(logistic regression), and 5)
naive_bayes (naive_bayes).parallel
: Default is FALSE. If TRUE, parallel computation is performed.classification_results <- designSampleSizeClassification( simulations = simulated_datasets, parallel = FALSE)
Explore the output of designSampleSizeClassification
# the number of simulated proteins classification_results$num_proteins # a vector with the number of simulated samples in each condition classification_results$num_samples # the mean predictive accuracy over all the simulated datasets, # which have same 'num_proteins' and 'num_samples' classification_results$mean_predictive_accuracy # the mean protein importance vector over all the simulated datasets, # the length of which is 'num_proteins'. head(classification_results$mean_feature_importance)
In order to speed up the running time, the package also provides parallel
computation for designSampleSizeClassification
function.
## try parallel computation to speed up ## The parallel computation may cause error while allocating the core resource ## Then the users can try the above function without parallel computation classification_results_parallel <- designSampleSizeClassification( simulations = simulated_datasets, parallel = TRUE)
This function visualizes for sample size calculation in classification. Mean
predictive accuracy and mean protein importance under each sample size is from
the input data
, which is the output from function designSampleSizeClassification
.
To illustrate the mean predictive accuracy and protein importance under different
sample sizes, it generates two types of plots in pdf files as output :
(1) The predictive accuracy plot shows the mean predictive accuracy under different sample sizes. The X-axis represents different sample sizes and y-axis represents the mean predictive accuracy.
(2) The protein importance plot includes multiple subplots. The number of
subplots is equal to list_samples_per_group
. Each subplot shows the top
num_important_proteins_show
most important proteins under each sample size.
The Y-axis of each subplot is the protein name and X-axis is the mean protein
importance under the sample size.
(3) A numeric value which is the estimated optimal sample size per group for the input dataset for classification problem.
While varying the number of biological replicates to simulate, the sample size per condition which generates the largest predictive accuracy can be found from the predictive accuracy plot, The optimal sample size per condition can be used to design future experiments. Also, the proteins, which can classify the conditions best, are reported by the protein importance plot.
data
: A list of outputs from function designSampleSizeClassification
.
Each element represents the results under a specific sample size. The input
should include at least two simulation results with different sample sizes.list_samples_per_group
: A vector includes the different sample sizes
simulated. This is required. The number of simulation in the input data
should
be equal to the length of list_samples_per_groupoptimal_threshold
: The maximal cutoff for deciding the optimal sample size.
Default is 0.0001. Large cutoff can lead to smaller optimal sample size
whereas small cutoff produces large optimal sample size.num_important_proteins_show
: The number of proteins to show in protein
importance plot.protein_importance_plot
: TRUE(default) draws protein importance plot.predictive_accuracy_plot
: TRUE(default) draws predictive accuracy plot.x.axis.size
: Size of x-axis labeling in predictive accuracy plot and protein
importance plot. Default is 10.y.axis.size
: Size of y-axis labels in predictive accuracy plot and protein
importance plot. Default is 10.predictive_accuracy_plot_width
: Width of the saved pdf file for predictive
accuracy plot. Default is 4.predictive_accuracy_plot_height
: Height of the saved pdf file for
predictive accuracy plot. Default is 4.protein_importance_plot_width
: Width of the saved pdf file for protein
importance plot. Default is 3.protein_importance_plot_height
: Height of the saved pdf file for protein
importance plot. Default is 3.ylimUp_predictive_accuracy
: The upper limit of y-axis for predictive
accuracy plot. Default is 1. The range should be 0 to 1.ylimDown_predictive_accuracy
: The lower limit of y-axis for predictive
accuracy plot. Default is 0.0. The range should be 0 to 1.address
: The name of folder that will store the results. Default folder is
the current working directory. The other assigned folder has to be existed under
the current working directory. An output pdf file is automatically created with
the default name of PredictiveAccuracyPlot.pdf
and ProteinImportancePlot.pdf
.
The command address can help to specify where to store the file as well as how
to modify the beginning of the file name. If address=FALSE, plot will be not
saved as pdf file but showed in window.#### sample size classification #### # simulate different sample sizes # 1) 10 biological replicates per group # 1) 25 biological replicas per group # 2) 50 biological replicas per group # 3) 100 biological replicas per group # 4) 200 biological replicas per group list_samples_per_group <- c(10, 25, 50, 100, 200) # save the simulation results under each sample size multiple_sample_sizes <- list() for(i in seq_along(list_samples_per_group)){ # run simulation for each sample size simulated_datasets <- simulateDataset(data = OV_SRM_train, annotation = OV_SRM_train_annotation, log2Trans = FALSE, num_simulations = 10, # simulate 10 times samples_per_group = list_samples_per_group[i], protein_rank = "mean", protein_select = "high", protein_quantile_cutoff = 0.0, expected_FC = "data", list_diff_proteins = NULL, simulate_valid = FALSE, valid_samples_per_group = 50) # run classification performance estimation for each sample size res <- designSampleSizeClassification(simulations = simualted_datasets, parallel = TRUE) # save results multiple_sample_sizes[[i]] <- res } ## make the plots designSampleSizeClassificationPlots(multiple_sample_sizes, list_samples_per_group, ylimUp_predictive_accuracy = 0.8, ylimDown_predictive_accuracy = 0.6)
The function fits intensity-based linear model on the input data
. Then it uses
the fitted models and the fold changes estimated from the models to calculate
sample size for hypothesis testing through designSampleSize
function from
MSstats package. It outputs a table with the minimal number of biological
replicates per condition to acquire the expected FDR and power under different
fold changes, and a PDF file with the sample size plot.
data
: Protein abundance data matrix.
Rows are proteins and columns are biological replicates (samples).annotation
: Group information for samples in data. Run
for MS run, BioReplicate
for biological subject ID and Condition
for group information are required. Run
information should be the same with the column of data
. Multiple Run
may come from same BioReplicate
.log2Trans
: Default is FALSE. If TRUE, the input `data' is log-transformed with base 2.desired_FC
: the range of a desired fold change. The first option (Default) is "data", indicating the range of the desired fold change is directly estimated from the input data
, which are the minimal fold change and the maximal fold change in the input `data'. The second option is a vector which includes the lower and upper values of the desired fold change (For example, c(1.25,1.75)).protein_rank
: The standard to rank the proteins in the input data'. It can be 1) "mean" of protein abundances over all the samples or 2) "sd" (standard deviation) of protein abundances over all the samples or 3) the "combined" of mean abundance and standard deviation. The proteins in the input
data' are ranked based on `protein_rank'
and the user can select a subset of proteins for hypothesis testing and sample size calculation.protein_select
: select proteins with "low" or "high" mean abundance or standard deviation (variance) or their combination for hypothesis testing and sample size calculation. The variance (and the range of desired fold change if desiredFC = "data") will be estimated from the selected proteins. If protein_order = "mean"' or protein_order = "sd"',
protein_select' should be "low" or "high". Default is "high", indicating high abundance or standard deviation proteins are selected. If protein_order = "combined"',
protein_select' has two elements. The first element corrresponds to the mean abundance. The second element corrresponds to the standard deviation (variance). Default is c("high", "low") (select proteins with high abundance and low variance).protein_quantile_cutoff
: Quantile cutoff(s) for selecting protiens for hypothesis testing and sample size calculation. For example, when protein_rank="mean"', and
protein_select="high"', protein_quantile_cutoff=0.1' Proteins are ranked based on their mean abundance across all the samples. Then, the top 10% highest abundant proteins are selected. Default is 0.0, which means that all the proteins are used. If
protein_rank = "combined"', protein_quantile_cutoff'
has two cutoffs. The first element corrresponds to the cutoff for mean abundance. The second element corrresponds to the cutoff for the standard deviation (variance). Default is c(0.0, 1.0), which means that all the proteins will be used.FDR
: a pre-specified false discovery ratio (FDR) to control the overall false positive. Default is 0.05.power
: a pre-specified statistical power which defined as the probability of detecting a true fold change. You should input the average of power you expect. Default is 0.9.width
: Width of the saved pdf file. Default is 5.height
: Height of the saved pdf file. Default is 5.address
: The name of folder that will store the results. Default folder is
the current working directory. The other assigned folder has to be existed under
the current working directory. An output pdf file is automatically created with
the default name of HypothesisTestingSampleSizePlot.pdf
. The command address
can help to specify where to store the file as well as how to modify the beginning
of the file name.
If address=FALSE, plot will be not saved as pdf file but showed in window.# output a pdf file with sample size calculation plot for hypothesis testing # also return a table which summaries the plot HT_res <- designSampleSizeHypothesisTestingPlot(data = OV_SRM_train, annotation= OV_SRM_train_annotation, desired_FC = "data", protein_rank = "mean", protein_select = "high", protein_quantile_cutoff = 0.0, FDR=0.05, power=0.9) # data frame with columns desiredFC, numSample, FDR, power and CV head(HT_res)
This function draws PCA plots for the preliminary dataset and each simulated
dataset in simulations
(input for this function). It outputs a pdf file where
the number of page is equal to the number of simulations plus 1. The first page
represents a PCA plot for the input data
(OV_SRM_train
). Each of the following
pages presents a PCA plot under one simulation. X-axis of PCA plot is the first
component and y-axis is the second component. This function can be used to
validate whether the simulated dataset looks consistent with the input
preliminary dataset.
simulations
: A list of simulated datasets. It should be the output of
simulateDataset
function.which.PCA
: Select one PCA plot to show. It can be "all", "allonly", or
"simulationX". X should be index of simulation, such as "simulation1" or
"simulation5". Default is "all", which generates all the plots. "allonly"
generates the PCA plot for the whole input dataset. "simulationX" generates
the PCA plot for a specific simulated dataset (given by index).x.axis.size
: Size of x-axis labeling in PCA Plot. Default is 10.y.axis.size
: Size of y-axis labels. Default is 10.dot.size
: Size of dots in PCA plot. Default is 3.legend.size
: Size of legend above Profile plot. Default is 7.width
: Width of the saved pdf file. Default is 6.height
: Height of the saved pdf file. Default is 5.address
: The name of folder that will store the results. Default folder is
the current working directory. The other assigned folder has to be existed under
the current working directory. An output pdf file is automatically created with
the default name of PCAPlot.pdf
. The command address can help to specify where
to store the file as well as how to modify the beginning of the file name.
If address=FALSE, plot will be not saved as pdf file but showed in window.# output a pdf file with 11 PCA plots designSampleSizePCAplot(simulated_datasets)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.