View source: R/1_prepareData.R
prepareData | R Documentation |
Prepare data into format for diffcyt pipeline
prepareData(
  d_input,
  experiment_info,
  marker_info,
  cols_to_include = NULL,
  subsampling = FALSE,
  n_sub = NULL,
  seed_sub = NULL
)
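d_input | Input data. Must be a flowSet or a list of flowFrames, data.frames, or matrices (one flowFrame or list item per sample). |
experiment_info | Data frame of experiment information (row meta-data), such as sample IDs and group IDs for each sample. Must contain at least a column named sample_id. |
marker_info | Data frame of marker information (column meta-data). Must contain columns named marker_name and marker_class. |
cols_to_include | Logical vector indicating which columns to include from the input data. Default = all columns. |
subsampling | Whether to use random subsampling to select an equal number of cells from each sample. Default = FALSE. |
n_sub | Number of cells to select from each sample by random subsampling, if subsampling = TRUE. |
seed_sub | Random seed for subsampling. Set to an integer value to generate reproducible results. Default = NULL. |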
Functions in the diffcyt analysis pipeline assume that input data is provided as a SummarizedExperiment object, which contains a single matrix of expression values, together with row and column meta-data.
This function accepts a flowSet or a list of flowFrames, data.frames, or matrices as input (i.e. one flowFrame or list item per sample). The function then concatenates the data tables into a single matrix of values, and adds row and column meta-data.
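The concatenation step can be pictured with a minimal base R sketch. This is an illustration of the idea only, not the package's internal code; the helper name concat_samples and the toy column names are hypothetical.

```r
# Hypothetical helper (not part of diffcyt): stack per-sample tables into one
# matrix and record which sample each row (cell) came from.
concat_samples <- function(d_list) {
  stopifnot(is.list(d_list), length(d_list) > 0, !is.null(names(d_list)))
  # Coerce every item to a matrix so data.frames and matrices are handled alike
  d_list <- lapply(d_list, as.matrix)
  # All samples must share the same columns, in the same order
  cols <- colnames(d_list[[1]])
  stopifnot(all(vapply(d_list, function(d) identical(colnames(d), cols), logical(1))))
  n_cells <- vapply(d_list, nrow, integer(1))
  list(
    exprs = do.call(rbind, d_list),
    sample_id = factor(rep(names(d_list), times = n_cells), levels = names(d_list)),
    n_cells = n_cells
  )
}

set.seed(1)
d_list <- list(
  sample1 = matrix(rnorm(30), ncol = 3, dimnames = list(NULL, c("m1", "m2", "m3"))),
  sample2 = data.frame(m1 = rnorm(5), m2 = rnorm(5), m3 = rnorm(5))
)
res <- concat_samples(d_list)
dim(res$exprs)        # 15 rows (10 + 5 cells), 3 columns
table(res$sample_id)  # number of cells contributed by each sample
```

Keeping a per-row sample label alongside the stacked matrix is what allows sample-level information to be attached to each cell later on.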
Row meta-data should be provided as a data frame named experiment_info, containing columns of relevant experiment information, such as sample IDs and group IDs (for each sample). This must contain at least a column named sample_id.
Column meta-data should be provided as a data frame named marker_info, containing the following columns of marker information. The column names must be as shown.
marker_name: protein marker names (and column names for any other columns)
marker_class: factor indicating the protein marker class for each column of data (usually, entries will be either "type", "state", or "none")
The split into 'cell type' and 'cell state' markers is crucial for the analysis. Cell type markers are used to define cell populations by clustering, and to test for differential abundance of cell populations; while cell state markers are used to test for differential states within cell populations.
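As a small illustration of the type/state split, the sketch below builds a toy marker_info table and pulls out the two marker sets. The marker names are made up for the example, and the checks are a hypothetical validation step rather than anything diffcyt is documented to run.

```r
# Toy marker_info table, using the column names required above
# (marker names here are illustrative only)
marker_info <- data.frame(
  marker_name = c("CD3", "CD4", "pS6", "Time"),
  marker_class = factor(c("type", "type", "state", "none"),
                        levels = c("type", "state", "none")),
  stringsAsFactors = FALSE
)

# Basic sanity checks (hypothetical, not diffcyt's own validation)
stopifnot(
  all(c("marker_name", "marker_class") %in% colnames(marker_info)),
  is.factor(marker_info$marker_class)
)

# Markers that would define cell populations and be used for abundance tests
type_markers <- marker_info$marker_name[marker_info$marker_class == "type"]
# Markers that would be tested for differential states within populations
state_markers <- marker_info$marker_name[marker_info$marker_class == "state"]

type_markers
state_markers
```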
The optional argument cols_to_include allows unnecessary columns (e.g. any columns not containing protein markers) to be discarded.
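One way to build such a logical vector is to match on column names. The sketch below assumes a hypothetical input table whose marker columns all start with "marker"; the resulting vector is what would be passed as the cols_to_include argument.

```r
# Column names of a hypothetical input table: two non-marker columns plus markers
col_names <- c("Time", "Cell_length", "marker01", "marker02", "marker03")

# Logical vector with one entry per input column: TRUE = keep, FALSE = discard
cols_to_include <- grepl("^marker", col_names)
data.frame(col_names, cols_to_include)

# Effect of the same selection on a toy data matrix with those columns
d <- matrix(seq_len(10), nrow = 2, dimnames = list(NULL, col_names))
d_keep <- d[, cols_to_include, drop = FALSE]
colnames(d_keep)
```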
Optionally, random subsampling can be used to select an equal number of cells from each sample (subsampling = TRUE). This can be useful when there are large differences in total numbers of cells per sample, since it ensures that samples with relatively large numbers of cells do not dominate the clustering. However, subsampling should generally not be used when rare cell populations are of interest, due to the significant loss of information if cells from the rare population are discarded.
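The idea of equal-size subsampling can be sketched in base R as follows. This is not the package's implementation: the helper subsample_equal is hypothetical, and defaulting n_sub to the size of the smallest sample is an assumption made for this sketch.

```r
# Hypothetical helper (not diffcyt's implementation): draw n_sub rows per sample
subsample_equal <- function(d_list, n_sub = NULL, seed_sub = NULL) {
  # Assumption for this sketch: default to the size of the smallest sample
  if (is.null(n_sub)) n_sub <- min(vapply(d_list, nrow, integer(1)))
  # Fixing the seed makes the random selection reproducible
  if (!is.null(seed_sub)) set.seed(seed_sub)
  lapply(d_list, function(d) {
    # Never request more cells than the sample contains
    n <- min(n_sub, nrow(d))
    d[sample(seq_len(nrow(d)), n), , drop = FALSE]
  })
}

d_list <- list(
  small = matrix(rnorm(200), ncol = 2),    # 100 cells
  large = matrix(rnorm(20000), ncol = 2)   # 10000 cells
)
d_sub <- subsample_equal(d_list, seed_sub = 123)
vapply(d_sub, nrow, integer(1))   # both samples now have 100 cells

# Repeating the call with the same seed selects the same cells
identical(d_sub, subsample_equal(d_list, seed_sub = 123))
```

In this toy case 9900 of the 10000 cells in the larger sample are discarded, which shows why a rare population in that sample could be lost almost entirely.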
d_se: Returns data as a SummarizedExperiment containing a single matrix of data (expression values) in the assays slot, together with row meta-data (experiment information) and column meta-data (marker information). The metadata slot also contains the experiment_info data frame, and a vector n_cells of the number of cells per sample; these can be accessed with metadata(d_se)$experiment_info and metadata(d_se)$n_cells.
# For a complete workflow example demonstrating each step in the 'diffcyt' pipeline,
# see the package vignette.
# Function to create random data (one sample)
d_random <- function(n = 20000, mean = 0, sd = 1, ncol = 20, cofactor = 5) {
  d <- sinh(matrix(rnorm(n, mean, sd), ncol = ncol)) * cofactor
  colnames(d) <- paste0("marker", sprintf("%02d", 1:ncol))
  d
}
# Create random data (without differential signal)
set.seed(123)
d_input <- list(
  sample1 = d_random(),
  sample2 = d_random(),
  sample3 = d_random(),
  sample4 = d_random()
)
experiment_info <- data.frame(
  sample_id = factor(paste0("sample", 1:4)),
  group_id = factor(c("group1", "group1", "group2", "group2")),
  stringsAsFactors = FALSE
)
marker_info <- data.frame(
  channel_name = paste0("channel", sprintf("%03d", 1:20)),
  marker_name = paste0("marker", sprintf("%02d", 1:20)),
  marker_class = factor(c(rep("type", 10), rep("state", 10)),
                        levels = c("type", "state", "none")),
  stringsAsFactors = FALSE
)
# Prepare data
d_se <- prepareData(d_input, experiment_info, marker_info)