BiocStyle::markdown()
flowGraph is an R package used to identify candidate biomarkers for disease diagnosis in flow cytometry data. It does so by identifying driver cell populations whose abundance changes significantly and independently given a disease.
flowGraph takes cell counts as input and outputs SpecEnr values for each cell population within a flow cytometry sample, based on their expected proportion. SpecEnr accounts for dependencies between cell populations such that we can use it to flag only cell populations whose abundance change is incurred wholly or in part because of its association with a sample class (e.g. healthy vs sick).
flowGraph can be installed via Bioconductor.
You can also install the development version directly from Github using BiocManager:
if (!require("BiocManager")) install.packages('BiocManager') BiocManager::install("aya49/flowGraph")
The theory, proof, and algorithm behind the SpecEnr statistic used in the flowGraph package can be found in the following paper. Please consider citing if you found it helpful.
bibtex:
@article{yue2019identifying, title={Identifying differential cell populations in flow cytometry data accounting for marker frequency}, author={Yue, Alice and Chauve, Cedric and Libbrecht, Maxwell and Brinkman, Ryan}, journal={BioRxiv}, pages={837765}, year={2019}, publisher={Cold Spring Harbor Laboratory} }
The scripts and data from the paper can be downloaded ons Zenodo.
The main goal of flowGraph is to help users interpret which cell populations are differential between samples of different etiologies. To understand how flowGraph defines cell populations, we introduce the cell hierarchy model below. This model will serve as the basis for the analysis done in the flowGraph package.
library(flowGraph)
A cell hierarchy is a directed acyclic graph that maps out all possible cell populations. In this graph, each cell population is a node while relations between cell populations are represented by edges. Each node is labelled by whether or not its corresponding cell population contains cells that does/not contain a subset of markers. For example, if the markers used in an experiment is $A$, $B$, $C$, $D$, then $A^-$ and $A^+$ would represent the cell population with cells that does/not contain marker $A$ (e.g. Figure \ref{fig:ch}).
If you are using threshold gates, $A^+$/$A^-$ is the count of cells with a fluorescent intensity (FI) value higher/lower than the threshold. If you have multiple thresholds, for example 3 thresholds, for $A$, then you would have $A^-$, $A^+$, $A^++$, $A^+++$ with thresholds in between.
If you are using polygon gates, then $A$ would be the name of your polygon gate on any two dimension and $A^-$ would be the cells inside the gate and $A^+$ would be the cells outside the gate.
We also accept cell population names where marker conditions are separated by an underscore e.g. $A^+_B^+$.
no_cores <- 1 data(fg_data_pos2) meta_cell <- get_phen_meta(colnames(fg_data_pos2$count)) suppressWarnings({ pccell <- flowGraph:::get_phen_list(meta_cell, no_cores) }) gr <- set_layout_graph(list(e=pccell$edf, v=meta_cell)) # layout cell hierarchy gr <- ggdf(gr) gr$v$colour <- ifelse(!grepl("[-]",gr$v$phenotype), 1,0) # "nodes with only marker conditions", "other nodes") gr$v$label <- gr$v$phenotype gr$v$v_ind <- gr$v$label_ind <- TRUE gr$e$e_ind <- !grepl("[-]",gr$e$from) & !grepl("[-]",gr$e$to)
knitr::opts_template$set(figure1=list(fig.height=9, fig.width=9)) plot_gr(gr, main="Example cell hierarchy with markers A, B, C, D")
Traditionally, cell populations are quantified by their proportion, or their cell count over the total cell count. A downside to this analysis is that if one cell population is differential, then all cell populations that contain cells that are also in that differential cell population would also be flagged as significantly differential. By incorporating information on relations between cell populations, flowGraph uses the notion of expected proportions and SpecEnr or specific enrichment as a replacement for proportions to isolate only differential cell populations.
This section will run through a simple example of how flowGraph can be used to analyze a set of flow cytometry samples.
Typically, one would input a sample x cell population matrix and a directory where one wants to save all of the flowGraph resuts and plots:
no_cores <- 1 # number of cores to parallelize on. data(fg_data_pos2) fg <- flowGraph( fg_data_pos2$count, # sample x cell population matrix meta=fg_data_pos2$meta, # a data frame with each sample's meta data path="flowGraph_example", # a directory for flowGraph to output results to no_cores=no_cores) # number of cores to use, typically 1 for small data
Click here for a script with an example of how to get
fg_data_pos2$count
starting from a raw .fcs file.
Since this is only for one file, the result is only one row
in the sample x cell population matrix.
One will need to rbind()
many of these rows from many
.fcs files to create a sample x cell population matrix.
One will also need to create a meta data matrix containing at least the class or category of each file (e.g. control vs experiment).
The flowGraph object can be loaded into R from the specified directory:
fg <- fg_load("flowGraph_example")
The fg_data_pos2$count
argument can be generated manually,
or it can be created using the flowType package given a preprocessed .fcs file. The preprocessing should consist of: compensated/unmixing (flow/spectral), transformation, and cleaning.
Preferrably, all non-single cells have been gated and removed from the file. For example:
fg_save(fg, folder_path="flowGraph_example")
This flowGraph object can be further analyzed and modified by using methods described in the following sections.
The package contains two data sets:
- fg_data_fca
[@aghaeepour2013critical]: a real data set comparing healthy
and AML (acute myeloid leukemia) positive subjects; it is known that there is
an outlier sample and that cell population node $CD34^+$ increases in production
in the AML positive subjects' samples.
- fg_data_pos2
: a positive control data set where cell population node
$A^{+}B^{+}C^+$ is artificially increased by 50\%.
- fg_data_pos30
: a positive control data set where cell population node
$A^{+...}B^{+...}C^+$ is artificially increased by 50\%. Note this data set
contains multiple thresholds for markers $A$ and $B$.
Both of these are lists containing elements:
- count
: a sample x cell population numeric matrix containing cell count data.
- meta
: a data frame containing meta data on the samples in count
.
The sample names in the id
corresponds with the row names in count
.
# data(fg_data_fca) data(fg_data_pos2) # data(fg_data_pos30) # ?fg_data_fca # ?fg_data_pos2 # ?fg_data_pos30
To contain information regarding cell population quantification and the
cell hierarchy structure in one place, we use a flowGraph object to conduct
analysis. To initialize a flowGraph object, the user can give as input,
a numeric vector or matrix.
For examples on how all of these options
can be used, see ?flowGraph
.
For our example, we will directly use a numeric matrix as provided by our
fg_data_pos2
data set. By default, flowGraph
will calculate all of the
proportion prop
, and SpecEnr specenr
.
# no_cores <- 1 # number of cores to parallelize on. data(fg_data_pos2) fg <- flowGraph(fg_data_pos2$count, meta=fg_data_pos2$meta, no_cores=no_cores)
By default, calculate_summary
is set to TRUE
so that a default set of
summary statistics will be calculated for the SpecEnr
node feature and prop
edge feature. Note that if the user decides to do this,
the class
in summary_pars
must be the column name in data frame meta
with the class labels. This is set to "class"
by default.
A class in this context is, for example, an experiment or control sample.
If the user does not wish for this to be calculated during construction
of the flowGraph object, the user can set summary_pars
to NULL
in the
flowGraph
function. Note thatsummary_pars
must be specified if the fast
version of flowGraph is used.
meta
can be given/modified at a later time.
Just make sure the meta data is a data frame
containing a id
column where its values correspond to the row names in
the flowGraph objects' feature matrices.
meta <- fg_get_meta(fg) head(meta) mcount <- fg_get_feature(fg, "node", "count") head(rownames(mcount))
The input to flowGraph is a sample x cell population phenotype matrix containing the cell count of each cell population for each sample.
The row names of the input matrix should be sample ID, otherwise, flowGraph will create sample IDs for you.
The column names of the input matrix should be cell population phenotype names
that follow cell population naming conventions.
Markers/measurements must not contain underscores, dashes, or pluses
(_
, +
, -
). If underscores were
used to separate marker/measurement conditions e.g. ssc+_cd45-, the underscores
will be removed from the column names of the features matrices but will be
saved in fg_get_meta(fg)$phenotype_
.
To calculate the SpecEnr of a cell population (e.g. A+B+C+), flowGraph requires that all of its parent (A+B+, B+C+, A+C+) and grandparent cell populations (A+, B+, C+) are available.
The default summary statistics is calculated using the Wilcoxan signed-rank test and adjusted
using the byLayer
adjustment method. This adjustment method is a family-wise method that
multiplies the p-value for each cell population by the number of nodes in
its layer and the total number of layers in the cell hierarchy on which there
exists a cell population node.
Below, we retrieve this summary statistic and list out the cell populations with the most significant p-values as per below.
# get feature descriptions fg_get_summary_desc(fg) # get a summary statistic fg_sum <- fg_get_summary(fg, type="node", summary_meta=list( node_feature="SpecEnr", test_name="t_diminish", class="class", labels=c("exp","control")) ) # fg_sum <- fg_get_summary(fg, type="node", index=1) # same as above # list most significant cell populations p <- fg_sum$values # p values head(sort(p),30)
To make changes to the flowGraph object, see functions that start with fg_
.
Once we have made all the changes necessary, we can save the flowGraph object to a folder. Inside the folder, all the feature values and summary statistics are saved as csv files. Plots for each summary statistic can also optionally be saved to this folder.
The same folder directory is used to load the flowGraph object when needed again.
fg_save(fg, "path/to/user/specified/folder/directory") # save flowGraph object fg <- fg_load("path/to/user/specified/folder/directory") # load flowGraph object
See other flowGraph object initialization options in the Appendix 2.
The following sections will go over additional options for summary statistics, and result interpretation.
The flowGraph object initially contains meta data on the samples and
cell population nodes (phenotypes). The most basic way of understanding what
is inside a flowGraph object is by using show
. This shows a description of
the flowGraph object and returns a list of data frames containing information
on the node and edge features and the summary statitics performed on them
shortly in this vignette.
show(fg)
One can obtain meta data on samples and cell populations as follows.
Note that information on cell populations is given to the user in the form of a
graph
or a list contianing data frames v
and e
. The former represents the
nodes or the cell populatins, and the latter represent edges or the relation
between cell population --- note that edges always point from parent to
child cell populations indicative of whether or not a cell population is a
sub-population of another.
# get sample meta data head(fg_get_meta(fg)) # modify sample meta data meta_new <- fg_get_meta(fg) meta_new$id[1] <- "new_sample_id1" fg <- fg_replace_meta(fg, meta_new)
# get cell population meta data gr <- fg_get_graph(fg) head(gr$v) head(gr$e)
The user can also extract or modify the features inside a flowGraph object.
For adding new features, unless needed, we recommend users stick with the
default feature generation methods that starts with fg_feat_
.
# get feature descriptions fg_get_feature_desc(fg) # get count node feature mc <- fg_get_feature(fg, type="node", feature="count") dim(mc)
# add a new feature; # input matrix must contain the same row and column names as existing features; # we recommend users stick with default feature generation methods # that start with fg_feat_ fg <- fg_add_feature(fg, type="node", feature="count_copy", m=mc) fg_get_feature_desc(fg)
# remove a feature; note, the count node feature cannot be removed fg <- fg_rm_feature(fg, type="node", feature="count_copy") fg_get_feature_desc(fg)
Once a flowGraph object is created, the user can calculate summary statistics for any of the features it contains.
We recommend the user use the fg_summary
function. Its default
summary statistic is the significance T-test along with a byLayer
p-value adjustment method. The user can specify other summary statistics or
adjustment methods by providing the name of the method or a function to
parameters test_custom
or adjust_custom
.
fg_get_summary_desc(fg) # calculate summary statistic fg <- fg_summary(fg, no_cores=no_cores, class="class", label1="control", node_features="count", edge_features="NONE", overwrite=FALSE, test_name="t", diminish=FALSE) fg_get_summary_desc(fg)
# get a summary statistic fg_sum1 <- fg_get_summary(fg, type="node", summary_meta=list( feature="count", test_name="t", class="class", label1="control", label2="exp")) names(fg_sum1)
# remove a summary statistic fg <- fg_rm_summary(fg, type="node", summary_meta=list( feature="count", test_name="t", class="class", label1="control", label2="exp")) fg_get_summary_desc(fg)
# add a new feature summary; # input list must contain a 'values', 'id1', and 'id2' containing summary # statistic values and the sample id's compared; # we recommend users stick with default feature generation method fg_summary fg <- fg_add_summary(fg, type="node", summary_meta=list( feature="SpecEnr", test_name="t_copy", class="class", label1="control", label2="exp"), p=fg_sum1) fg_get_summary_desc(fg)
A summary static statistic, once obtained, is a list containing:
- values
: a vector of p-values for each node or edge.
- id1
and id2
: a vector of sample id's that were compared.
- test_fun
and adjust_fun
: the functions used to test and adjust the
summary statistic.
- m1
and m2
: a vector that summarizes one of the sets of samples compared.
These are not contained inside a flowGraph object but can be calculated on
the spot when retreiving a summary using fg_get_summary
by setting
parameter summary_fun
to a matrix function (default: colSums
) and not
NULL
. This usually does not need to be adjusted.
All of the calculated summaries can be visualized in the form of a
cell hierarchy plot using function fg_plot
. The plot can also be saved as a
PNG file if the path to this PNG file is provided as a string for its
path
parameter. Here, we do not save the plot, but we plot the returned
graph
list gr
given by fg_plot
that contains all the plotting columns
using the plot_gr
function.
# plotting functions default to plotting node feature SpecEnr # labelled with mean expected/proportion (maximum 30 labels are shown for clarity) # and only significant nodes based on the wilcox_byLayer_diminish summary statistic # are shown. # gr <- fg_plot(fg, p_thres=.01, show_bgedges=TRUE, # show background edges # node_feature="SpecEnr", edge_feature="prop", # test_name="t_diminish", label_max=30) gr <- fg_plot(fg, index=1, p_thres=.01, show_bgedges=TRUE) # plot_gr(gr)
While through plot_gr
, fg_plot
uses the ggplot2
package to create static
plot, the user can
also choose to plot gr
as an interactive plot by setting the interactive
parameter to TRUE
using the ggiraph
package.
# interactive version in beta plot_gr(gr, interactive=TRUE)
Summary statistics can also be analyzed using other plots.
For example, the user can plot a static/interactive
ggiraph
QQ plot of a chosen summary statistic. This plots the p-values against
a uniform distribution.
data(fg_data_pos2) fg1 <- flowGraph(fg_data_pos2$count, class=fg_data_pos2$meta$class, no_cores=no_cores)
fg_get_summary_desc(fg) fg_plot_qq(fg, type="node", index=1) fg_plot_qq(fg, type="node", index=1, logged=TRUE) # interactive version fg_plot_qq(fg, type="node", index=1, interactive=TRUE)
To understand how each p-value was obtained, the user can also plot the distribution of values as boxplots for a specific feature between features of different class labels.
fg_plot_box(fg, type="node", summary_meta=NULL, index=1, node_edge="A+")
Another useful plot is to compare the p-value and the difference between the mean of a feature value between samples of different classes. This should look like a volcano plot.
fg_plot_pVSdiff(fg, type="node", summary_meta=NULL, index=1) # interactive version fg_plot_pVSdiff(fg, type="node", summary_meta=NULL, index=1, interactive=TRUE)
The user can also manually specify how the cell hierarchy plot should look. The columns
needed for plotting in plot_gr
can be attached onto the graph
slot of the
fg
the flowGraph
object using the ggdf
function. For more information on
these columns, see ?ggdf
.
gr <- fg_get_graph(fg) gr <- ggdf(gr) gr$v$colour <- ifelse(!grepl("[-]",gr$v$phenotype), 1, 0) # "nodes with only marker conditions", "other nodes") gr$v$label <- gr$v$phenotype gr$v$v_ind <- gr$v$label_ind <- TRUE gr$e$e_ind <- !grepl("[-]",gr$e$from) & !grepl("[-]",gr$e$to) plot_gr(gr, main="Example cell hierarchy with markers A, B, C, D")
If the user is only interested in one set of class labels for a set of samples,
they can choose to use flowGraphSubset
, a faster version of the default constructor
flowGraph
. It is fast because the edge list, proportion, expected proportion,
and SpecEnr features are only calculated for cell populations who are in the
0'th and 1st layer, or have a significant parent population. The assumption here
is that cell populations who are significantly differentially abundant
must also have at least one significantly differentially abundant parent
population, which is true for almost all cases.
However, if the user wants to test different sets of sample class labels on the
same set of samples, we recommend using the default flowGraph
constructor
as it calculates SpecEnr for all cell populations. Since SpecEnr only has to be
calculated once, the user can apply multiple statistically significance tests
and ask questions about different class sets on the same SpecEnr values.
So in summary, ONLY USE THIS OVER flowGraph IF: 1) your data set has more than 10,000 cell populations and you want to speed up your calculation time AND 2) you only have one set of classes you want to test on the SAME SET OF SAMPLES (e.g. control vs experiment).
The parameters for flowGraphSubset
is a bit different than those in flowGraph
.
It is currently in beta, so we recommend reading the manual for it carefully.
data(fg_data_pos2) fg <- flowGraphSubset(fg_data_pos2$count, meta=fg_data_pos2$meta, no_cores=no_cores, summary_pars=flowGraphSubset_summary_pars(), summary_adjust=flowGraphSubset_summary_adjust())
The following is an output of sessionInfo()
on the system on which this
document was compiled.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.