The HDCytoData package is an extensible resource containing a set of publicly available high-dimensional flow cytometry and mass cytometry (CyTOF) benchmark datasets, which have been formatted into SummarizedExperiment and flowSet Bioconductor object formats. The data objects are hosted on Bioconductor's ExperimentHub platform.
Each object contains one or more tables of cell-level expression values, as well as all required metadata. Row metadata includes sample IDs, group IDs, patient IDs, reference cell population or cluster labels (where available), and labels identifying 'spiked in' cells (where available). Column metadata includes channel names, protein marker names, and protein marker classes (cell type, cell state, or non-protein-marker columns).
Note that raw expression values should be transformed prior to any downstream analyses (see below).
Currently, the package includes benchmark datasets used in our previous work to evaluate methods for clustering and differential analyses. The datasets are provided here in SummarizedExperiment and flowSet formats in order to make them easier to access and integrate into R/Bioconductor workflows.
For more details, see our paper describing the HDCytoData package:
The package contains the following datasets, which can be grouped into datasets useful for benchmarking methods for (i) clustering, and (ii) differential analyses.
Clustering:
Differential analyses:
Extensive documentation is available in the help files for the objects. For each dataset, this includes a description of the dataset (e.g. biological context, number of samples and conditions, number of cells, number of reference cell populations, number and classes of protein markers, etc.), as well as an explanation of the object structures, details on accessor functions required to access the expression tables and metadata, and references to original data sources.
File sizes are listed in the help files for the datasets. The removeCache function from the ExperimentHub package can be used to clear the local download cache (see ExperimentHub documentation).
The help files can be accessed by the dataset names, e.g. ?Bodenmiller_BCR_XL or help(Bodenmiller_BCR_XL).
An updated list of all available datasets can also be obtained programmatically using the ExperimentHub accessor functions, as follows. This retrieves a table of metadata from the ExperimentHub database, which includes information such as the ExperimentHub ID, title, and description for each dataset.
suppressPackageStartupMessages(library(ExperimentHub))

# Create ExperimentHub instance
ehub <- ExperimentHub()

# Find HDCytoData datasets
ehub <- query(ehub, "HDCytoData")
ehub

# Retrieve metadata table
md <- as.data.frame(mcols(ehub))
head(md, 2)
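Once the metadata table is available as a data frame, it can be filtered like any other data frame, for example to find the ExperimentHub IDs of the entries stored in SummarizedExperiment format. The sketch below uses a small hand-built stand-in table (md_toy) so that it runs without a download. It assumes that the real table has a title column matching the named loading functions and that its row names are the ExperimentHub IDs; both are assumptions to check against the actual output of head(md).

```r
# Toy stand-in for the metadata table 'md'. The two rows mirror the
# Bodenmiller_BCR_XL entries used in this document. The 'title' column
# and the use of ExperimentHub IDs as row names are assumptions about
# how the real table is laid out.
md_toy <- data.frame(
  title = c("Bodenmiller_BCR_XL_SE", "Bodenmiller_BCR_XL_flowSet"),
  row.names = c("EH2254", "EH2255"),
  stringsAsFactors = FALSE
)

# Keep only entries whose title ends in "_SE" (SummarizedExperiment format)
md_SE <- md_toy[grepl("_SE$", md_toy$title), , drop = FALSE]

# ExperimentHub IDs of the matching entries
ids_SE <- rownames(md_SE)
ids_SE
```

If the assumptions hold, replacing md_toy with md applies the same filter to the full list of datasets.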
This section shows how to load the datasets, using one of the datasets (Bodenmiller_BCR_XL) as an example.
The datasets can be loaded by either (i) referring to named functions for each dataset, or (ii) creating an ExperimentHub instance and referring to the dataset IDs. Both methods are demonstrated below.
See the help files (e.g. ?Bodenmiller_BCR_XL) for details about the structure of the SummarizedExperiment or flowSet objects.
Load the datasets using named functions:
suppressPackageStartupMessages(library(HDCytoData))

# Load 'SummarizedExperiment' object using named function
Bodenmiller_BCR_XL_SE()

# Load 'flowSet' object using named function
Bodenmiller_BCR_XL_flowSet()
Alternatively, load the datasets by creating an ExperimentHub instance:
# Create ExperimentHub instance
ehub <- ExperimentHub()

# Find HDCytoData datasets
query(ehub, "HDCytoData")

# Load 'SummarizedExperiment' object using dataset ID
ehub[["EH2254"]]

# Load 'flowSet' object using dataset ID
ehub[["EH2255"]]
Once the datasets have been loaded from ExperimentHub, they can be used as normal within an R session. For example, using the SummarizedExperiment form of the dataset loaded above:
# Load dataset in 'SummarizedExperiment' format
d_SE <- Bodenmiller_BCR_XL_SE()

# Inspect object
d_SE
length(assays(d_SE))
assay(d_SE)[1:6, 1:6]
rowData(d_SE)
colData(d_SE)
metadata(d_SE)
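The accessor functions used above can also be combined to subset the expression table by its metadata. The following sketch builds a small toy SummarizedExperiment with the same general layout (cells in rows, channels in columns) so that it runs without downloading data. The metadata column names (sample_id, population_id, marker_name, marker_class), the class labels "type" and "state", and all values are illustrative assumptions; the help file for each dataset describes the actual names.

```r
suppressPackageStartupMessages(library(SummarizedExperiment))

# Toy expression table: 4 cells (rows) x 3 channels (columns).
# Values and marker names are made up for illustration only.
exprs <- matrix(
  c(1.2, 0.4, 10.5, 3.3,
    100, 250, 40, 5,
    0, 2.5, 7.1, 9.9),
  nrow = 4,
  dimnames = list(NULL, c("CD3", "CD4", "pS6"))
)

# Metadata column names below are assumptions; check the dataset
# help file (e.g. ?Bodenmiller_BCR_XL) for the names actually used.
toy_SE <- SummarizedExperiment(
  assays = list(exprs = exprs),
  rowData = DataFrame(
    sample_id = c("s1", "s1", "s2", "s2"),
    population_id = c("A", "B", "A", "B")
  ),
  colData = DataFrame(
    marker_name = c("CD3", "CD4", "pS6"),
    marker_class = c("type", "type", "state"),
    row.names = c("CD3", "CD4", "pS6")
  )
)

# Keep only the columns flagged as cell type markers
is_type <- colData(toy_SE)$marker_class == "type"
exprs_type <- assay(toy_SE)[, is_type, drop = FALSE]

# Count cells per sample using the row metadata
cells_per_sample <- table(rowData(toy_SE)$sample_id)
```

With a real dataset, the same pattern would be applied to the loaded object (d_SE) in place of toy_SE, after confirming the metadata column names with rowData(d_SE) and colData(d_SE).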
Note that flow and mass cytometry data should be transformed prior to performing any downstream analyses, such as clustering. A standard transformation is the inverse hyperbolic sine (asinh) with the cofactor parameter set to 5 for mass cytometry (CyTOF) data, or 150 for flow cytometry data (see Bendall et al. 2011, Supplementary Figure S2).
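As a minimal sketch of this transformation, the example below applies asinh with a cofactor to a toy matrix of raw values using base R only. The matrix, its values, and the marker names are made up for illustration; the cofactor simply divides the raw values before asinh is applied.

```r
# Toy matrix of raw expression values: 3 cells (rows) x 2 markers (columns).
# The values and marker names are made up for illustration only.
raw <- matrix(
  c(0, 5, 50,
    10, 100, 1000),
  nrow = 3,
  dimnames = list(NULL, c("marker_A", "marker_B"))
)

# asinh transform with a cofactor: divide by the cofactor, then apply asinh.
# Cofactor 5 is the value quoted for mass cytometry (CyTOF) data;
# 150 would be used for flow cytometry data.
cofactor <- 5
transformed <- asinh(raw / cofactor)

transformed
```

With a real dataset, the same expression would be applied to the expression table returned by assay(), typically restricted to the protein marker columns rather than to non-marker columns.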
Interactive visualizations to explore the datasets can be generated from the SummarizedExperiment objects using the iSEE ("Interactive SummarizedExperiment Explorer") package, available from Bioconductor (Soneson, Lun, Marini, and Rue-Albrecht, 2018). iSEE provides a Shiny-based graphical user interface to explore single-cell datasets stored in the SummarizedExperiment format. For more details, see the iSEE package vignettes.
We welcome contributions or suggestions for new datasets to include in the HDCytoData package. Contribution guidelines are provided in the Contribution guidelines vignette, available from Bioconductor.
If the HDCytoData package is useful in your work, please cite the following paper: