knitr::opts_chunk$set(echo = TRUE)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("curatedTBData")
library(curatedTBData) library(dplyr) library(SummarizedExperiment) library(sva)
The DataSummary
table summarized the list of available studies and
their metadata information contained in the curatedTBData package. \
The table helps users to query avialable datasets quickly.
# Remove GeographicalRegion, Age, DiagnosisMethod, Notes, Tissue, HIVStatus for concise display data("DataSummary", package = "curatedTBData") DataSummary |> dplyr::select(-c(`Country/Region`, Age, DiagnosisMethod, Notes, Tissue, HIVStatus)) |> DT::datatable()
Users can use curatedTBData()
function to access data.
There are three arguments in the function.
The first argument study_name
represents the names of the data that are used
to determine the resources of interests.
Users can find all available resource names from DataSummary$Study
.
The second argument dry.run
enables users to determine the resources's
availability before actually downloading them.
When dry.run
is set to TRUE
, the output includes names of the resources.
The third argument curated.only
allows the users to access
the curated ready-to-use data.
If curated.only
is TRUE
, the function only download the curated
gene expression profile and the clinical annotation of the corresponding data.
If curated.only
is FALSE
, the function downloads all available resources
for input studies.
curatedTBData("GSE19439", dry.run = TRUE, curated.only = FALSE)
To download complete data for a Microarry study (e.g. GSE19439),
we set dry.run = FALSE
and curated.only = FALSE
.
There are two experiments assay being included in the output Microarray studies.
The first experiment is assay_curated
, which is a matrix
that represents
normalized, curated version of the gene expression profile.
The second experiment is assay_raw
, which is a
r Biocpkg("SummarziedExperiment")
object that contains the raw gene expression profile and information about probe features.
GSE19439 <- curatedTBData("GSE19439", dry.run = FALSE, curated.only = FALSE) GSE19439
When accessing data for RNA sequencing studies, another assay called
assay_reprocess
is included. This matrix
represents the reprocessed version
of gene expression profile from the raw .fastq files using
r Biocpkg("Rsubread")
.
GSE79362 <- curatedTBData("GSE79362", dry.run = FALSE, curated.only = FALSE) GSE79362
A list of r Biocpkg("MultiAssayExperiment")
objects is returned
when dry.run
is FALSE
. \
To save running time, in the following example, we set the curated.only = TRUE
and selected five studies that belong to GSE19491
from the Gene Expression Omnibus.
myGEO <- c("GSE19435", "GSE19439", "GSE19442", "GSE19444", "GSE22098") object_list <- curatedTBData(myGEO, dry.run = FALSE, curated.only = TRUE)
object_list[1:2]
A full version example for RNA sequencing data: GSE79362.
GSE79362 <- curatedTBData("GSE79362", dry.run = FALSE, curated.only = FALSE)
GSE79362
The major advantage of using r Biocpkg("MultiAssayExperiment")
is the coordination of the meta-data and assays when sub-setting.
The MultiAssayExperiment object has built-in function for subsetting samples
based on column condition. \
The following code shows how to select samples with only active TB.
GSE19439 <- object_list$GSE19439 GSE19439[, GSE19439$TBStatus == "PTB"]["assay_curated"] # 13 samples
The following example shows how to subset patients with active TB and LTBI
GSE19439[, GSE19439$TBStatus %in% c("PTB", "LTBI")]["assay_curated"] # 30 samples
The combine_objects()
function provides an easy implementation for
combining different studies based on common gene symbols.
The function returns a r Biocpkg("SummarizedExperiment")
object that contains
the merged assay and associated clinical annotation. Noticed that GSE74092 is usually
removed from merging, because it used quantitative PCR,
which did not have enough coverage to capture all genes. \
There are two arguments in the combine_objects()
function.
The first one is object_list
, which takes a list of
r Biocpkg("MultiAssayExperiment")
objects obtained from curatedTBData()
. Notice that the names(object_list)
should not be NULL
and must be unique for each object within the list,
so that we can keep track the original study after merging.
The second argument is experiment_name
, which can be a string or vector of strings representing the name of the assay from the object.
GSE19491 <- combine_objects(object_list, experiment_name = "assay_curated", update_genes = TRUE) GSE19491
When the objects are merged, the original individual data tag can be found
in the Study
section from the metadata.
unique(GSE19491$Study)
It is also possible to combine the given list of objects with
different experiment names. In this case, the experiment_name
is a vector of
string that corresponds to each of object from the input list.
exp <- combine_objects(c(GSE79362[1], object_list[1]), experiment_name = c("assay_reprocess_hg19", "assay_curated"), update_genes = TRUE) exp
If datasets are merged, it is typically recommended to
remove a very likely batch effect.
We use the ComBat()
function from r Biocpkg("sva")
to remove
potential batch effect between studies. \
In the following example, each study is viewed as one batch. The batch corrected
assay will be stored in a r Biocpkg("SummarizedExperiment")
object.
batch1 <- colData(GSE19491)$Study combat_edata1 <- sva::ComBat(dat = assay(GSE19491), batch = batch1) assays(GSE19491)[["Batch_corrected_assay"]] <- combat_edata1 GSE19491
The function subset_curatedTBData()
allows the users to subset a list of
r Biocpkg("MultiAssayExperiment")
with the output contains
the exact conditions given by the annotationCondition
. \
With subset_curatedTBData()
, users can quickly subset desired results
from r Biocpkg("curatedTBData")
database without checking individual object. \
There are four arguments in this function.
The theObject
represents a r Biocpkg("MultiAssayExperiment")
or
r Biocpkg("SummarizedExperiment")
object.
The annotationColName
is a character that indicates the
column name in the metadata. The annotationCondition
is a character or
vector of characters that the users intend to select.
In the following example, we call subset_curatedTBData()
function to
subset samples with active TB (PTB
) and latent TB infection (LTBI
) for
binary classification.
multi_set_PTB_LTBI <- lapply(object_list, function(x) subset_curatedTBData(x, annotationColName = "TBStatus", annotationCondition = c("LTBI", "PTB"), assayName = "assay_curated")) # Remove NULL from the list multi_set_PTB_LTBI <- multi_set_PTB_LTBI[!sapply(multi_set_PTB_LTBI, is.null)] multi_set_PTB_LTBI[1:3]
The HIV status (HIVStatus
) and diabetes status (DiabetesStatus
) for each
subject were also recorded for each study in the r Biocpkg("curatedTBData")
. \
In the following example, we select subjects with HIV positive from the input. \
Users can also find HIV status information for each study by
looking at the column: HIVStatus
from DataSummary
. \
When the the length of the annotationCondition
equals to 1, we can subset using
either MultiAssayExperiment
built-in procedure or subset_curatedTBData
.
multi_set_HIV <- lapply(object_list, function(x) subset_curatedTBData(x, annotationColName = "HIVStatus", annotationCondition = "Negative", assayName = "assay_curated")) # Remove NULL from the list multi_set_HIV <- multi_set_HIV[!vapply(multi_set_HIV, is.null, TRUE)] multi_set_HIV[1:3]
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.