knitr::opts_chunk$set(cache = TRUE)
library(cBioPortalData)
library(AnVIL)
The cBioPortal for Cancer Genomics website is a great resource for interactive exploration of study datasets. However, it does not easily allow the analyst to obtain and further analyze the data.
We developed the cBioPortalData package to fill this need by providing
programmatic access to the data resources available on the cBioPortal.
The cBioPortalData package provides an R interface for accessing the
cBioPortal study data within the Bioconductor ecosystem.
It downloads study data from the cBioPortal API (the full API specification can be found here https://cbioportal.org/api) and uses Bioconductor infrastructure to cache and represent the data.
We demonstrate common use cases of cBioPortalData and curatedTCGAData
during Bioconductor conference workshops.
We use the MultiAssayExperiment (@Ramos2017-og) package to integrate,
represent, and coordinate multiple experiments for the studies available in the
cBioPortal. This package, in conjunction with curatedTCGAData, gives access to
a large trove of publicly available bioinformatic data. Please see our
JCO Clinical Cancer Informatics publication (@Ramos2020-ya).
Our free and open source project depends on citations for funding. When using
cBioPortalData, please cite the following publications:
citation("MultiAssayExperiment")
citation("cBioPortalData")
Data are provided as a single MultiAssayExperiment per study. The
MultiAssayExperiment representation usually contains SummarizedExperiment
objects for expression data and RaggedExperiment objects for mutation and
CNV-type data. RaggedExperiment is a data class for representing 'ragged'
genomic location data, meaning that the measurements per sample vary.
For more information, please see the RaggedExperiment and
SummarizedExperiment vignettes.
As we work through the data, there are some datasets that cannot be represented
as MultiAssayExperiment objects. This can be due to a number of reasons, such
as the way the data is handled, the presence of mismatched identifiers, or
invalid data types. To see which datasets are currently not building, we can
refer to getStudies() with the buildReport = TRUE argument.
cbio <- cBioPortal()
studies <- getStudies(cbio, buildReport = TRUE)
head(studies)
The last two columns show the availability of each studyId for
either download method (pack_build for cBioDataPack and api_build for
cBioPortalData).
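As a minimal sketch of how the build report might be used, the snippet below filters a small stand-in table for the studies that build with each method. The `toy_studies` data frame, its study identifiers, and its build flags are all made up for illustration, and the sketch assumes that `pack_build` and `api_build` are logical columns. In practice the table would come from `getStudies(cbio, buildReport = TRUE)`.

```r
# Hypothetical stand-in for the table returned by
# getStudies(cbio, buildReport = TRUE); all values are invented.
toy_studies <- data.frame(
    studyId = c("study_a", "study_b", "study_c"),
    pack_build = c(TRUE, TRUE, FALSE),
    api_build = c(TRUE, FALSE, FALSE),
    stringsAsFactors = FALSE
)

# Studies expected to build with cBioDataPack()
pack_ok <- toy_studies$studyId[which(toy_studies$pack_build)]

# Studies expected to build with cBioPortalData()
api_ok <- toy_studies$studyId[which(toy_studies$api_build)]

# Studies that build with neither method
neither <- toy_studies$studyId[
    which(!toy_studies$pack_build & !toy_studies$api_build)
]
```

Wrapping the conditions in `which()` means that any missing build flags are dropped rather than returned as `NA` entries.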
There are two main user-facing functions for downloading data from the cBioPortal API.
cBioDataPack
makes use of the tarball distribution of study data. This is
useful when the user wants to download and analyze the entirety of the data as
available from the cBioPortal.org website.
cBioPortalData
allows a more flexible approach to obtaining study data
based on the available parameters such as molecular profile identifiers. This
option is useful for users who have a set of gene symbols or identifiers and
would like to get a smaller subset of the data that correspond to a particular
molecular profile.
This function will access the packaged data from cBioPortal.org/datasets and return an integrative MultiAssayExperiment representation.
## Use ask=FALSE for non-interactive use
laml <- cBioDataPack("laml_tcga", ask = FALSE)
laml
This function provides a more flexible and granular way to request a
MultiAssayExperiment object from a study ID, molecular profile, gene panel,
and sample list.
acc <- cBioPortalData(
    api = cbio, by = "hugoGeneSymbol", studyId = "acc_tcga",
    genePanelId = "IMPACT341",
    molecularProfileIds = c("acc_tcga_rppa", "acc_tcga_linear_CNA")
)
acc
Note. To avoid overloading the API service, the API was designed to only query a part of the study data. Therefore, the user is required to enter either a set of genes of interest or a gene panel identifier.
Note that cBioPortalData and cBioDataPack obtain data diligently curated
by the cBioPortal data team. The original data and curation live in the
https://github.com/cBioPortal/cBioPortal GitHub repository. However, despite
the curation efforts, there may be some inconsistencies in identifiers
in the data. These can cause our software to not work as intended, though we
have made efforts to represent all the data from both API and tarball formats.
You may notice that metadata() contains some additional data that could
not be integrated into the MultiAssayExperiment.
metadata(acc)
You will also get a message for studyIds whose data has not been fully
integrated into a MultiAssayExperiment.
cat(
    "Our testing shows that '%s' is not currently building.\n",
    "  Use 'downloadStudy()' to manually obtain the data.\n",
    "  Proceed anyway? [y/n]: y"
)
For this reason, we have also provided the downloadStudy, untarStudy, and
loadStudy functions to allow researchers to simply download the data and,
potentially, curate it manually. Generally, we advise researchers to report
inconsistencies in the data in the cBioPortal data repository.
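The manual route can be sketched as a small helper that chains the three functions. The helper name `manual_study` is hypothetical. The sketch also assumes that `downloadStudy()` returns the path to the downloaded tarball, that `untarStudy()` takes an extraction directory as its second argument and returns the path to the extracted study folder, and that `loadStudy()` accepts that folder. Please check the package help pages before relying on these signatures.

```r
# Hypothetical helper: download, extract, and load one study by hand.
# The cBioPortalData:: calls are only evaluated when the helper is run.
manual_study <- function(study_id, exdir = tempdir()) {
    # Assumed to return the path to the cached study tarball
    tarball <- cBioPortalData::downloadStudy(study_id)
    # Assumed to return the path to the extracted study folder
    study_dir <- cBioPortalData::untarStudy(tarball, exdir)
    # Files in 'study_dir' could be inspected or curated before this step
    cBioPortalData::loadStudy(study_dir)
}

## Not run here because it downloads data:
## acc_manual <- manual_study("acc_tcga")
```

Running the three steps separately, rather than through the helper, leaves room to inspect or fix the extracted files before loading them.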
In cases where a download is interrupted, the user may experience a corrupt
cache. The user can clear the cache for a particular study by using the
removeCache function. Note that this function only works for data downloaded
through the cBioDataPack function.
removeCache("laml_tcga")
For users who wish to clear the entire cBioPortalData cache, it is
recommended that they use:
unlink("~/.cache/cBioPortalData/", recursive = TRUE)
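Base R's `unlink()` only deletes directories when `recursive = TRUE` is set. The sketch below illustrates this on a throwaway folder inside `tempdir()`; the folder name `demo_cache` and the file inside it are made up for illustration and are unrelated to the real cache location.

```r
# Build a throwaway directory that mimics a cache folder with one file in it
demo_cache <- file.path(tempdir(), "demo_cache")
dir.create(demo_cache, showWarnings = FALSE)
file.create(file.path(demo_cache, "study.tar.gz"))

# Without recursive = TRUE, unlink() leaves the directory in place
unlink(demo_cache)
still_there <- dir.exists(demo_cache)

# With recursive = TRUE, the directory and its contents are removed
unlink(demo_cache, recursive = TRUE)
removed <- !dir.exists(demo_cache)
```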
We can use a few variables from the colData slot of the
MultiAssayExperiment to draw a Kaplan-Meier (K-M) plot. First, we load
the necessary packages:
library(survival)
library(survminer)
We can check the data to look out for any issues.
table(colData(laml)$OS_STATUS)
class(colData(laml)$OS_MONTHS)
Now, we clean the data a bit to ensure that our variables are of the right type for the subsequent survival model fit.
collaml <- colData(laml)
collaml[collaml$OS_MONTHS == "[Not Available]", "OS_MONTHS"] <- NA
collaml$OS_MONTHS <- as.numeric(collaml$OS_MONTHS)
colData(laml) <- collaml
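The same cleaning step can be tried on a small stand-in table without downloading anything. This is only a sketch: the `toy` data frame and its values are invented, and a plain `data.frame` is used in place of the `colData` object.

```r
# Toy stand-in for the clinical data; values are invented for illustration
toy <- data.frame(
    OS_MONTHS = c("12.3", "[Not Available]", "40"),
    stringsAsFactors = FALSE
)

# Mark the placeholder string as missing before converting
toy$OS_MONTHS[toy$OS_MONTHS == "[Not Available]"] <- NA

# The column is character on input; convert it to numeric for the model fit
toy$OS_MONTHS <- as.numeric(toy$OS_MONTHS)
```

Replacing the placeholder first means that `as.numeric()` does not have to coerce a non-numeric string, which would otherwise produce a coercion warning.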
We specify a simple survival model using SEX as a covariate and we draw
the K-M plot.
fit <- survfit(
    Surv(OS_MONTHS, as.numeric(substr(OS_STATUS, 1, 1))) ~ SEX,
    data = colData(laml)
)
ggsurvplot(fit, data = colData(laml), risk.table = TRUE)
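The event indicator in the `Surv()` call is taken from the first character of `OS_STATUS`. The sketch below shows that conversion on its own, assuming status strings that begin with a `0` or `1` code, such as `"1:DECEASED"`. The example values are invented and the exact labels in a given study may differ.

```r
# Invented status values that follow an assumed "<code>:<label>" pattern
os_status <- c("1:DECEASED", "0:LIVING", "1:DECEASED", NA)

# Keep only the leading code and convert it to a numeric event indicator
event <- as.numeric(substr(os_status, 1, 1))
```

Missing status values stay missing after the conversion, so they need to be handled before or during the model fit.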
If you are interested in a particular study dataset that is not currently building, please open an issue at our GitHub repository and we will do our best to resolve the issues with the code base. Data issues can be opened at the cBioPortal data repository.
We appreciate your feedback!
Click to see session info
sessionInfo()