suppressPackageStartupMessages({
suppressMessages({
library(BiocOncoTK)
library(BiocStyle)
library(dplyr)
library(DBI)
library(magrittr)
library(pogos)
library(org.Hs.eg.db)
library(restfulSE)
})
})
This package provides a unified approach to programming with Bioconductor components to address problems in cancer genomics. Central concerns are:
The NCI Thesaurus project distributes an OBO representation of oncotree. We can use this through the ontoProc (devel branch only) and ontologyPlot packages. Code for visualizing the location of 'Glioblastoma' in the context of its 'siblings' in the ontology follows.
library(ontoProc)
library(ontologyPlot)
oto = getOncotreeOnto()
glioTag = names(grep("Glioblastoma$", oto$name, value=TRUE))
st = siblings_TAG(glioTag, oto, justSibs=FALSE)
if (.Platform$OS.type != "windows") {
  onto_plot(oto, slot(st, "ontoTags"), fontsize=50)
}
In conjunction with restfulSE, which handles aspects of the interface to BigQuery, this package provides tools for working with the PanCancer atlas project data.
A key feature distinguishing the pancancer-atlas project from TCGA is the availability of data from normal tissue or metastatic or recurrent tumor samples. Codes are used to distinguish the different sources:
BiocOncoTK::pancan_sampTypeMap
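The letter codes can be related to the two-digit sample type field of a sample-level TCGA barcode. The following is a minimal sketch that assumes a small hand-typed subset of the public TCGA sample type codes. The names `code_map` and `sample_type` and the barcodes are illustrative only and are not part of BiocOncoTK; the package's own table is pancan_sampTypeMap.

```r
# Hand-typed subset of TCGA sample type codes (an assumption for illustration,
# not the contents of BiocOncoTK::pancan_sampTypeMap).
code_map = c("01" = "TP",  # primary solid tumor
             "02" = "TR",  # recurrent solid tumor
             "06" = "TM",  # metastatic
             "11" = "NT")  # solid tissue normal

# Characters 14-15 of a sample-level barcode carry the sample type code.
sample_type = function(barcode) unname(code_map[substr(barcode, 14, 15)])

# Hypothetical barcodes; codes missing from code_map yield NA.
types = sample_type(c("TCGA-XX-0001-01A", "TCGA-XX-0001-11A", "TCGA-XX-0001-99A"))
print(types)
```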
The following code will run if you have a valid setting for the environment variable CGC_BILLING, which allows BiocOncoTK::pancan_BQ() to generate a proper BigQueryConnection.
library(BiocOncoTK)
if (nchar(Sys.getenv("CGC_BILLING"))>0) {
  pcbq = pancan_BQ() # basic connection
  BRCA_mir = restfulSE::pancan_SE(pcbq)
}
The result is
> BRCA_mir
class: SummarizedExperiment
dim: 743 1068
metadata(0):
assays(1): assay
rownames(743): hsa-miR-30d-3p hsa-miR-486-3p ... hsa-miR-525-3p hsa-miR-892b
rowData names(0):
colnames(1068): TCGA-LD-A7W6 TCGA-BH-A18I ... TCGA-E9-A1N9 TCGA-B6-A0X0
colData names(746): bcr_patient_uuid bcr_patient_barcode ... bilirubin_upper_limit days_to_last_known_alive
To shift attention to the normal tissue samples provided, use
BRCA_mir_nor = restfulSE::pancan_SE(pcbq, assaySampleTypeCode="NT")
to find
class: SummarizedExperiment
dim: 743 90
metadata(0):
assays(1): assay
rownames(743): hsa-miR-7641 hsa-miR-135a-5p ... hsa-miR-1323 hsa-miR-520d-5p
rowData names(0):
colnames(90): TCGA-BH-A18P TCGA-BH-A18S ... TCGA-E9-A1N6 TCGA-E9-A1N9
colData names(746): bcr_patient_uuid bcr_patient_barcode ... bilirubin_upper_limit days_to_last_known_alive
The intersection of the colnames from the two SummarizedExperiments thus formed (patients contributing both solid tumor and matched normal) has length 89.
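The overlap count can be obtained with base R's `intersect`. The sketch below uses made-up patient identifiers in place of the real column names, so the vectors `tumor_ids` and `normal_ids` are purely illustrative.

```r
# Toy stand-ins for colnames(BRCA_mir) and colnames(BRCA_mir_nor).
tumor_ids  = c("P1", "P2", "P3", "P4")
normal_ids = c("P2", "P4", "P5")

# Patients contributing both a tumor and a matched normal sample.
both = intersect(tumor_ids, normal_ids)
n_both = length(both)
print(both)

# With the real objects, the analogous call would be
# length(intersect(colnames(BRCA_mir), colnames(BRCA_mir_nor)))
```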
You need to know what type of sample has been assayed for the tumor type of interest.
Here is how you find the candidates.
bqcon %>% tbl(pancan_longname("rnaseq")) %>%
  filter(Study=="GBM") %>%
  group_by(SampleTypeLetterCode) %>%
  summarise(n=n())
To get RNA-seq on recurrent GBM samples:
pancan_SE(bqcon,
  colDFilterValue="GBM",
  tumorFieldValue="GBM",
  assayDataTableName=pancan_longname("rnaseq"),
  assaySampleTypeCode="TR",
  assayFeatureName="Symbol",
  assayValueFieldName="normalized_count")
Suppose we want to work with the mRNA, RPPA, 27k/450k merged methylation and miRNA data together. We can invoke pancan_SE again, specifying the appropriate tables and fields.
BRCA_mrna = pancan_SE(pcbq,
  assayDataTableName = pancan_longname("rnaseq"),
  assayFeatureName = "Entrez",
  assayValueFieldName = "normalized_count")
BRCA_rppa = pancan_SE(pcbq,
  assayDataTableName = pancan_longname("RPPA"),
  assayFeatureName = "Protein",
  assayValueFieldName = "Value")
BRCA_meth = pancan_SE(pcbq,
  assayDataTableName = pancan_longname("27k")[2],
  assayFeatureName = "ID",
  assayValueFieldName = "Beta")
After obtaining the clinical data for BRCA with
library(dplyr)
library(magrittr)
clinBRCA = pcbq %>% tbl(pancan_longname("clinical")) %>%
  filter(acronym=="BRCA") %>% as.data.frame()
rownames(clinBRCA) = clinBRCA[,2]
clinDF = DataFrame(clinBRCA)
we use
library(MultiAssayExperiment)
brcaMAE = MultiAssayExperiment(
  ExperimentList(rnaseq=BRCA_mrna, meth=BRCA_meth, rppa=BRCA_rppa,
    mirna=BRCA_mir), colData=clinDF)
to generate brcaMAE. No assay data are present in this object, but data are retrieved on request.
> brcaMAE
A MultiAssayExperiment object of 4 listed
 experiments with user-defined names and respective classes.
 Containing an ExperimentList class object of length 4:
 [1] rnaseq: SummarizedExperiment with 20531 rows and 1097 columns
 [2] meth: SummarizedExperiment with 22601 rows and 1067 columns
 [3] rppa: SummarizedExperiment with 259 rows and 873 columns
 [4] mirna: SummarizedExperiment with 743 rows and 1068 columns
Features:
 experiments() - obtain the ExperimentList instance
 colData() - the primary/phenotype DataFrame
 sampleMap() - the sample availability DataFrame
 `$`, `[`, `[[` - extract colData columns, subset, or experiment
 *Format() - convert into a long or wide DataFrame
 assays() - convert ExperimentList to a SimpleList of matrices
It is convenient to check for sample availability for the different assays using upsetSamples in MultiAssayExperiment.
The following code produces figure 1 of the restfulSE supplement.
library(BiocOncoTK)
infilGenes = c(`PD-L1`="CD274", `PD-L2`="PDCD1LG2", CD8A="CD8A")
tumcodes = c("COAD", "STAD", "UCEC")
combs = expand.grid(tumcode=tumcodes, ali=names(infilGenes), stringsAsFactors=FALSE)
combs$sym = infilGenes[combs$ali]
bq = pancan_BQ()
exprByMSI = function(bq, tumcode, genesym, alias) {
  print(tumcode)
  if (missing(alias)) alias=genesym
  ex = bindMSI(buildPancanSE(bq, tumcode, assay="RNASeqv2"))
  ex = replaceRownames(ex)
  data.frame(
    patient_barcode=colnames(ex),
    acronym=tumcode,
    symbol = genesym,
    alias = alias,
    log2ex=log2(as.numeric(SummarizedExperiment::assay(ex[genesym,]))+1),
    msicode = ifelse(ex$msiTest >= 4, ">=4", "<4"))
}
allshow = lapply(1:nrow(combs), function(x) exprByMSI(bq, combs$tumcode[x], combs$sym[x], combs$ali[x]))
rr = do.call(rbind, allshow)
library(ggplot2)
png(file="microsatpan2.png")
ggplot(rr, aes(msicode, log2ex)) + geom_boxplot() +
  facet_grid(acronym~alias) +
  ylab("log2(normalized expr. + 1)") +
  xlab("microsatellite instability score")
dev.off()
The ggMutDens, ggFeatDens and ggFeatureSegs functions were created to support the image given here. ggMutDens in particular depends upon a working BigQuery connection to the ISB-CGC PanCan-atlas project.
The detailed code for this display is:
library(BiocOncoTK)
library(AnnotationHub)
ah = AnnotationHub()
tc = ah[["AH5090"]]
tc$name = "TF"
tfplot = ggFeatDens(tc, mcolvbl="name")
library(EnsDb.Hsapiens.v75)
segplot=ggFeatureSegs()
bq = pancan_BQ() # requires that CGC_BILLING is set
mutplot = ggMutDens(bq)
library(cowplot)
plot_grid(mutplot, tfplot, segplot, align="v", nrow=3)
The API for pancan_SE in restfulSE is complicated.
args(restfulSE::pancan_SE)
Long, metadata-laden names are used for some tables, the clinical characteristics table has over 700 variables, and fields bearing information common to different tables may not have common names. Help is needed to permit programming for integrative analysis. BiocOncoTK provides the following assistance:
- pancan_app: a shiny app that provides interactive table and data overviews
- pancan_longname: a helper for generating the long table names using a hint that will be processed by agrep: pancan_longname("rnaseq")
- pancan_BQ: a function that will generate a BigQueryConnection instance provided billing code and Google authentication succeed.

We assume that an ISB-CGC Google BigQuery billing number is assigned to the environment variable CGC_BILLING.
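Because every BigQuery example depends on this variable, it can help to test for it once before attempting a connection. The helper below is a minimal sketch in base R; `have_billing` is a hypothetical name and is not a BiocOncoTK function.

```r
# Returns TRUE when the named environment variable is set to a non-empty value.
# The default name follows the convention used in this vignette.
have_billing = function(var = "CGC_BILLING") nzchar(Sys.getenv(var))

if (have_billing()) {
  message("CGC_BILLING is set; BigQuery examples can be attempted.")
} else {
  message("CGC_BILLING is not set; BigQuery examples will be skipped.")
}
```

The variable can be set for the current session with `Sys.setenv(CGC_BILLING = ...)` or persistently in `.Renviron`.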
First we list the tables available and have a look at the RNA-seq table.
billco = Sys.getenv("CGC_BILLING")
if (nchar(billco)>0) {
  con = DBI::dbConnect(bigrquery::bigquery(), project="isb-cgc",
    dataset="TARGET_hg38_data_v0", billing=billco)
  DBI::dbListTables(con)
  con %>% tbl("RNAseq_Gene_Expression") %>% glimpse()
}
## Observations: NA
## Variables: 16
## $ project_short_name <chr> "TARGET-RT", "TARGET-RT", "TARGET-RT", "TARGE...
## $ case_barcode <chr> "TARGET-52-PARPFY", "TARGET-52-PARPFY", "TARG...
## $ sample_barcode <chr> "TARGET-52-PARPFY-11A", "TARGET-52-PARPFY-11A...
## $ aliquot_barcode <chr> "TARGET-52-PARPFY-11A-01R", "TARGET-52-PARPFY...
## $ gene_name <chr> "RIC8B", "ATOH7", "ZNF532", "XKR5", "RP11-33O...
## $ gene_type <chr> "protein_coding", "protein_coding", "protein_...
## $ Ensembl_gene_id <chr> "ENSG00000111785", "ENSG00000179774", "ENSG00...
## $ Ensembl_gene_id_v <chr> "ENSG00000111785.17", "ENSG00000179774.8", "E...
## $ HTSeq__Counts <int> 2396, 35, 5367, 17, 323, 1718, 1, 4, 3151, 25...
## $ HTSeq__FPKM <dbl> 3.212811104, 0.247184268, 4.693986615, 0.0353...
## $ HTSeq__FPKM_UQ <dbl> 7.790066e+04, 5.993448e+03, 1.138145e+05, 8.5...
## $ case_gdc_id <chr> "5cdd05ea-5285-50b7-971a-8bc005d01669", "5cdd...
## $ sample_gdc_id <chr> "7448bf2b-4ba0-5f98-ad0f-e87fa6619a43", "7448...
## $ aliquot_gdc_id <chr> "TARGET-52-PARPFY-11A-01R", "TARGET-52-PARPFY...
## $ file_gdc_id <chr> "f31fe296-402e-4e7d-b072-e4a6571a9c8a", "f31f...
## $ platform <chr> "Illumina", "Illumina", "Illumina", "Illumina...
Now let's see what tumor types are available.
if (nchar(billco)>0) {
  con %>% tbl("RNAseq_Gene_Expression") %>%
    select(project_short_name) %>%
    group_by(project_short_name) %>%
    summarise(n=n())
}
## # Source: lazy query [?? x 2]
## # Database: BigQueryConnection
##   project_short_name        n
##   <chr>                 <int>
## 1 TARGET-NBL          9495831
## 2 TARGET-AML         11310321
## 3 TARGET-RT            302415
## 4 TARGET-WT           7983756
NBL is neuroblastoma, AML is acute myeloid leukemia, RT is rhabdoid tumor, and WT is Wilms' tumor.
Figure 3a of Barretina et al 2012 shows that cell lines with NRAS mutations can be ordered according to a measure of PD-0325901 activity, and that this drug activity measure is correlated with expression of AHR. We will acquire the mutation and expression data using BigQuery as provided by ISB.
Here is a listing of all tables:
billco = Sys.getenv("CGC_BILLING")
if (nchar(billco)>0) {
  con = DBI::dbConnect(bigrquery::bigquery(), project="isb-cgc",
    dataset="ccle_201602_alpha", billing=billco)
  DBI::dbListTables(con)
}
## [1] "AffyU133_RMA_expression" "Copy_Number_segments"
## [3] "DataFile_info" "Mutation_calls"
## [5] "Sample_information" "fastqc_metrics"
First we get an overview of the content:
muttab = con %>% tbl("Mutation_calls")
length(muttab %>% colnames())
muttab %>%
  select(Cell_line_primary_name, Hugo_Symbol, Variant_Classification, cDNA_Change) %>%
  glimpse()
## [1] 53
Now let's filter by NRAS and get a feel for how many observations are returned per cell line.
nrastab = muttab %>%
  select(Variant_Classification, Hugo_Symbol, Cell_line_primary_name, CCLE_name) %>%
  filter(Hugo_Symbol == "NRAS") %>%
  group_by(Hugo_Symbol)
nrastab %>% summarise(n=n())
nrasdf = nrastab %>% as.data.frame()
We need to carve up the CCLE name to get the organ.
spl = function(x) {
  z = strsplit(x, "_")
  fir = vapply(z, function(x)x[1], character(1))
  rest = vapply(z, function(x) paste(x[-1], collapse="_"), character(1))
  list(fir, rest)
}
nrasdf$organ = spl(nrasdf$CCLE_name)[[2]]
nrasdf = load_nrasdf()
head(nrasdf)
table(nrasdf$organ)
prim_names = as.character(nrasdf$Cell_line_primary_name)
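To make the splitting step concrete, here is a small self-contained run of the same `spl` logic. The two names are invented and only imitate the LINE_ORGAN pattern assumed for CCLE_name; the second shows that an organ label containing underscores is kept whole.

```r
# Same splitting logic as above: first token is the cell line,
# everything after the first underscore is treated as the organ.
spl = function(x) {
  z = strsplit(x, "_")
  fir = vapply(z, function(x) x[1], character(1))
  rest = vapply(z, function(x) paste(x[-1], collapse = "_"), character(1))
  list(fir, rest)
}

# Hypothetical names in the LINE_ORGAN style.
parts = spl(c("LINEA_SKIN", "LINEB_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE"))
print(parts[[1]])  # cell line tokens
print(parts[[2]])  # organ labels
```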
Let's obtain the expression of AHR for these NRAS-mutated cell lines.
ccexp = con %>% tbl("AffyU133_RMA_expression")
ccexp %>% glimpse()
ccexp %>%
  select(Cell_line_primary_name, RMA_normalized_expression, HGNC_gene_symbol) %>%
  filter(HGNC_gene_symbol == "AHR") %>%
  filter(Cell_line_primary_name %in% nrasdf$Cell_line_primary_name) %>%
  as.data.frame() -> NRAS_AHR
head(NRAS_AHR)
NRAS_AHR = load_NRAS_AHR()
head(NRAS_AHR)
The pogos package (submitted, see github.com/vjcitn/pogos) includes software to query pharmacodb.pmgenomics.ca. We will use this to develop drug-response profiles for PD-0325901.
library(pogos)
ccleNRAS = DRTraceSet(NRAS_AHR[,1], drug="PD-0325901")
plot(ccleNRAS)
ccleNRAS = load_ccleNRAS()
if (.Platform$OS.type != "windows") {
  plot(ccleNRAS)
}
We'll define a responsiveness function that takes a function f, which is applied to the responses component of each dose-response profile.
responsiveness = function (x, f) {
  r = sapply(slot(x, "traces"), function(x) f(slot(slot(x,"DRProfiles")[[1]],"responses")))
  data.frame(Cell_line_primary_name = slot(x,"cell_lines"), resp = r,
    drug = slot(x,"drug"), dataset = x@dataset)
}
The activity area for a compound in this design is defined as
AA = function(x) sum((pmax(0, x/100)))
head(rr <- responsiveness(ccleNRAS, AA))
summary(rr$resp)
This is based on the supplement to Barretina et al. 2012. (There is a slightly different formula in the addendum, which uses notation that includes multiplying by a factor of i for dose index level i.)
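A short worked example may help with the arithmetic of the unweighted form. The response values below are invented and assumed to be on a 0 to 100 scale; negative responses are truncated to zero before summing.

```r
# Activity area as defined above: rescale to [0, 1], floor at 0, then sum.
AA = function(x) sum(pmax(0, x / 100))

# Made-up responses at four dose levels.
toy_responses = c(90, 50, -5, 10)

# 0.9 + 0.5 + 0 + 0.1
aa_toy = AA(toy_responses)
print(aa_toy)
```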
Let's merge the responsiveness data with the expression data for gene AHR.
rexp = merge(rr, NRAS_AHR)
rexp[1:2,]
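Once the two tables are merged, the association between activity and expression can be summarized. The sketch below uses invented values with the same column names as above; the Spearman rank correlation is one reasonable choice made here for illustration and is not claimed to be the statistic used by Barretina et al.

```r
# Toy responsiveness and expression tables (all values are made up).
rr_toy = data.frame(
  Cell_line_primary_name = c("lineA", "lineB", "lineC", "lineD"),
  resp = c(1.2, 3.4, 2.1, 4.0))
ex_toy = data.frame(
  Cell_line_primary_name = c("lineB", "lineA", "lineD", "lineE"),
  RMA_normalized_expression = c(7.5, 5.1, 8.2, 6.0))

# merge() joins on the shared column and keeps only lines present in both.
rexp_toy = merge(rr_toy, ex_toy)
print(rexp_toy)

# Rank correlation between activity area and expression.
rho = cor(rexp_toy$resp, rexp_toy$RMA_normalized_expression, method = "spearman")
print(rho)
```

Note that cell lines missing from either table are dropped silently by the default inner join, so it is worth comparing `nrow` before and after merging.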
The CLUE platform is an interface to results of work on the connectivity map at the Broad Institute. Usage of functions in this toolkit requires an API key, which can be acquired through registration at clue.io. Set the environment variable CLUE_KEY so that it can be found by Sys.getenv; this allows the default key parameter of the functions described here to be used.
A basic purpose of the interface to CLUE is to allow identification of gene signatures of perturbations in specific cellular contexts.
We have serialized data on cell lines and perturbagens available in the GSE70138 snapshot of LINCS.
data(cell_70138)
names(cell_70138)
table(cell_70138$primary_site)
data(pert_70138)
dim(pert_70138)
names(pert_70138)
A number of API services have demonstration query expressions available in the package:
cd = clueDemos()
names(cd)
cd$sigs
We use query_clue to query a service. Here we ask for perturbagens that have EGFR among their targets. We'll retrieve a single 'gold' signature identifier.
if (nchar(Sys.getenv("CLUE_KEY"))>0) {
  lkbytarg = query_clue(service="perts", filter=list(where=list(target="EGFR")))
  print(names(lkbytarg[[1]]))
  sig1 = lkbytarg[[1]]$sig_id_gold[1]
}
Now we obtain the metadata about this signature.
if (nchar(Sys.getenv("CLUE_KEY"))>0) {
  sig1d = query_clue(service="sigs", filter=list(where=list(sig_id=sig1)))
  print(names(sig1d[[1]]))
  print(head(sig1d[[1]]$pert_iname)) # perturbagen
  print(head(sig1d[[1]]$cell_id)) # cell type
  print(head(sig1d[[1]]$dn50_lm)) # some downregulated genes among the landmark
  print(head(sig1d[[1]]$up50_lm)) # some upregulated genes among the landmark
}
Task: Assess the effects of perturbagens on transcription in the NPC cell line. We'll check for recurrence of landmark genes among the top 50 upregulated for perturbagens that are identified as HDAC inhibitors.
# use pertClasses() to get names of perturbagen classes in Clue
if (nchar(Sys.getenv("CLUE_KEY"))>0) {
  tuinh = query_clue("perts",
    filter=list(where=list(pcl_membership=list(inq=list("CP_HDAC_INHIBITOR")))))
  inames_tu = sapply(tuinh, function(x)x$pert_iname)
  npcSigs = query_clue(service="sigs", filter=list(where=list(cell_id="NPC")))
  length(npcSigs)
  gns = lapply(npcSigs, function(x) x$up50_lm)
  perts = lapply(npcSigs, function(x) x$pert_iname)
  touse = which(perts %in% inames_tu)
  rec = names(tab <- sort(table(unlist(gns[touse])),decreasing=TRUE)[1:5])
  cbind(select(org.Hs.eg.db, keys=rec, columns="SYMBOL"), n=as.numeric(tab))
}
We can abstract from this process a function that takes perturbagen classes and cell lines to deliver collections of LINCS signatures of genes considered to produce transcriptional activities of certain kinds.
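One possible shape for the counting part of such a function is sketched below. It operates on a list of signature records that have already been retrieved, each assumed to carry `pert_iname` and `up50_lm` fields as in the code above. The name `top_recurrent` and the toy records are hypothetical, and no CLUE request is made.

```r
# Count how often each gene appears among the up-regulated landmark genes
# of signatures whose perturbagen is in pert_names; return the n most frequent.
top_recurrent = function(sigs, pert_names, n = 5) {
  perts = vapply(sigs, function(x) x$pert_iname, character(1))
  keep = perts %in% pert_names
  genes = unlist(lapply(sigs[keep], function(x) x$up50_lm))
  tab = sort(table(genes), decreasing = TRUE)
  tab[seq_len(min(n, length(tab)))]
}

# Toy signature records with made-up perturbagen names and gene identifiers.
toy_sigs = list(
  list(pert_iname = "drugA", up50_lm = c("1001", "2002", "3003")),
  list(pert_iname = "drugB", up50_lm = c("1001", "2002")),
  list(pert_iname = "drugC", up50_lm = c("1001", "9999")),
  list(pert_iname = "drugA", up50_lm = c("1001")))

rec_toy = top_recurrent(toy_sigs, c("drugA", "drugB"), n = 2)
print(rec_toy)
```

A fuller version would wrap the two `query_clue` calls as well, taking the perturbagen class and cell line as arguments.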
In this section we illustrate different modalities for acquiring and working with single cell transcriptomics data, after processing by the CONQUER workflow.
The Patel et al. experiment assayed 864 cells.
A standard in-memory representation is straightforward.
The curated SummarizedExperiment is distributed in an AWS S3 bucket sponsored by the Bioconductor Foundation. The loadPatel function retrieves this and places it in a BiocFileCache instance.
if (interactive()) {
  patelSE = loadPatel() # uses BiocFileCache
  patelSE
  assay(patelSE[1:4,1:3]) # in memory
}
Exploratory analysis of this dataset is described in the companion vignette on single cell transcriptomics for GBM.
The Darmanis et al. experiment assayed over 3500 cells. The
CONQUER compressed RDS representation of all the data is about
4 GB on disk. The gene level quantifications and sample-level data
were manually extracted from this archive. The gene level quantifications
in the count_lstpm
form were then loaded into a public HDF object store
sponsored by John Readey. These data will persist in this format for some time; a
Bioconductor-sponsored representation will be introduced as soon as possible.
darmSE = BiocOncoTK::darmGBMcls # count_lstpm from CONQUER
darmSE
assay(darmSE) # out of memory
BiocOncoTK is a result of work carried out under NCI ITCR U01 "Accelerating
cancer genomics with cloud-scale Bioconductor". This package illustrates
several Bioconductor-based representations of cancer data and metadata.
Some of the resources, such as the PanCancer atlas, CCLE, and high-resolution
single-cell transcriptomics studies are sufficiently large that cloud-oriented
representation and analysis may be cost-effective. As this package matures,
additional resources will be highlighted, with particular attention to
integration processes.