knitr::opts_chunk$set(collapse = TRUE)
The fishHook R package enables agile statistical analysis of coding and
non-coding mutational recurrence in cancer through generalized linear modeling (GLM)
of somatic mutation densities and their heterogeneity along the genome. fishHook
can be applied to the analysis of any collection of genomic intervals (e.g. genes, enhancers, promoters, genomic tiles) or
complex sets of intervals (e.g. genes sets representing pathways, enhancer sets known to interact with a gene). The fishHook
package is integrated with GenomicRanges
and data.table
packages, allowing easy incorporation into bioinformatics workflows
that employ the R/Bioconductor
ecosystem.
fishHook enables nomination of loci following the correction of known covariates of neutral mutation, e.g. chromatin state, replication timing, and nucleotide context. The goal of fishHook is to identify cancer drivers, i.e. loci that are under positive somatic selection and accumulate mutations above "background". This analysis hinges on the application of a correct null / background model, i.e. one that yields near-uniform Q-Q plots for P value distributions.
Though we provide pre-computed covariates and a "black box" command-line tool that applies several generic exome and whole genome analyses, the key power of fishHook lies in its customizability. This includes the ability to easily incorporate custom covariates and provide a framework for the generation and fitting of bespoke models to nominate loci, e.g. modeling variant- and tumor-type specific background mutational processes.
For installation instructions, please visit the fishHook github page. For background, it may help to have some familiarity with data.table
, GenomicRanges
, and gUtils
packages.
If you use fishHook in your work, please cite: Imielinski, Guo, Meyerson. Cell. 2017 Jan 26;168(3):460-472.
We will demonstrate a quick whole exome analysis using public TCGA lung
adenocarcinoma mutation data. Additional packages like gTrack
and
rtracklayer
will help with data import and visualization, but are not necessary to run fishHook
.
library(fishHook) library(gTrack) library(rtracklayer)
library(kableExtra) library(magrittr) library(tidyr)
Read in the mutation data and additional tracks that we will use in this analysis.
## mutation calls cached from public GDAC Broad firehose https://bit.ly/2sFxWY6 mutations = dt2gr(fread('http://mskilab.com/fishHook/hg19/luad.maf')) ## using data.table::fread to read maf and gUtils::dt2gr to convert to GRanges ## GENCODE v19 genes these are our "hypotheses" genes = gr.sub(import('http://mskilab.com/fishHook/hg19/gencode.v19.genes.gtf')) # rtracklayer::import reads gtf and gr.sub replaces chr genes = genes %Q% (gene_type == 'protein_coding') %Q% (level<3) # %Q% is a gUtils subsetting operator for GRanges ## protein coding CDS definitions cds = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/gencode.v19.cds.rds'))) ## bigWig file of fractional coverage of hg19 positions by Agilent exome ## we will use this in combination with cds to define eligible positions exomecov = import('http://mskilab.com/fishHook/hg19/exome_coverage.bw')
Take a peek at our mutations GRanges
object:
head(mutations[, c('Tumor_Sample_Barcode', 'Variant_Type', 'Variant_Classification', 'Reference_Allele', 'Tumor_Seq_Allele2')])
First we define the "eligible territory". This is a key component of all somatic mutational recurrence analyses, since much of the genome is not covered in sequencing studies. For example, in a whole exome sequencing dataset, less than 2% of the genome is reliably captured. In a targeted sequencing panel, this fraction will be even smaller. Even in whole genome sequencing using Illumina short reads, only a subset (70%) of the genome is reliably callable (Li Bioinformatics 2014 Oct 15;30(20):2843–8751).
Eligible territory coverage will influence the "denominator" of our recurrence analysis, i.e. the number of positions in each hypothesis interval where a mutation could have possibly been detected. If we do not take eligible territory into account we will mis-estimate the background mutation rate in a given region.
To define eligible territory for this whole exome analysis, we will choose the portion of cds (protein coding) bases that are captured in at least 95% of whole exomes, which represents about 24MB of genome.
eligible = exomecov %Q% which(score>0.95) # %Q% is a gUtils operator for subsetting on GRanges metadata eligible = reduce(intersect(eligible, cds, ignore.strand = TRUE)) # we intersect and reduce / collapse our prelim eligible intervals with CDS boundaries
We define "events" as nonsynonymous mutations. In this simple model, we will lump together SNVs (of different flavors), and indels (of different flavors). (We discuss more complex models that subdivide mutation types later in the tutorial).
events = mutations %Q% (Variant_Classification != 'Silent') ## using gUtils operator %Q% to subset mutations GRanges
Now that we have loaded our hypotheses
(i.e. genes), events
, and eligible
, we are ready to
create and analyze a basic FishHook
object. Under the hood, object creation triggers
counting of how many events are in the eligible portion of each hypothesis interval. We provide the
idcol
parameter so that each tumor sample (as defined by the
Tumor_Sample_Barcode
column in the events
GRanges
) will provide at most
one event to the counts of each interval.
fish = Fish(hypotheses = genes[, 'gene_name'], events = events, eligible = eligible, idcol = 'Tumor_Sample_Barcode')
fish
We can score
this basic FishHook
object, i.e. compute p values for every
hypothesis, using a simple glm that models a uniform mutation density along the
genome, i.e. the glm fits only an intercept and applies no covariates (after
correcting for the number of eligible bases in each interval).
The $res
field of the FishHook
contains a data.table
of scoring results.
$res
has one row per input hypothesis, with P values, FDRs, effect sizes (fold enrichment
above background), observed and predicted event counts and densities, and additional interval
annotations provided by the user in the hypotheses
GRanges
.
fish$score() head(fish$res %Q% order(p)) fish$qqp()
fish$score() head(gr2dt(fish$res)[order(p), ][, p:= as.character(p)]) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%", height = "200px") fish$qqp(plotly = FALSE) #p = fish$qqp() #p
You will notice that this Q-Q plot appears curved and inflated, though its slope
(lambda
) is reasonably near 1. The low alpha
value (MLE of the alpha parameter
of the Gamma distribution),
suggests that the GLM is detecting additional variance
in the data that is unmodeled by a pure Poisson regression. Adding covariates
to the model should improve the quality of the fit.
The top hits in the plot (you can hover over them) include TP53 but also many unlikely cancer gene candidates. Among these are olfactory receptors, which are located in late replicating regions of the human genome and thus accumulate neutral mutations more frequently (Lawrence et al 2013 Nature Jul 11;499(7457):214-218).
To address these issues, we will load in data specifying replication timing,
chromatin state, and nucleotide context. These are all important determinants
of somatic neutral mutation density. We load these data in as GRanges
objects using functions in
data.table
, rtracklayer
, and gUtils
packages (however you are free
to use any GRanges
import utility of your choice).
We first load in replication timing data as a GRanges
then instantiate it as a Covariate
Replication timing information is contained in the
$score
metadata field of reptime
. We instantiate it as a covariate of type "numeric" by
specifying field score
.
## replication timing for NHEK obtained from https://bit.ly/2sRsXT9 and ## converted to rds via rtracklayer::import reptimedata = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/RT_NHEK_Keratinocytes_Int92817591_hg19.rds'))) ## instantiate covariate around 'score' field, name the Covariate "replication timing" reptime = Cov(data = reptimedata, field = 'score', name = 'ReplicationTiming')
Below, context
is a GRanges
object with 98 columns representing tri, di, and mononucleotide context counts in the
hg19 genome. Code for computing context (e.g. for another genome) is provided here.
We instantiate a numeric Covariate
object from context
, choosing only two of the
columns here to take into account G and C content. Note that the covariate object
can be vectorized (concatenated, subsetted) and instantiated around several
columns of an input GRanges
. As a result, contextcov
will be length 2
(representing G and C nucleotide fraction).
context = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/nucleotide.context.rds'))) gc = Cov(data = context, field = c('C', 'G')) ## instantiate Covariate around G and C fields
Finally we load in chromHMM data for cell line A549 from Epigenomics Roadmap. We will want to create a covariate that will model the fraction of heterochromatic and quiescent regions in each query interval.
To do so, we will create an "interval" covariate by not specifying a metadata field.
### data cached from https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/E114_15_coreMarks_mnemonics.bed.gz chromhmm = gr.sub(import('http://mskilab.com/fishHook/hg19/E114_15_coreMarks_mnemonics.bed.gz'), 'chr', '') ## import from bed then gUtils::gr.sub to strip 'chr' identifier hetchromdata = chromhmm %Q% (name %in% c('8_ZNF/Rpts', '9_Het', '15_Quies')) # %Q% is gUtils operator for subsetting on GRanges metadata, in this case selecting for heterochromatic regions hetchrom = Cov(hetchromdata, name = 'Heterochromatin') ## instantiate interval covariate
We now add these covariates to the model. For type numeric
covariates, e.g. replication
timing, this will trigger the calculation of the average value of each
covariate within the eligible subset of each hypothesis interval. and the fractional overlap of of its eligible subset
This annotation is the most computationally intensive and slowest aspect of
fishHook analyses, though occurs within a few seconds for this small number of covariates.
## note how we can concatenate Covariate objects with c() operator fish$covariates = c(reptime, gc, hetchrom)
Now that we've added covariates, we can re-score fish
and compute p values. Looking at these
results, we see an improvement in lambda
(closer to one) and alpha
(increased), and the nominated gene list (no more olfactory receptors, We
see also a reasonable number of significant (fdr<0.1) genes
at the top of the list (or at the top right of the QQ plot) that have been
biologically implicated in lung adenocarcinoma tumorigenesis.
fish$score() fish$res %Q% (fdr<0.25) %Q% order(p) fish$qqp()
fish$score() gr2dt(fish$res)[order(p), ][fdr<0.25, ][, p:= as.character(p)] %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%", height = "200px") fish$qqp(plotly = FALSE) #p = fish$qqp() #p
Note that this model is still quite rudimentary (e.g. we have lumped together
SNVs and indels, we have not substratified SNVs by mutational context, we have
employed very few covariates) but we still get a reasonable gene list and minimal
statistical inflation.
We may notice that some of the significantly mutated genes are not known to
be expressed in lung adenocarcinoma tissues. We also know that genes that
are more highly expressed in a given tissue have lower mutation rates. We want to merge
in gene expression into the model, but not start from scratch (i.e. create a new model).
We can do this using the $merge
function.
## load GRanges of lung adenocarcinoma average gene experssion exprdata = readRDS(gzcon(file('http://mskilab.com/fishHook/hg19/lung.expr.rds'))) # log transform expression values exprdata$log.tpm = log10(exprdata$tpm+0.01) ## create Covariate for log gene expression, using log.tpm field in the exprdata object expr = Cov(exprdata, field = 'log.tpm', name = 'LungExpression') ## merge / append this new covariate into the FishHook object fish$merge(expr) ## look at the $data field to see the new merged covariate data as an additional column head(fish$data)
Now that we've merged gene expression into this model, we would like to apply it both as a covariate and as a gene filter. We would like to exclude from the analysis genes that are known to have poor expression in lung adenocarcinoma tissue, because these genes are unlikely to harbor driver alterations.
The dimension of the fish object is hypotheses by covariates. To subset rows (i.e. hypotheses) from this model we can use the subsetting feature of the FishHook object via the [
operator. We will use this functionality to apply a strict "expression filter" and keep only genes that are expressed >10 TPM in lung adenocarcinoma.
We will then re-score this subsetted model.
fish = fish[which(fish$data$LungExpression>1), ] ## subset for high lung expression
fish
This object now contains 8765 hypotheses (i.e. genes), which we can re-score to obtain P values and FDRs.
fish$score() fish$res %Q% (fdr<0.25) %Q% order(p) fish$qqp()
fish$score() gr2dt(fish$res)[order(p), ][fdr<0.25, ][, p:= as.character(p)] %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%", height = "200px") fish$qqp(plotly = FALSE) #p = fish$qqp() #p
To inspect the parameters of this model and see which features it is using, we
can employ the$model
accessor:
summary(fish$model)
We can see from the Estimate
and Pr(|>z|)
columns of the Coefficients
table that replication timing and lung gene expression are significantly
negatively correlated and heterochromatin is significantly positively correlated
with mutational density (as expected). However, this table shows that G
and C
content are not significantly correlated with
mutation density. We can use the column subsetting function to remove these covariates and re-score to see how the results change.
fish2 = fish[, -c(2:3)] ## remove the 2nd and 3rd covariates ('G', 'C') fish2$score() ## re-score ## check if our new top genes are identical to the previous - they are identical((fish2$res %Q% (fdr<0.25))$gene_name, (fish2$res %Q% (fdr<0.25))$gene_name)
The results are identical with and without G
and C
covariates. This further suggests that these covariates are not necessary in the model and can likely be excluded.
Read in and parse reactome pathways into list of gene symbols, then match against genes in our model.
## parse Reactome pathways from .gmt format pathways = strsplit(readLines('http://mskilab.com/fishHook/hg19/ReactomePathways.gmt'), '\t') pathways = structure(lapply(pathways, '[', -1), names = sapply(pathways, '[', 1)) ## match them to create sets of indices as a named list sets = sapply(sapply(pathways, match, fish$hypotheses$gene_name), setdiff, NA)
Here is what the pathways and sets look like:
head(pathways[1:2]) head(sets[1:2])
The list sets
contains integer vectors that index fish$hypotheses
.
To run a set analysis, we just set the $sets
variable in the FishHook
object.
This triggers scoring of hypothesis sets (in this case gene
sets). The results of the set analysis are shown in $setres variable.
The set analysis is a bit more computationally intensive. We can speed things up through parallelization (setting `fish$mc.cores = 5).
fish$mc.cores = 5 ## this triggers scoring of gene sets using covariate corrected model fish$sets = sets ## list table of results for top sets values(fish$setres %Q% order(p)[1:5])
fish$mc.cores = 50 fish$sets = sets head(gr2dt(values(fish$setres %Q% order(p)[1:5]))[, p:= as.character(p)]) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%", height = "200px")
Examining the results table, we can see that most of the significant pathways appear
related to TP53. Indeed, if we inspect the hypotheses (i.e. genes) contributing to
these these gene sets, we will see that they are dominated by TP53 and 1-2
additional genes. For example:
## pick top set in setres, ## setres is a GRangesList, containing supporting hypotheses sorted by p value fish$setres %Q% order(p)[1]
gr2dt((fish$setres %Q% order(p)[1])[[1]])[, p:= as.character(p)] %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%", height = "200px")
This is a common challenge with pathway analysis of mutations, since many cancer pathways are usually driven by a single "celebrity" gene. We can dig a little deeper to identify significant gene sets that do not have
TP53 . Here is one approach:
## these are gene sets with TP53 has.tp53 = which(grl.eval(fish$setres, 'TP53' %in% gene_name)) ## we subset on gene sets that do not have TP53 values(fish$setres[-has.tp53, ] %Q% order(p)[1:5])
has.tp53 = which(grl.eval(fish$setres, 'TP53' %in% gene_name)) head(gr2dt(values(fish$setres[-has.tp53, ] %Q% order(p)[1:5]))[, p:= as.character(p)]) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%", height = "200px")
These sets appear interesting and are related to additional biological processes known
to be important to lung adenocarcinoma biology, such EGFR signaling. The top gene set "Extracellular matrix organization" appears especially interesting, because it is not obviously associated with known targets of driver mutations in this disease. Let's examine genes contributing to this top gene set:
## inspect the non-TP53 associated gene set (GRangesList) fish$setres[-has.tp53, ] %Q% order(p)[1]
## use sets list to pull the individual hypothesis results for members of this set head(gr2dt(values((fish$setres[-has.tp53, ] %Q% order(p)[1])[[1]]))[, p:= as.character(p)]) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%", height = "200px")
Interesting! This significant gene set is composed of extracellular matrix genes
(COL5A2, ITGAL) that are not significant on their own. This constitutes a
true pathway level hit.
We can run a similar analysis but choosing only truncating mutations (by subsetting the mutation GRanges). Scoring this new model, we obtain a different mutation list, containing likely candidate drivers with enrichment of frameshift, nonsense, or nonstop indels and SNVs.
## replace events with new subset of mutations (using %Q% subsetting operator from gUtils) fish$events = mutations %Q% (grepl('(Frame_Shift_)|(Nonsense)|(OutOfFrame)|(Nonstop)', Variant_Classification)) ## re-score model and inspect results fish$score() fish$res %Q% (fdr<0.25) %Q% order(p) fish$qqp()
fish$events = mutations %Q% (grepl('(Frame_Shift_)|(Nonsense)|(OutOfFrame)|(Nonstop)', Variant_Classification)) fish$score() gr2dt(fish$res)[order(p), ][fdr<0.25, ][, p:= as.character(p)] %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width = "100%", height = "200px") fish$qqp(plotly = FALSE) #p = fish$qqp() #p
Though TP53, RBM10, SETD2, and ARID1A are well known targets of truncating mutations in lung adenocarcinoma, IL32, TRIB1, SUN1 are interesting candidates that have not previously been associated with lung adenocarcinoma.
Next, we will analyze cancer WGS data that is inspired by the TCGA and NHGRI analyzed in Imielinski, Guo, Meyerson. Cell. 2017 Jan 26;168(3):460-472. The distributions of the mutations are "true", but (as per TCGA guidelines) the ids of cases and the base changes of these somatic mutations have been wiped / changed. However, the provided data are equivalent for identifying mutational hotspots.
## download (de-identified) WGS lung adenocarcinoma SNV's and indels (gzipped ## data.table, which we convert to GRanges with gUtils::dt2gr) muts = dt2gr(fread('http://mskilab.com/fishHook/hg19/luad.wgs.maf.gz'))
Let's separate mutations into SNV's and indels
snv = muts %Q% (Variant_Type == "SNP") indels = muts %Q% (Variant_Type %in% c("INS", "DEL))
We build a covariate model using the same covariates as above but instead of using the coding regions of genes, we will use evenly-spaced overlapping genomic tiles (which we call "supertiles")
## gr.tile gives us non-overlapping 10kb intervals that tile / bin the genome tiles = gr.tile(seqinfo(muts), 1e4) ## %+% shifts these intervals 5kb to the right and combines with the original ## set and so we will now have (almost) every region covered by two tiles supertiles = c(tiles, tiles %+% 5000)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.