knitr::opts_chunk$set(warning = FALSE, message = FALSE)
In this vignette, we demonstrate how to use annoFuse to identify putative oncogenic fusions in a filtered fusion list, and then plot recurrent fusions, recurrently fused genes, and a summary of the calls.
We start by loading annoFuse and the other required packages in the chunk below.
library("annoFuse")
suppressPackageStartupMessages(library("readr"))
suppressPackageStartupMessages(library("dplyr"))
suppressPackageStartupMessages(library("qdapRegex"))
Here, we present annoFuse, an R package developed to annotate and filter expressed gene fusions, along with highlighting artifact filtered novel fusions.
FilteredFusionAnnoFuse.tsv was obtained from STAR-Fusion, arriba, and RSEM expression (FPKM) files from release-v16-20200320 of OpenPBTA.
We used FusionAnnotator to annotate the arriba files, specifically to identify "red flag" fusions found in healthy tissues or in gene homology databases. These annotations are saved in the column "annots" to match the annotation in STAR-Fusion calls.
The merged dataset was processed using the following steps (fusion_standardization -> fusion_filtering_QC -> expression_filter_fusion -> annotate_fusion_calls).
Please refer to the filtering criteria available per function. For OpenPBTA samples, we also require some project-specific fields to visualize and explore the data.
fusion_calls <- read_tsv(system.file("extdata", "FilteredFusionAnnoFuse.tsv", package = "annoFuseData"))

# Distances are removed here so that all intergenic fusions are captured in the counts;
# the distances within () were making them count as unique instead of the same fusion
fusion_calls$FusionName <- unlist(lapply(
  fusion_calls$FusionName,
  function(x) rm_between(x, "(", ")", extract = FALSE)
))

cols_fusioncalls <- c("LeftBreakpoint", "RightBreakpoint", "FusionName", "Gene1A", "Gene1B", "Sample")
head(fusion_calls[, cols_fusioncalls])
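To make the effect of stripping the parenthesized distances concrete, here is a minimal base R sketch on made-up fusion names. The gene symbols and distances are invented, and `strip_distance` is a hypothetical helper that uses `gsub()` rather than `qdapRegex::rm_between()`; it assumes the distance always appears as a parenthesized suffix.

```r
# Hypothetical fusion names; the "(12345)" style distance suffixes are invented.
fusion_names <- c("GENEA--GENEB(12345)", "GENEA--GENEB(67890)", "GENEC--GENED")

# Drop any parenthesized text so calls on the same gene pair share one name.
strip_distance <- function(x) trimws(gsub("\\([^)]*\\)", "", x))

cleaned <- strip_distance(fusion_names)

# Count how often each cleaned fusion name occurs.
table(cleaned)
```

Without this step, the two GENEA--GENEB calls would be tallied as two different fusions.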
If a fusion has been previously reported in TCGA, or if it contains gene partners that are known oncogenes, tumor suppressor genes, COSMIC genes, and/or transcription factors, then we consider the fusion call to be known oncogenic or putative oncogenic.
# Add reference gene list containing known oncogenes, tumor suppressors, kinases, and transcription factors
geneListReferenceDataTab <- read.delim(
  system.file("extdata", "genelistreference.txt", package = "annoFuseData"),
  stringsAsFactors = FALSE
)

# Add fusion list containing previously reported oncogenic fusions.
fusionReferenceDataTab <- read.delim(
  system.file("extdata", "fusionreference.txt", package = "annoFuseData"),
  stringsAsFactors = FALSE
)

# filter for driver fusions
putative_driver_fusions <- fusion_driver(
  standardFusioncalls = fusion_calls,
  annotated = TRUE,
  geneListReferenceDataTab = geneListReferenceDataTab,
  fusionReferenceDataTab = fusionReferenceDataTab,
  checkDomainStatus = TRUE
)

# aggregate caller
putative_driver_fusions <- aggregate_fusion_calls(putative_driver_fusions, removeother = FALSE)
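The rule described above (a fusion is putative oncogenic if it was previously reported or if a partner gene is in the reference gene list) can be sketched on toy data. This is an illustration only and not the implementation of `fusion_driver()`: the gene symbols, `reference_genes`, and `reference_fusions` are invented, and the sketch ignores domain status and the Gene2A/Gene2B columns.

```r
# Toy fusion calls; all gene symbols are invented for illustration.
calls <- data.frame(
  FusionName = c("GENE1--GENE2", "GENE3--GENE4", "GENE5--GENE6"),
  Gene1A = c("GENE1", "GENE3", "GENE5"),
  Gene1B = c("GENE2", "GENE4", "GENE6"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-ins for the gene and fusion reference tables.
reference_genes <- c("GENE2")            # e.g. a known oncogene
reference_fusions <- c("GENE3--GENE4")   # e.g. a previously reported fusion

# A call is flagged if either partner is a reference gene
# or the fusion itself is in the reference fusion list.
in_gene_list <- calls$Gene1A %in% reference_genes | calls$Gene1B %in% reference_genes
in_fusion_list <- calls$FusionName %in% reference_fusions
calls$putative_driver <- in_gene_list | in_fusion_list

calls[calls$putative_driver, ]
```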
Putative driver fusions found in more than four distinct histologies were filtered out, as these fusions were considered likely artifactual.
# checking if fusions are called in multiple histologies
# this will suggest these fusion calls are artifactual or
# commonly occurring
found_in_morethan_4 <- groupcount_fusion_calls(
  putative_driver_fusions,
  group = "broad_histology",
  numGroup = 4
) %>%
  arrange(desc(group.ct))

found_in_morethan_4
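The idea behind this check, counting the distinct histologies in which each fusion appears and flagging those above a threshold, can be sketched in base R. The data below are invented and the sketch is an assumption about the general logic, not the internals of `groupcount_fusion_calls()`.

```r
# Toy calls: one fusion seen in five invented histologies, another in one.
calls <- data.frame(
  FusionName = c(rep("GENEA--GENEB", 5), rep("GENEC--GENED", 2)),
  broad_histology = c("H1", "H2", "H3", "H4", "H5", "H1", "H1"),
  stringsAsFactors = FALSE
)

# Number of distinct groups in which each fusion is observed.
group_ct <- vapply(
  split(calls$broad_histology, calls$FusionName),
  function(h) length(unique(h)),
  integer(1)
)

# Fusions observed in more than four groups are candidates for removal.
flagged <- names(group_ct)[group_ct > 4]
flagged
```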
Other non-oncogenic (not annotated as oncogene/transcription factor/tumor suppressor gene/kinase) fusions that are recurrent fusions, called by both callers, and are unique to a specific histology in this filtered dataset are of interest as well.
However, non-oncogenic genes that are fused more than five times per sample were removed, as they were considered artifactual within our dataset from manual review.
# To scavenge back non-oncogenic recurrent/unique per broad histology fusions
fusion_calls <- aggregate_fusion_calls(
  fusion_calls,
  removeother = TRUE,
  filterAnnots = "LOCAL_REARRANGEMENT|LOCAL_INVERSION"
)

# Keep
# 1. Called by at least n callers
fusion_calls.summary <- called_by_n_callers(fusion_calls, numCaller = 2)

# OR
# 2. Found in at least n samples in each group
sample.count <- samplecount_fusion_calls(fusion_calls, numSample = 2, group = "broad_histology")

# Remove
# 1. non-oncogenic fusions that are in > numGroup
group.count <- groupcount_fusion_calls(fusion_calls, group = "broad_histology", numGroup = 1)

# 2. non-oncogenic multi-fused genes
fusion_recurrent5_per_sample <- fusion_multifused(fusion_calls, limitMultiFused = 5)

# filter fusion_calls to keep recurrent fusions from above sample.count and fusion_calls.summary
QCGeneFiltered_recFusion <- fusion_calls %>%
  dplyr::filter(FusionName %in% unique(c(sample.count$FusionName, fusion_calls.summary$FusionName)))

# filter QCGeneFiltered_recFusion to remove fusions found in more than 1 group and multifused gene per samples
QCGeneFiltered_recFusionUniq <- QCGeneFiltered_recFusion %>%
  dplyr::filter(!FusionName %in% group.count$FusionName) %>%
  dplyr::filter(!Gene1A %in% fusion_recurrent5_per_sample$GeneSymbol |
    !Gene2A %in% fusion_recurrent5_per_sample$GeneSymbol |
    !Gene1B %in% fusion_recurrent5_per_sample$GeneSymbol |
    !Gene2B %in% fusion_recurrent5_per_sample$GeneSymbol)

# Check for domain retention
QCGeneFiltered_recFusionUniq <- fusion_driver(
  standardFusioncalls = QCGeneFiltered_recFusionUniq,
  geneListReferenceDataTab = geneListReferenceDataTab,
  fusionReferenceDataTab = fusionReferenceDataTab,
  # don't filter because these are non-oncogenic fusions
  filterPutativeDriver = FALSE,
  checkDomainStatus = TRUE,
  annotated = TRUE
)
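As a rough illustration of the "called by at least n callers" criterion, the sketch below counts distinct callers per sample and fusion on invented data. The `Caller` column name and the caller labels are assumptions made for this example, and the code is not the implementation of `called_by_n_callers()`.

```r
# Toy calls; the Caller column name and its values are assumed for illustration.
calls <- data.frame(
  Sample = c("S1", "S1", "S2", "S2"),
  FusionName = c("GENEA--GENEB", "GENEA--GENEB", "GENEC--GENED", "GENEC--GENED"),
  Caller = c("STARFUSION", "ARRIBA", "ARRIBA", "ARRIBA"),
  stringsAsFactors = FALSE
)

# Build one key per sample/fusion pair and count its distinct callers.
key <- paste(calls$Sample, calls$FusionName, sep = "|")
caller_ct <- vapply(
  split(calls$Caller, key),
  function(x) length(unique(x)),
  integer(1)
)

# Keep sample/fusion pairs supported by at least two callers.
supported <- names(caller_ct)[caller_ct >= 2]
supported
```

Counting distinct callers, rather than rows, keeps a fusion reported twice by the same caller from passing the threshold.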
Here, we visualize the distribution of fusions within two different genes that are annotated as "Genic" in our standard fusion format. Fusions that are in-frame are predicted to translate into protein, so we also filter for only in-frame fusions for visualization.
putative_driver_fusions <- putative_driver_fusions %>%
  dplyr::bind_rows(QCGeneFiltered_recFusionUniq[, colnames(putative_driver_fusions)]) %>%
  # filter fusions found in more than 4 histologies
  dplyr::filter(
    !FusionName %in% found_in_morethan_4$FusionName,
    # remove intergenic and intragenic fusion
    BreakpointLocation == "Genic",
    Fusion_Type == "in-frame"
  ) %>%
  as.data.frame()

head(putative_driver_fusions)
The summary of fusions called provides an overview of the genomic rearrangements within a cohort in terms of intra- or inter-chromosomal changes and the biotypes of the genes fused. We also provide the distribution of kinase domains retained in the 5′ and 3′ genes, as well as the overall distribution of annotations in each group. Here we've used broad_histology to group our cohorts.
# plot summary
plot_summary(standardFusioncalls = putative_driver_fusions, groupby = "broad_histology")
Recurrent fusions and genes fused provide insights into subtypes of samples within a cohort.
Here we choose broad_histology as the grouping variable and plot the number (n) of participants (Kids_First_Participant_ID) with these fusions.
# recurrently fused genes
plot_recurrent_genes(
  standardFusioncalls = putative_driver_fusions,
  groupby = "broad_histology",
  countID = "Kids_First_Participant_ID",
  plotn = 20
)
# recurrent fusions
plot_recurrent_fusions(
  standardFusioncalls = putative_driver_fusions,
  groupby = "broad_histology",
  countID = "Kids_First_Participant_ID",
  plotn = 20
)
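The quantity shown in these recurrence plots can be thought of as a count of distinct participants per fusion, restricted to the top entries. The base R sketch below illustrates that tally on invented data; it is an assumption about the general idea and does not reproduce how the plotting functions compute or draw their counts.

```r
# Toy calls; participant IDs and fusion names are invented.
calls <- data.frame(
  FusionName = c("GENEA--GENEB", "GENEA--GENEB", "GENEA--GENEB", "GENEC--GENED"),
  Kids_First_Participant_ID = c("P1", "P1", "P2", "P3"),
  stringsAsFactors = FALSE
)

# Count distinct participants per fusion, so repeated calls in one
# participant are only counted once.
participant_ct <- vapply(
  split(calls$Kids_First_Participant_ID, calls$FusionName),
  function(ids) length(unique(ids)),
  integer(1)
)

# Keep the 20 most recurrent fusions (fewer if the data have fewer).
top <- head(sort(participant_ct, decreasing = TRUE), 20)
top
```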
sessionInfo()