knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html )
## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library("knitcitations") ## Load knitcitations with a clean bibliography cleanbib() cite_options(hyperlink = "to.doc", citation_format = "text", style = "html") ## Write bibliography information bib <- c( R = citation(), BiocStyle = citation("BiocStyle")[1], knitcitations = citation("knitcitations")[1], knitr = citation("knitr")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], dasper = citation("dasper")[1] ) write.bibtex(bib, file = "dasper.bib")
dasper
R
is an open-source statistical environment which can be easily modified to enhance its functionality via packages. r Biocpkg("dasper")
is a R
package available via the Bioconductor repository for packages. R
can be installed on any operating system from CRAN after which you can install r Biocpkg("dasper")
by using the following commands in your R
session:
if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("dzhang32/dasper") ## Check that you have a valid Bioconductor installation BiocManager::valid()
The expected input of r Biocpkg("dasper")
are junctions reads (e.g. directly outputted from an aligner such as STAR or extracted from a BAM file (e.g. using megadepth) and coverage in the form of BigWig files (which can be generated from BAM files using megadepth or RSeQC). r Biocpkg("dasper")
is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-sequencing data. The packages r Biocpkg("SummarizedExperiment")
and r Biocpkg("GenomicRanges")
are used throughout, therefore familiarity with these packages will greatly help in interpreting the output of r Biocpkg("dasper")
.
If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in this blog post. Or if you find the structure of a r Biocpkg("SummarizedExperiment")
unclear, you might consider checking out this manual.
As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R
and Bioconductor
have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the dasper
tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.
dasper
We hope that r Biocpkg("dasper")
will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
## Citation info citation("dasper")
knitr::include_graphics("dasper_workflow.png")
The above workflow diagram gives a top-level overview of the available functions within dasper
and describes the order in which they are intended to be run. This are broadly split into 4 categories:
RangedSummarizedExperiment
object, annotate your junctions using reference annotation, filter out junctions for those that likely originate from technical noise and normalize your junction counts to allow for comparison between samples.plot_sashimi
. This enables you to select a gene, transcript or region of interest and plot the splicing and coverage across that region in the form of sashimi plot.dasper
includes wrapper functions for the 3 major analysis steps in the workflow - processing junction data, processing coverage data, then performing outlier detection. If you are familiar with dasper
or uninterested with the intermediates, you can go from start to finish using these wrappers.
First, we need to install the system dependency megadepth, which is required for the loading of coverage from BigWig files. The easiest way to do this is to run install_megadepth
from the r Biocpkg("megadepth")
megadepth::install_megadepth()
dasper
requires reference annotation that can either be inputted as a GTF
path or pre-loaded into a TxDb
object as shown below.
# use GenomicState to load txdb (GENCODE v31) ref <- GenomicState::GenomicStateHub(version = "31", genome = "hg38", filetype = "TxDb")[[1]]
RNA-seq data in the form of junctions and BigWig files are the required raw input to dasper
. For illustration, in this vignette we will use an BigWig file from the GTEx v6 data publicly hosted by recount2. In reality, users should use the BigWig files that correspond to the same samples as those your junction data originate from.
# Obtain the urls to the remotely hosted GTEx BigWig files url <- recount::download_study( project = "SRP012682", type = "samples", download = FALSE ) # cache the file using BiocFileCache for faster retrieval bw_path <- dasper:::.file_cache(url[1])
A small set of example junctions are included within the package, which can be used for testing the functionality of dasper
.
library(dasper)
junctions_example
dasper
Here, we run the 3 wrapper functions (junction_process
, coverage_process
and outlier_process
) on junctions_example
. These functions are designed to be compatible with usage of the %>%
pipe from tidyverse. The time-limiting steps of this analysis can be run parallel using r Biocpkg("BiocParallel")
as shown.
library(magrittr) outlier_scores <- junction_process(junctions_example, ref) %>% coverage_process(ref, coverage_paths_case = rep(bw_path, 2), coverage_paths_control = rep(bw_path, 3), bp_param = BiocParallel::MulticoreParam(5) ) %>% outlier_process(bp_param = BiocParallel::MulticoreParam(5))
In this section, we will run each of the 1-9 functions shown in workflow. This can be helpful for users that want to understand or modify the intermediates of the dasper
pipeline or are only interested in a executing a specific step (e.g. annotating junctions using junction_annot
). If you want to follow along and run this code in your R session, make sure you have followed the instructions in the section setup.
The first step is to load in patient (and control) junction data using junction_load
. By default, this expects junction files in the format outputted by the STAR aligner (SJ.out). If the user's junctions are in a different format you can accommodate for this by adjusting the load_func
argument. The output of this will be junctions stored in a RangedSummarizedExperiment
object required for the downstream dasper
functions.
Additionally, users can choose to add a set of control junctions from GTEx v6 (publicly available through recount2) using the argument controls
. This may be useful if the user only has RNA-seq data from only a single or small number of patients, which is often the case in a diagnostic setting. In this example, we arbitrarily chose to use the GTEx junctions from the fibroblast tissue. However, we would recommend matching the control tissue to the tissue from which your patient RNA-seq data was derived. The current available tissues that be selected via controls
are "fibroblasts", "lymphocytes", "skeletal_muscle" and "whole_blood".
junctions_example_1_path <- system.file( "extdata", "junctions_example_1.txt", package = "dasper", mustWork = TRUE ) junctions_example_2_path <- system.file( "extdata", "junctions_example_2.txt", package = "dasper", mustWork = TRUE ) # only keep chromosomes 21 + 22 for speed junctions <- junction_load( junction_paths = c( junctions_example_1_path, junctions_example_2_path ), controls = "fibroblasts", chrs = c("chr21", "chr22") ) junctions
Next, we will annotate our junctions using junction_annot
. Using reference annotation inputted via ref
, this will classify junctions into the categories "annotated", "novel_acceptor", "novel_donor", "novel_combo", "novel_exon_skip", "ambig_gene" and "unannotated". Additionally, the genes/transcripts/exons that each junction overlaps will be stored in the SummarizedExperiment::rowData
. This information is necessary for the downstream dasper
functions but can also be useful independently if user's are interested in for example, only retrieving the annotated/unannotated junctions.
# take only the first 5 samples (2 cases, 3 controls) # to increase the speed of the vignette junctions <- junctions[, 1:5] junctions <- junction_annot(junctions, ref) head(SummarizedExperiment::rowData(junctions))
We will then filter our junctions using junction_filter
. You have the option of choosing to filter by count, width, type (annotation category) and whether junctions overlap a particular genomic region. The default settings shown below filter only by count and require a junction to have at least 5 reads in at least 1 sample to be kept in.
junctions <- junction_filter(junctions, count_thresh = c(raw = 5), n_samp = c(raw = 1) ) junctions
We can then normalize our raw junctions counts into a proportion-spliced-in using junction_norm
. This will first cluster each junction by finding all other junctions that share an acceptor or donor site with it. Then, calculate the normalized counts by dividing the number of reads supporting each junction with the total number of reads in it's associated cluster.
junctions <- junction_norm(junctions) # save a separate object for plotting junctions_normed <- junctions # show the raw counts tail(SummarizedExperiment::assays(junctions)[["raw"]]) # and those after normalisation tail(SummarizedExperiment::assays(junctions)[["norm"]])
Finally, we need to score each patient junction by how much it's (normalized) counts deviate from the count distribution of the same junction in the control samples using junction_score
. By default, this uses a z-score measure, however this can be modified to a user-inputted function by adjusting the argument score_func
. These junction scores with be stored in the SummarizedExperiment::assays
"score".
junctions <- junction_score(junctions) names(SummarizedExperiment::assays(junctions))
We then move on to loading and normalising the coverage across regions associated with each junction using coverage_norm
. Namely, these are the 2 exonic regions flanking each junction and the intron inbetween. Given that there are 3 regions for each junction and we need to obtain coverage for every junction from every sample, this step can be very computationally intense to run. Here, dasper
allows parrallelisation across samples using r Biocpkg("BiocParallel")
and uses the tool megadepth developed by Chris Wilks, which is significantly faster than other tools (rtracklayer
and pyBigWig
) for loading coverage from BigWig files (see runtime comparison).
coverage <- coverage_norm( junctions, ref, coverage_paths_case = rep(bw_path, 2), coverage_paths_control = rep(bw_path, 3), bp_param = BiocParallel::SerialParam() )
coverage_score
can then be used to compare the coverage associated with each junction to the coverage distribution corresponding to the same region in controls. After which, junctions should have the SummarizedExperiment::assays
"coverage_score".
junctions <- coverage_score(junctions, coverage) names(SummarizedExperiment::assays(junctions))
The last step in the dasper
pipeline is to use the "scores" and "coverage scores" as input into an outlier detection model. This involves using outlier_detect
to score each junction in each sample by how aberrant it looks. The outputted scores will be stored in the SummarizedExperiment::assays
"outlier_score". This step can be can be parallelised using r Biocpkg("BiocParallel")
. The fit of the isolation forest
may fluctuate depending on it's intial random_state
. For reproducibility you can avoid this by setting the parameter random_state
to an fixed integer (for more details on the isolation forest model see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html).
junctions <- outlier_detect( junctions, bp_param = BiocParallel::SerialParam(), random_state = 32L ) names(SummarizedExperiment::assays(junctions))
Finally, outlier_aggregate
will aggregate the junction-level "outlier_scores" to a cluster level. The returned outlier_scores
object contains a DataFrame
with each row detailing a cluster in terms of how aberrant it looks. The gene_id_cluster
column describes the gene linked to each cluster, derived from the gene annotation returned by junction_annot
. Splicing events are ranked in the rank
column with 1 referring to the most aberrant splicing event in that patient. Details of the specific junctions that look the most aberrant for each cluster can be found in the junctions
column.
outlier_scores <- outlier_aggregate(junctions, samp_id_col = "samp_id", ) head(outlier_scores)
In order to help further inspection of the candidate genes with abberrant splicing returned by dasper
, we also include functions to visualise splicing across genes/transcripts/regions of interest. For example, you could visualise the gene with the most aberrant cluster in samp_1
.
# take gene_id of the cluster ranked 1 gene_id <- unlist(outlier_scores[["gene_id_cluster"]][1]) plot_sashimi( junctions_normed, ref, case_id = list(samp_id = "samp_1"), gene_tx_id = gene_id, count_label = FALSE )
Often regions of splicing can be very complex within a gene/transcript. In which case, you may consider zooming in on a specific region of interest using the argument region
. Furthermore, the thickness of the lines represents the normalized counts of each junction however in this way, it can be difficult to accurately differentiate junctions that have similar counts. Users may want to add a label representing the count of each junction using the argument count_label
.
plot_sashimi(junctions_normed, ref, case_id = list(samp_id = "samp_1"), gene_tx_id = gene_id, region = GenomicRanges::GRanges("chr22:42620000-42630000"), count_label = TRUE )
The r Biocpkg("dasper")
package r citep(bib[["dasper"]])
was made possible thanks to:
r citep(bib[["R"]])
r Biocpkg("BiocStyle")
r citep(bib[["BiocStyle"]])
r CRANpkg("knitcitations")
r citep(bib[["knitcitations"]])
r CRANpkg("knitr")
r citep(bib[["knitr"]])
r CRANpkg("rmarkdown")
r citep(bib[["rmarkdown"]])
r CRANpkg("sessioninfo")
r citep(bib[["sessioninfo"]])
r CRANpkg("testthat")
r citep(bib[["testthat"]])
This package was developed using r BiocStyle::Githubpkg("lcolladotor/biocthis")
.
Code for creating the vignette
## Create the vignette library("rmarkdown") system.time(render("dasper.Rmd", "BiocStyle::html_document")) ## Extract the R code library("knitr") knit("dasper.Rmd", tangle = TRUE)
## Clean up file.remove("dasper.bib")
Date the vignette was generated.
## Date the vignette was generated Sys.time()
Wallclock time spent generating the vignette.
## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3)
R
session information.
## Session info library("sessioninfo") options(width = 120) session_info()
This vignette was generated using r Biocpkg("BiocStyle")
r citep(bib[["BiocStyle"]])
with r CRANpkg("knitr")
r citep(bib[["knitr"]])
and r CRANpkg("rmarkdown")
r citep(bib[["rmarkdown"]])
running behind the scenes.
Citations made with r CRANpkg("knitcitations")
r citep(bib[["knitcitations"]])
.
## Print bibliography knitcitations::bibliography()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.