knitr::opts_chunk$set( collapse = TRUE, comment = "#>", crop = NULL ) # Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
## Track time spent on making the vignette startTime <- Sys.time() ## Bib setup library("RefManageR") ## Write bibliography information bib <- c( R = citation(), BiocStyle = citation("BiocStyle")[1], knitr = citation("knitr")[1], RefManageR = citation("RefManageR")[1], rmarkdown = citation("rmarkdown")[1], sessioninfo = citation("sessioninfo")[1], testthat = citation("testthat")[1], ODER = citation("ODER")[1] )
ODER
R
is an open-source statistical environment which can be easily modified to
enhance its functionality via packages. r Biocpkg("ODER")
is a R
package
available via the Bioconductor repository for
packages. R
can be installed on any operating system from
CRAN after which you can install
r Biocpkg("ODER")
by using the following commands in your R
session:
if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("eolagbaju/ODER") ## Check that you have a valid Bioconductor installation BiocManager::valid()
The expected input of r Biocpkg("ODER")
is coverage in the form of BigWig
files as well as, depending on the functionalility required by a specific user,
junctions in form of a RangedSummarizedExperiments
.
r Biocpkg("ODER")
is based on many other packages and in particular in those
that have implemented the infrastructure needed for dealing with RNA-seq data.
The r Biocpkg("GenomicRanges")
package is heavily used in r Biocpkg("ODER")
while other packages like r Biocpkg("SummarizedExperiment")
and
r Biocpkg("derfinder")
. Previous experience with these packages will help in
the comprehension and use of r Biocpkg("ODER")
.
If you are asking yourself the question "Where do I start using Bioconductor?"
you might be interested in
this blog post.
If you find the structure of a r Biocpkg("SummarizedExperiment")
unclear, you
might consider checking out this manual.
As package developers, we try to explain clearly how to use our packages and in
which order to use the functions. But R
and Bioconductor
have a steep
learning curve so it is critical to learn where to ask for help. The blog post
quoted above mentions some but we would like to highlight the
Bioconductor support site as the main
resource for getting help: remember to use the ODER
tag and check
the older posts. Other alternatives
are available such as creating GitHub issues and tweeting. However, please note
that if you want to receive help you should adhere to the posting guidelines. It is
particularly critical that you provide a small reproducible example and your
session information so package developers can track down the source of the error.
ODER
We hope that r Biocpkg("ODER")
will be useful for your research. Please use
the following information to cite the package and the overall approach. Thank you!
## Citation info citation("ODER")
ODER
is a packaged form of the method developed in the Zhang et al. 2020 publication:
Incomplete annotation has a disproportionate impact on our understanding of Mendelian and complex neurogenetic disorders.
The overarching aim of ODER
is use RNA-sequencing data to define regions of
unannotated expression (ERs) in the genome, optimise their definition, then link
them to known genes.
ODER
builds upon the method defined in r Biocpkg("derfinder")
by improving
the definition of ERs in a few ways. Firstly, rather than being a fixed value,
the coverage cut off is optimised based on a set of user-inputted, gold-standard
exons for a given set of samples. Secondly, ODER
introduces a second
optimisation parameter, the max region gap, which determines the number of
base-pairs of gap between which ERs are merged. Thirdly, ERs can be connected to
known genes through junction reads. This aids the interpretation of ERs and also
allows their definition to be refined further using the intersection between ERs
and junctions. For more detailed methods, please see the methods section of the
original publication.
Unannotated ERs can provide evidence for the existence and location of novel
exons, which are yet to be added within reference annotation. Improving the
completeness of reference annotation can aid the interpretation of genetic
variation. For example, the output of ODER
can help to better interpret
non-coding genetics variants that are found in the genome of Mendelian disease
patients, poetentially leading to improvements in diagnosis rates.
ODER
From the top-level ODER consists of 4 core functions, which are broken down internally into several smaller helper functions. These functions are expected to be run sequentially in the order presented below:
ODER()
- Takes as input coverage in the form of BigWig files. Uses
r Biocpkg("derfinder")
to call contigous blocks of expression that we term
expressed regions or ERs. ER definitions are optimised across a pair of
parameters the mean coverage cut-off (MCC) and the max region gap (MRG) with
respect to a user-inputted set of gold standard exons. The set of ERs for the
optimised MCC and MRG are returned. annotatER()
- Takes as input the optimised set of ERs and a set of
junctions. Finds overlaps between the ERs and junctions, thereby annotating ERs
with the gene associated with it's corresponding junction. Also categorises ERs
into "exon", "intron", "intergenic" or any combination of these three categories
depending on the ERs overlap with existing annotation. refine_ers()
- Takes as input the optimised set of ERs annotated with
junctions. Refines the ER definition based on the intersection between the ER
and it's overlapping junction. get_count_matrix()
- Takes as input any set of GenomicRanges
and a set of
BigWig
files. Returns a RangedSummarizedExperiment
with assays
containing
the average coverage across each range. This function is intended to obtain the
coverage across ERs to allow usage in downstream analyses such as differential
expression. This is a basic example to show how you can use ODER. First, we need to download
the example BigWig
data required as input for ODER
.
library("ODER") library("magrittr") # Download recount data in the form of BigWigs gtex_metadata <- recount::all_metadata("gtex") gtex_metadata <- gtex_metadata %>% as.data.frame() %>% dplyr::filter(project == "SRP012682") url <- recount::download_study( project = "SRP012682", type = "samples", download = FALSE ) # file_cache is an internal ODER function to cache files for # faster repeated loading bw_path <- file_cache(url[1]) bw_path
To get the optimum set of ERs from a BigWig file we can use the ODER()
function.This will obtain the optimally defined ERs by finding the combination
of MCC and MRG parameters that gives the lowest exon delta between the ERs and
the inputted gold-standard exons. The MCC is minimum read depth that a base pair
needs to have to be considered expressed. The MRG is the maximum number of base
pairs between reads that fall below the MCC before you would not include it as
part of the expressed region. Internally, gold-standard exons are obtained by
finding the non-overlapping exons from the inputted reference annotation.
In this example, we demonstrate ODER()
on a single unstranded Bigwig
.
However, in reality, it is likely that you will want to optimise the ER
definitions across multiple BigWigs
. It is worth noting that the arguments
bw_pos
and bw_neg
in ODER()
allow for the input of stranded BigWigs
.
# load reference annotation from Ensembl gtf_url <- paste0( "http://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens", "/Homo_sapiens.GRCh38.103.chr.gtf.gz" ) gtf_path <- file_cache(gtf_url) gtf_gr <- rtracklayer::import(gtf_path) # As of rtracklayer 1.25.16, BigWig is not supported on Windows. if (!xfun::is_windows()) { opt_ers <- ODER( bw_paths = bw_path, auc_raw = gtex_metadata[["auc"]][1], auc_target = 40e6 * 100, chrs = c("chr21"), genome = "hg38", mccs = c(2, 4, 6, 8, 10), mrgs = c(10, 20, 30), gtf = gtf_gr, ucsc_chr = TRUE, ignore.strand = TRUE, exons_no_overlap = NULL, bw_chr = "chr" ) }
Once we have the obtained the optimised set of ERs, we may consider plotting the exon delta across the various MCCs and MRGs. This can be useful to check the error associated with the definition of the set of optimised ERs. This error is measured through two metrics; the median exon delta and the number of ERs with exon delta equal to 0. The median exon delta represents the overall accuracy of all ER definitions, whereas the number of ERs with exon delta equal to 0 indicates the extent to which ER definitions precisely match overlapping gold-standard exon boundaries.
# visualise the spread of mccs and mrgs if (!xfun::is_windows()) { # uses product of ODER plot_ers(opt_ers[["deltas"]], opt_ers[["opt_mcc_mrg"]]) }
Next, we will use annotatERs()
to find the overlap between the ERs and
junctions. Furthermore, annotatERs()
will also classify ERs by their overlap
with existing reference annotation into the categories; "exon", "intron" and
"intergenic". This can be helpful for two reasons. Primarily, junctions can be
used to inform which gene the ER is associated to. This gene-level association
can be useful multiple downstream applications, such as novel exon discovery.
Furthermore, the category of ER, in terms of whether it overlaps a exonic,
intronic or intergenic region, can help determine whether the ER represents
novel expression. For example, ERs solely overlapping intronic or intergenic
regions and associated with a gene can be the indication of the expression of
an unannotated exon.
To note, it is recommended that the inputted junctions are derived from the same
samples or tissue as the BigWig
files used to define ERs.
library(utils) # running only chr21 to reduce runtime chrs_to_keep <- c("21") # prepare the txdb object to create a genomic state # based on https://support.bioconductor.org/p/93235/ hg38_chrominfo <- GenomeInfoDb::getChromInfoFromUCSC("hg38") new_info <- hg38_chrominfo$size[match( chrs_to_keep, GenomeInfoDb::mapSeqlevels(hg38_chrominfo$chrom, "Ensembl") )] names(new_info) <- chrs_to_keep gtf_gr_tx <- GenomeInfoDb::keepSeqlevels(gtf_gr, chrs_to_keep, pruning.mode = "tidy" ) GenomeInfoDb::seqlengths(gtf_gr_tx) <- new_info GenomeInfoDb::seqlevelsStyle(gtf_gr_tx) <- "UCSC" GenomeInfoDb::genome(gtf_gr_tx) <- "hg38" ucsc_txdb <- GenomicFeatures::makeTxDbFromGRanges(gtf_gr_tx) genom_state <- derfinder::makeGenomicState(txdb = ucsc_txdb) # convert UCSC chrs format to Ensembl to match the ERs ens_txdb <- ucsc_txdb GenomeInfoDb::seqlevelsStyle(ens_txdb) <- "Ensembl" # lung_junc_21_22 is an example data set of junctions # obtained from recount3, derived from the lung tissue # run annotatERs() data(lung_junc_21_22, package = "ODER") if (!xfun::is_windows()) { # uses product of ODER annot_ers <- annotatERs( opt_ers = head(opt_ers[["opt_ers"]], 100), junc_data = lung_junc_21_22, genom_state = genom_state, gtf = gtf_gr, txdb = ens_txdb ) # print first 5 ERs annot_ers[1:5] }
After we have annotated ERs with the overlapping junctions, optionally we can
use refine_ers()
to refine the starts and ends of the ERs based on the
overlapping junctions. This will filter ERs for those which have either a single
or two non-intersecting junctions overlapping. For the remaining ERs,
refine_ers()
will shave the ER definitions to the exon boundaries matching the
overlapping junctions. This can be useful for downstream applications whereby
the accuracy of the ER definition is extremely important. For example, the
interpretion of variants in diagnostic setting.
if (!xfun::is_windows()) { # uses product of ODER refined_ers <- refine_ERs(annot_ers) refined_ers }
Finally, we can generate an ER count matrix with get_count_matrix()
. This
function can flexibly be run at any stage of the ODER
pipeline and it requires
a set of GenomicRanges
and BigWig
paths as input. get_count_matrix()
will
return a RangedSummarizedExperiment
which has assays
filled with the mean
coverage across each inputted range. Internally, get_count_matrix()
relies on
r Biocpkg("megadepth")
to obtain coverage from BigWigs
therefore
megadepth::install_megadepth()
must be executed at least once on the user's
system before get_count_matrix()
.
# create sample metadata containing identifiers for each BigWig run <- gtex_metadata[["run"]][[1]] col_info <- as.data.frame(run) # install megadepth megadepth::install_megadepth() if (!xfun::is_windows()) { # uses product of ODER er_count_matrix <- get_count_matrix( bw_paths = bw_path, annot_ers = annot_ers, cols = col_info ) er_count_matrix }
The r Biocpkg("ODER")
package r Citep(bib[["ODER"]])
was made possible
thanks to:
r Citep(bib[["R"]])
r Biocpkg("BiocStyle")
r Citep(bib[["BiocStyle"]])
r CRANpkg("RefManageR")
r Citep(bib[["RefManageR"]])
r CRANpkg("knitr")
r Citep(bib[["knitr"]])
r CRANpkg("rmarkdown")
r Citep(bib[["rmarkdown"]])
r CRANpkg("sessioninfo")
r Citep(bib[["sessioninfo"]])
r CRANpkg("testthat")
r Citep(bib[["testthat"]])
This package was developed using r BiocStyle::Biocpkg("biocthis")
.
Date the vignette was generated.
## Date the vignette was generated Sys.time()
Wallclock time spent generating the vignette.
## Processing time in seconds totalTime <- diff(c(startTime, Sys.time())) round(totalTime, digits = 3)
R
session information.
## Session info library("sessioninfo") options(width = 120) session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.