In eolagbaju/ODER: Optimising the Definition of Expressed Regions

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
) # Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html

## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("RefManageR")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitr = citation("knitr")[1],
    RefManageR = citation("RefManageR")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    ODER = citation("ODER")[1]
)

Basics

Install `ODER`

R is an open-source statistical environment which can be easily modified to enhance its functionality via packages. r Biocpkg("ODER") is a R package available via the Bioconductor repository for packages. R can be installed on any operating system from CRAN after which you can install r Biocpkg("ODER") by using the following commands in your R session:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("eolagbaju/ODER")

## Check that you have a valid Bioconductor installation
BiocManager::valid()

Required knowledge

The expected input of r Biocpkg("ODER") is coverage in the form of BigWig files as well as, depending on the functionalility required by a specific user, junctions in form of a RangedSummarizedExperiments.

r Biocpkg("ODER") is based on many other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. The r Biocpkg("GenomicRanges") package is heavily used in r Biocpkg("ODER") while other packages like r Biocpkg("SummarizedExperiment") and r Biocpkg("derfinder"). Previous experience with these packages will help in the comprehension and use of r Biocpkg("ODER").

If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in this blog post. If you find the structure of a r Biocpkg("SummarizedExperiment") unclear, you might consider checking out this manual.

Asking for help

As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R and Bioconductor have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the ODER tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.

Citing `ODER`

We hope that r Biocpkg("ODER") will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!

## Citation info
citation("ODER")

Background

ODER is a packaged form of the method developed in the Zhang et al. 2020 publication: Incomplete annotation has a disproportionate impact on our understanding of Mendelian and complex neurogenetic disorders. The overarching aim of ODER is use RNA-sequencing data to define regions of unannotated expression (ERs) in the genome, optimise their definition, then link them to known genes.

ODER builds upon the method defined in r Biocpkg("derfinder") by improving the definition of ERs in a few ways. Firstly, rather than being a fixed value, the coverage cut off is optimised based on a set of user-inputted, gold-standard exons for a given set of samples. Secondly, ODER introduces a second optimisation parameter, the max region gap, which determines the number of base-pairs of gap between which ERs are merged. Thirdly, ERs can be connected to known genes through junction reads. This aids the interpretation of ERs and also allows their definition to be refined further using the intersection between ERs and junctions. For more detailed methods, please see the methods section of the original publication.

Unannotated ERs can provide evidence for the existence and location of novel exons, which are yet to be added within reference annotation. Improving the completeness of reference annotation can aid the interpretation of genetic variation. For example, the output of ODER can help to better interpret non-coding genetics variants that are found in the genome of Mendelian disease patients, poetentially leading to improvements in diagnosis rates.

Quick start to using to `ODER`

From the top-level ODER consists of 4 core functions, which are broken down internally into several smaller helper functions. These functions are expected to be run sequentially in the order presented below:

ODER() - Takes as input coverage in the form of BigWig files. Uses r Biocpkg("derfinder") to call contigous blocks of expression that we term expressed regions or ERs. ER definitions are optimised across a pair of parameters the mean coverage cut-off (MCC) and the max region gap (MRG) with respect to a user-inputted set of gold standard exons. The set of ERs for the optimised MCC and MRG are returned.
annotatER() - Takes as input the optimised set of ERs and a set of junctions. Finds overlaps between the ERs and junctions, thereby annotating ERs with the gene associated with it's corresponding junction. Also categorises ERs into "exon", "intron", "intergenic" or any combination of these three categories depending on the ERs overlap with existing annotation.
refine_ers() - Takes as input the optimised set of ERs annotated with junctions. Refines the ER definition based on the intersection between the ER and it's overlapping junction.
get_count_matrix() - Takes as input any set of GenomicRanges and a set of BigWig files. Returns a RangedSummarizedExperiment with assays containing the average coverage across each range. This function is intended to obtain the coverage across ERs to allow usage in downstream analyses such as differential expression.

Example

This is a basic example to show how you can use ODER. First, we need to download the example BigWig data required as input for ODER.

library("ODER")
library("magrittr")

# Download recount data in the form of BigWigs
gtex_metadata <- recount::all_metadata("gtex")

gtex_metadata <- gtex_metadata %>%
    as.data.frame() %>%
    dplyr::filter(project == "SRP012682")

url <- recount::download_study(
    project = "SRP012682",
    type = "samples",
    download = FALSE
)

# file_cache is an internal ODER function to cache files for
# faster repeated loading
bw_path <- file_cache(url[1])

bw_path

To get the optimum set of ERs from a BigWig file we can use the ODER() function.This will obtain the optimally defined ERs by finding the combination of MCC and MRG parameters that gives the lowest exon delta between the ERs and the inputted gold-standard exons. The MCC is minimum read depth that a base pair needs to have to be considered expressed. The MRG is the maximum number of base pairs between reads that fall below the MCC before you would not include it as part of the expressed region. Internally, gold-standard exons are obtained by finding the non-overlapping exons from the inputted reference annotation.

In this example, we demonstrate ODER() on a single unstranded Bigwig. However, in reality, it is likely that you will want to optimise the ER definitions across multiple BigWigs. It is worth noting that the arguments bw_pos and bw_neg in ODER() allow for the input of stranded BigWigs.

# load reference annotation from Ensembl
gtf_url <- paste0(
    "http://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens",
    "/Homo_sapiens.GRCh38.103.chr.gtf.gz"
)
gtf_path <- file_cache(gtf_url)
gtf_gr <- rtracklayer::import(gtf_path)
# As of rtracklayer 1.25.16, BigWig is not supported on Windows.
if (!xfun::is_windows()) {
    opt_ers <- ODER(
        bw_paths = bw_path, auc_raw = gtex_metadata[["auc"]][1],
        auc_target = 40e6 * 100, chrs = c("chr21"),
        genome = "hg38", mccs = c(2, 4, 6, 8, 10), mrgs = c(10, 20, 30),
        gtf = gtf_gr, ucsc_chr = TRUE, ignore.strand = TRUE,
        exons_no_overlap = NULL, bw_chr = "chr"
    )
}

Once we have the obtained the optimised set of ERs, we may consider plotting the exon delta across the various MCCs and MRGs. This can be useful to check the error associated with the definition of the set of optimised ERs. This error is measured through two metrics; the median exon delta and the number of ERs with exon delta equal to 0. The median exon delta represents the overall accuracy of all ER definitions, whereas the number of ERs with exon delta equal to 0 indicates the extent to which ER definitions precisely match overlapping gold-standard exon boundaries.

# visualise the spread of mccs and mrgs
if (!xfun::is_windows()) { # uses product of ODER
    plot_ers(opt_ers[["deltas"]], opt_ers[["opt_mcc_mrg"]])
}

Next, we will use annotatERs() to find the overlap between the ERs and junctions. Furthermore, annotatERs() will also classify ERs by their overlap with existing reference annotation into the categories; "exon", "intron" and "intergenic". This can be helpful for two reasons. Primarily, junctions can be used to inform which gene the ER is associated to. This gene-level association can be useful multiple downstream applications, such as novel exon discovery. Furthermore, the category of ER, in terms of whether it overlaps a exonic, intronic or intergenic region, can help determine whether the ER represents novel expression. For example, ERs solely overlapping intronic or intergenic regions and associated with a gene can be the indication of the expression of an unannotated exon.

To note, it is recommended that the inputted junctions are derived from the same samples or tissue as the BigWig files used to define ERs.

library(utils)
# running only chr21 to reduce runtime
chrs_to_keep <- c("21")

# prepare the txdb object to create a genomic state
# based on https://support.bioconductor.org/p/93235/
hg38_chrominfo <- GenomeInfoDb::getChromInfoFromUCSC("hg38")

new_info <- hg38_chrominfo$size[match(
    chrs_to_keep,
    GenomeInfoDb::mapSeqlevels(hg38_chrominfo$chrom, "Ensembl")
)]

names(new_info) <- chrs_to_keep
gtf_gr_tx <- GenomeInfoDb::keepSeqlevels(gtf_gr,
    chrs_to_keep,
    pruning.mode = "tidy"
)

GenomeInfoDb::seqlengths(gtf_gr_tx) <- new_info
GenomeInfoDb::seqlevelsStyle(gtf_gr_tx) <- "UCSC"
GenomeInfoDb::genome(gtf_gr_tx) <- "hg38"

ucsc_txdb <- GenomicFeatures::makeTxDbFromGRanges(gtf_gr_tx)
genom_state <- derfinder::makeGenomicState(txdb = ucsc_txdb)

# convert UCSC chrs format to Ensembl to match the ERs
ens_txdb <- ucsc_txdb
GenomeInfoDb::seqlevelsStyle(ens_txdb) <- "Ensembl"

# lung_junc_21_22 is an example data set of junctions
# obtained from recount3, derived from the lung tissue
# run annotatERs()
data(lung_junc_21_22, package = "ODER")
if (!xfun::is_windows()) { # uses product of ODER
    annot_ers <- annotatERs(
        opt_ers = head(opt_ers[["opt_ers"]], 100),
        junc_data = lung_junc_21_22,
        genom_state = genom_state,
        gtf = gtf_gr,
        txdb = ens_txdb
    )

    # print first 5 ERs
    annot_ers[1:5]
}

After we have annotated ERs with the overlapping junctions, optionally we can use refine_ers() to refine the starts and ends of the ERs based on the overlapping junctions. This will filter ERs for those which have either a single or two non-intersecting junctions overlapping. For the remaining ERs, refine_ers() will shave the ER definitions to the exon boundaries matching the overlapping junctions. This can be useful for downstream applications whereby the accuracy of the ER definition is extremely important. For example, the interpretion of variants in diagnostic setting.

if (!xfun::is_windows()) { # uses product of ODER
    refined_ers <- refine_ERs(annot_ers)

    refined_ers
}

Finally, we can generate an ER count matrix with get_count_matrix(). This function can flexibly be run at any stage of the ODER pipeline and it requires a set of GenomicRanges and BigWig paths as input. get_count_matrix() will return a RangedSummarizedExperiment which has assays filled with the mean coverage across each inputted range. Internally, get_count_matrix() relies on r Biocpkg("megadepth") to obtain coverage from BigWigs therefore megadepth::install_megadepth() must be executed at least once on the user's system before get_count_matrix().

# create sample metadata containing identifiers for each BigWig
run <- gtex_metadata[["run"]][[1]]
col_info <- as.data.frame(run)

# install megadepth
megadepth::install_megadepth()
if (!xfun::is_windows()) { # uses product of ODER
    er_count_matrix <- get_count_matrix(
        bw_paths = bw_path, annot_ers = annot_ers,
        cols = col_info
    )

    er_count_matrix
}

Reproducibility

The r Biocpkg("ODER") package r Citep(bib[["ODER"]]) was made possible thanks to:

R r Citep(bib[["R"]])
r Biocpkg("BiocStyle") r Citep(bib[["BiocStyle"]])
r CRANpkg("RefManageR") r Citep(bib[["RefManageR"]])
r CRANpkg("knitr") r Citep(bib[["knitr"]])
r CRANpkg("rmarkdown") r Citep(bib[["rmarkdown"]])
r CRANpkg("sessioninfo") r Citep(bib[["sessioninfo"]])
r CRANpkg("testthat") r Citep(bib[["testthat"]])

This package was developed using r BiocStyle::Biocpkg("biocthis").

Date the vignette was generated.

## Date the vignette was generated
Sys.time()

Wallclock time spent generating the vignette.

## Processing time in seconds
totalTime <- diff(c(startTime, Sys.time()))
round(totalTime, digits = 3)

R session information.

## Session info
library("sessioninfo")
options(width = 120)
session_info()

eolagbaju/ODER documentation built on Dec. 20, 2021, 5:21 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

eolagbaju/ODER
Optimising the Definition of Expressed Regions

In eolagbaju/ODER: Optimising the Definition of Expressed Regions

Basics

Install `ODER`

Required knowledge

Asking for help

Citing `ODER`

Background

Quick start to using to `ODER`

Example

Reproducibility

R Package Documentation

Browse R Packages

We want your feedback!

eolagbaju/ODER Optimising the Definition of Expressed Regions

In eolagbaju/ODER: Optimising the Definition of Expressed Regions

Basics

Install ODER

Required knowledge

Asking for help

Citing ODER

Background

Quick start to using to ODER

Example

Reproducibility

R Package Documentation

Browse R Packages

We want your feedback!

eolagbaju/ODER
Optimising the Definition of Expressed Regions

Install `ODER`

Citing `ODER`

Quick start to using to `ODER`