In csoneson/jcc: Calculate JCC Scores

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This vignette describes how to use the jcc package to calculate junction coverage compatibility (JCC_ scores [@soneson2018jcc] for a given set of genes. In order to run the entire workflow, the following files are required, and will be used as described below:

A BAM file with reads aligned to a reference genome. The package has been tested with BAM files generated by the STAR aligner [@dobin2013star], but files generated by other genome aligners should also work. The reads in the BAM file will be used to fit the fragment bias model with r Biocpkg("alpine") [@love2016alpine] as well as to obtain the predicted transcript coverage profiles.
A data.frame with the number of uniquely mapping and multimapping reads aligned across each annotated splice junction. Below we show how to generate a correctly formatted data frame from the SJ.out.tab text file generated by STAR.
A BSgenome object for the genome that the reads in the BAM file were aligned to.
A gtf file corresponding to the BSgenome and the BAM file.
A transcript-to-gene conversion data.frame with at least two columns, named tx and gene. Additional columns providing, e.g., gene symbols can be included.
A data.frame with estimated transcript abundances, with three columns named transcript, TPM and count.
If you want to create gene summary plots, including an illustration of the observed coverage in a genomic region, you also need to convert the BAM file into a bigWig file. This can be done e.g. as follows:

```{bash, eval=FALSE} bedtools genomecov -split -ibam -bg > bedGraphToBigWig

where you should replace the text in `<>` by the appropriate file names. 
The `<chrNameLength.txt>` is a text file with the length of each chromosome. 
If you created a genome index with STAR, the index directory contains such a
file.
The information can also be extracted from the header of the BAM file. 

The workflow has mainly been tested with reference files from Ensembl, and may
need to be adjusted to work with reference files in other formats.

The general procedure for estimating junction coverage compatibility (JCC)
scores for a set of genes is outlined in the figure below.
Please refer to the paper [@soneson2018jcc] for a more detailed description and
motivation of the different steps.

<img src="study_design.png" alt="drawing" width="400px"/>

In the rest of this vignette, we will illustrate each step using a small example
data set provided with the package.

# Fit fragment bias model

The first step in the process is to fit a fragment bias model using the 
`r Biocpkg("alpine")` package.
The `fitAlpineBiasModel()` function provides a wrapper containing the necessary
steps.
Note that the parameters of this function may need to be adapted to be
appropriate for a specific data set. 
In particular, since the example data set only contains a small number of reads
and genes, we set `minCount` and `minLength` to small values to make sure that
some genes will be retained and used to fit the fragment bias model.
However, in a real data set these arguments can often be set to higher values. 
Also, the `BSgenome` object (here, `Hsapiens` from the
`BSgenome.Hsapiens.NCBI.GRCh38` package) should be replaced with the appropriate
object for your organism.
We refer to the help file and the `r Biocpkg("alpine")` package for more
detailed explanations of each parameter.

```r
## Load necessary packages
suppressPackageStartupMessages({
  library(jcc)
  library(dplyr)
  library(BSgenome.Hsapiens.NCBI.GRCh38)
})

## Load gtf file and bam file with aligned reads
gtf <- system.file("extdata/Homo_sapiens.GRCh38.90.chr22.gtf.gz", 
                   package = "jcc")
bam <- system.file("extdata/reads.chr22.bam", package = "jcc")

## Fit fragment bias model
biasMod <- fitAlpineBiasModel(gtf = gtf, bam = bam, organism = "Homo_sapiens",
                              genome = Hsapiens, genomeVersion = "GRCh38",
                              version = 90, minLength = 230, maxLength = 7000,
                              minCount = 10, maxCount = 10000, 
                              subsample = TRUE, nbrSubsample = 30, seed = 1,
                              minSize = NULL, maxSize = 220, 
                              verbose = TRUE)

The output of fitAlpineBiasModel is a list with three objects:

biasModel: the fitted fragment bias model from r Biocpkg("alpine").
exonsByTx: a GRangesList, grouping exons by transcript.
transcripts: a GRanges object with all reference transcripts.

names(biasMod)

biasMod$exonsByTx

biasMod$transcripts

Predict transcript coverage profiles

Next, we use the fitted fragment bias model to predict the coverage profile for each transcript in a set of specified genes. The genes argument can be set to NULL to predict the coverage profiles for all transcripts in the annotation catalog. Note, however, that this is computationally demanding if the number of transcripts is large. The nCores argument can be increased to predict coverage profiles for several transcripts in parallel.

## Load transcript-to-gene conversion table
tx2gene <- readRDS(system.file("extdata/tx2gene.sub.rds", package = "jcc"))
head(tx2gene)

## Predict transcript coverage profiles
predCovProfiles <- predictTxCoverage(biasModel = biasMod$biasModel, 
                                     exonsByTx = biasMod$exonsByTx, 
                                     bam = bam, tx2gene = tx2gene, 
                                     genome = Hsapiens,
                                     genes = c("ENSG00000070371",
                                               "ENSG00000093010"), 
                                     nCores = 1, verbose = TRUE)

length(predCovProfiles)

The output of predictTxCoverage is a list with one element per reference transcript annotated to any of the selected genes. Each element, in turn, is a list with five components:

predCov: an Rle with the predicted coverage at each position in the transcript.
strand: the strand of the transcript.
junctionCov: a GRanges object with the location and predicted coverage of each junction in the transcript.
aveFragLength: the estimated average fragment length.
note: either 'covOK', 'covError' or 'covNA'. If 'covOK', r Biocpkg("alpine") was able to predict a coverage profile. If 'covError' or 'covNA', the coverage profile could not be predicted, most likely because the transcript is shorter than the fragment length or because no reads in the BAM file overlapped the transcript. In both these cases, we impose a uniform coverage profile for the transcript.

predCovProfiles[2]

Scale transcript coverage profiles

Based on the predicted transcript coverage profiles obtained above, and a set of estimated transcript abundances, we next determine the number of reads predicted to map across each annotated splice junction.

## Read transcript abundance estimates
txQuants <- readRDS(system.file("extdata/quant.sub.rds", package = "jcc"))

## Scale predicted coverages based on the estimated abundances
txsc <- scaleTxCoverages(txCoverageProfiles = predCovProfiles, 
                         txQuants = txQuants, tx2gene = tx2gene,
                         strandSpecific = TRUE, methodName = "Salmon", 
                         verbose = TRUE)

scaleTxCoverages returns a list with two elements:

junctionPredCovs: A data.frame with the predicted coverage for each junction, scaled by the estimated transcript abundances and summarized across all transcripts.
txQuants: A data.frame with estimated transcript abundances, including information about whether or not the coverage profile could be predicted ('covOK') or if a uniform coverage was imposed ('covError' or 'covNA').

names(txsc)

head(txsc$junctionPredCovs)

head(txsc$txQuants)

Combine observed and predicted junction coverages

In order to calculate JCC scores, we need to match the predicted junction coverages obtained above with the observed coverages obtained by aligning the reads to the genome. Assuming the reads were aligned to the genome with STAR, we have a text file (whose name ends with SJ.out.tab) containing the number of reads spanning each annotated junction. The code below shows how to read this file, add column names and combine these observed junction coverages with the predicted ones obtained above. If the reads were aligned to the genome with another aligner than STAR, the observed junction coverages may need to be determined separately and formatted in the appropriate way.

## Read the SJ.out.tab file from STAR, add column names and encode the strand
## information as +/-
jcov <- read.delim(system.file("extdata/sub.SJ.out.tab", package = "jcc"),
                   header = FALSE, as.is = TRUE) %>%
  setNames(c("seqnames", "start", "end", "strand", "motif", "annot",
             "uniqreads", "mmreads", "maxoverhang")) %>% 
  dplyr::mutate(strand = replace(strand, strand == 1, "+")) %>%
  dplyr::mutate(strand = replace(strand, strand == 2, "-")) %>%
  dplyr::select(seqnames, start, end, strand, uniqreads, mmreads) %>%
  dplyr::mutate(seqnames = as.character(seqnames))

head(jcov)

## Combine the observed and predicted junction coverages
combCov <- combineCoverages(junctionCounts = jcov, 
                            junctionPredCovs = txsc$junctionPredCovs,
                            txQuants = txsc$txQuants)

The output of combineCoverages is a list with three elements:

junctionCovs: A data.frame with the observed and predicted coverages for each junction, aggregated across all transcripts. The predCov column contains the predicted number of spanning reads, and the uniqreads column will be used to represent the observed number of junction-spanning reads.
txQuants: A data.frame with estimated transcript abundances, including information about whether or not the coverage profile could be predicted ('covOK') or if a uniform coverage was imposed ('covError' or 'covNA').
geneQuants: A data.frame with estimated gene abundances as well as the total number uniquely mapping and multi-mapping reads spanning any of the junctions in the gene.

names(combCov)

head(combCov$junctionCovs)

head(combCov$txQuants)

head(combCov$geneQuants)

Calculate JCC scores

Finally, we calculate scaled predicted junction coverages and summarize the differences between the observed and predicted junction coverages for each gene by the JCC (junction coverage compatibility) scores.

## Calculate a JCC score for each gene
jcc <- calculateJCCScores(junctionCovs = combCov$junctionCovs, 
                          geneQuants = combCov$geneQuants)

head(jcc$junctionCovs)

head(jcc$geneScores)

calculateJCCScores returns a list with two elements:

junctionCovs: A data.frame with observed and predicted junction coverages, as well as scaled predicted coverages (scaledCov column) and JCC scores for the corresponding gene (jccscore).
geneScores: A data.frame with gene-level abundances, extended with the calculated JCC scores (jccscore column).

Plot observed and predicted junction coverages

To investigate the agreement between the observed and predicted junction coverages for a given gene, we can plot the uniqreads and scaledCov columns from the junction coverage data frame generated in the previous step:

selGene <- "ENSG00000070371"
ggplot2::ggplot(as.data.frame(jcc$junctionCovs) %>%
                  dplyr::filter(gene == selGene), 
                ggplot2::aes(x = uniqreads, y = scaledCov)) + 
  ggplot2::geom_abline(intercept = 0, slope = 1) + 
  ggplot2::geom_point(size = 5) + 
  ggplot2::theme_bw() + 
  ggplot2::xlab("Number of uniquely mapped reads spanning junction") + 
  ggplot2::ylab("Scaled predicted junction coverage") + 
  ggplot2::ggtitle(unique(jcc$junctionCovs %>% 
                            dplyr::filter(gene == selGene) %>%
                            dplyr::pull(methodscore)))

The package also contains a function to generate a multi-panel plot, showing (i) the observed genome coverage profile in the region of the gene, (ii) the estimated transcript abundances, and (iii) the association between the observed and predicted junction coverages.

bwFile <- system.file("extdata/reads.chr22.bw", package = "jcc")
plotGeneSummary(gtf = gtf, junctionCovs = jcc$junctionCovs, 
                useGene = "ENSG00000070371", bwFile = bwFile,
                txQuants = combCov$txQuants, txCol = "transcript_id",
                geneCol = "gene_id", exonCol = "exon_id")