knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This vignette describes how to use the jcc package to calculate junction coverage compatibility (JCC_ scores [@soneson2018jcc] for a given set of genes. In order to run the entire workflow, the following files are required, and will be used as described below:

```{bash, eval=FALSE} bedtools genomecov -split -ibam -bg > bedGraphToBigWig

where you should replace the text in `<>` by the appropriate file names. 
The `<chrNameLength.txt>` is a text file with the length of each chromosome. 
If you created a genome index with STAR, the index directory contains such a
file.
The information can also be extracted from the header of the BAM file. 

The workflow has mainly been tested with reference files from Ensembl, and may
need to be adjusted to work with reference files in other formats.

The general procedure for estimating junction coverage compatibility (JCC)
scores for a set of genes is outlined in the figure below.
Please refer to the paper [@soneson2018jcc] for a more detailed description and
motivation of the different steps.

<img src="study_design.png" alt="drawing" width="400px"/>

In the rest of this vignette, we will illustrate each step using a small example
data set provided with the package.

# Fit fragment bias model

The first step in the process is to fit a fragment bias model using the 
`r Biocpkg("alpine")` package.
The `fitAlpineBiasModel()` function provides a wrapper containing the necessary
steps.
Note that the parameters of this function may need to be adapted to be
appropriate for a specific data set. 
In particular, since the example data set only contains a small number of reads
and genes, we set `minCount` and `minLength` to small values to make sure that
some genes will be retained and used to fit the fragment bias model.
However, in a real data set these arguments can often be set to higher values. 
Also, the `BSgenome` object (here, `Hsapiens` from the
`BSgenome.Hsapiens.NCBI.GRCh38` package) should be replaced with the appropriate
object for your organism.
We refer to the help file and the `r Biocpkg("alpine")` package for more
detailed explanations of each parameter.

```r
## Load necessary packages
suppressPackageStartupMessages({
  library(jcc)
  library(dplyr)
  library(BSgenome.Hsapiens.NCBI.GRCh38)
})
## Load gtf file and bam file with aligned reads
gtf <- system.file("extdata/Homo_sapiens.GRCh38.90.chr22.gtf.gz", 
                   package = "jcc")
bam <- system.file("extdata/reads.chr22.bam", package = "jcc")

## Fit fragment bias model
biasMod <- fitAlpineBiasModel(gtf = gtf, bam = bam, organism = "Homo_sapiens",
                              genome = Hsapiens, genomeVersion = "GRCh38",
                              version = 90, minLength = 230, maxLength = 7000,
                              minCount = 10, maxCount = 10000, 
                              subsample = TRUE, nbrSubsample = 30, seed = 1,
                              minSize = NULL, maxSize = 220, 
                              verbose = TRUE)

The output of fitAlpineBiasModel is a list with three objects:

names(biasMod)

biasMod$exonsByTx

biasMod$transcripts

Predict transcript coverage profiles

Next, we use the fitted fragment bias model to predict the coverage profile for each transcript in a set of specified genes. The genes argument can be set to NULL to predict the coverage profiles for all transcripts in the annotation catalog. Note, however, that this is computationally demanding if the number of transcripts is large. The nCores argument can be increased to predict coverage profiles for several transcripts in parallel.

## Load transcript-to-gene conversion table
tx2gene <- readRDS(system.file("extdata/tx2gene.sub.rds", package = "jcc"))
head(tx2gene)

## Predict transcript coverage profiles
predCovProfiles <- predictTxCoverage(biasModel = biasMod$biasModel, 
                                     exonsByTx = biasMod$exonsByTx, 
                                     bam = bam, tx2gene = tx2gene, 
                                     genome = Hsapiens,
                                     genes = c("ENSG00000070371",
                                               "ENSG00000093010"), 
                                     nCores = 1, verbose = TRUE)

length(predCovProfiles)

The output of predictTxCoverage is a list with one element per reference transcript annotated to any of the selected genes. Each element, in turn, is a list with five components:

predCovProfiles[2]

Scale transcript coverage profiles

Based on the predicted transcript coverage profiles obtained above, and a set of estimated transcript abundances, we next determine the number of reads predicted to map across each annotated splice junction.

## Read transcript abundance estimates
txQuants <- readRDS(system.file("extdata/quant.sub.rds", package = "jcc"))

## Scale predicted coverages based on the estimated abundances
txsc <- scaleTxCoverages(txCoverageProfiles = predCovProfiles, 
                         txQuants = txQuants, tx2gene = tx2gene,
                         strandSpecific = TRUE, methodName = "Salmon", 
                         verbose = TRUE)

scaleTxCoverages returns a list with two elements:

names(txsc)

head(txsc$junctionPredCovs)

head(txsc$txQuants)

Combine observed and predicted junction coverages

In order to calculate JCC scores, we need to match the predicted junction coverages obtained above with the observed coverages obtained by aligning the reads to the genome. Assuming the reads were aligned to the genome with STAR, we have a text file (whose name ends with SJ.out.tab) containing the number of reads spanning each annotated junction. The code below shows how to read this file, add column names and combine these observed junction coverages with the predicted ones obtained above. If the reads were aligned to the genome with another aligner than STAR, the observed junction coverages may need to be determined separately and formatted in the appropriate way.

## Read the SJ.out.tab file from STAR, add column names and encode the strand
## information as +/-
jcov <- read.delim(system.file("extdata/sub.SJ.out.tab", package = "jcc"),
                   header = FALSE, as.is = TRUE) %>%
  setNames(c("seqnames", "start", "end", "strand", "motif", "annot",
             "uniqreads", "mmreads", "maxoverhang")) %>% 
  dplyr::mutate(strand = replace(strand, strand == 1, "+")) %>%
  dplyr::mutate(strand = replace(strand, strand == 2, "-")) %>%
  dplyr::select(seqnames, start, end, strand, uniqreads, mmreads) %>%
  dplyr::mutate(seqnames = as.character(seqnames))

head(jcov)

## Combine the observed and predicted junction coverages
combCov <- combineCoverages(junctionCounts = jcov, 
                            junctionPredCovs = txsc$junctionPredCovs,
                            txQuants = txsc$txQuants)

The output of combineCoverages is a list with three elements:

names(combCov)

head(combCov$junctionCovs)

head(combCov$txQuants)

head(combCov$geneQuants)

Calculate JCC scores

Finally, we calculate scaled predicted junction coverages and summarize the differences between the observed and predicted junction coverages for each gene by the JCC (junction coverage compatibility) scores.

## Calculate a JCC score for each gene
jcc <- calculateJCCScores(junctionCovs = combCov$junctionCovs, 
                          geneQuants = combCov$geneQuants)

head(jcc$junctionCovs)

head(jcc$geneScores)

calculateJCCScores returns a list with two elements:

Plot observed and predicted junction coverages

To investigate the agreement between the observed and predicted junction coverages for a given gene, we can plot the uniqreads and scaledCov columns from the junction coverage data frame generated in the previous step:

selGene <- "ENSG00000070371"
ggplot2::ggplot(as.data.frame(jcc$junctionCovs) %>%
                  dplyr::filter(gene == selGene), 
                ggplot2::aes(x = uniqreads, y = scaledCov)) + 
  ggplot2::geom_abline(intercept = 0, slope = 1) + 
  ggplot2::geom_point(size = 5) + 
  ggplot2::theme_bw() + 
  ggplot2::xlab("Number of uniquely mapped reads spanning junction") + 
  ggplot2::ylab("Scaled predicted junction coverage") + 
  ggplot2::ggtitle(unique(jcc$junctionCovs %>% 
                            dplyr::filter(gene == selGene) %>%
                            dplyr::pull(methodscore)))

The package also contains a function to generate a multi-panel plot, showing (i) the observed genome coverage profile in the region of the gene, (ii) the estimated transcript abundances, and (iii) the association between the observed and predicted junction coverages.

bwFile <- system.file("extdata/reads.chr22.bw", package = "jcc")
plotGeneSummary(gtf = gtf, junctionCovs = jcc$junctionCovs, 
                useGene = "ENSG00000070371", bwFile = bwFile,
                txQuants = combCov$txQuants, txCol = "transcript_id",
                geneCol = "gene_id", exonCol = "exon_id")

Session info

sessionInfo()

References



csoneson/jcc documentation built on Dec. 25, 2024, 2:32 a.m.