knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette describes how to use the jcc
package to calculate junction
coverage compatibility (JCC_ scores [@soneson2018jcc] for a given set of genes.
In order to run the entire workflow, the following files are required, and will
be used as described below:
r
Biocpkg("alpine")
[@love2016alpine] as well as to obtain the predicted
transcript coverage profiles.data.frame
with the number of uniquely mapping and multimapping reads
aligned across each annotated splice junction.
Below we show how to generate a correctly formatted data frame from the
SJ.out.tab
text file generated by STAR.BSgenome
object for the genome that the reads in the BAM file were aligned
to.gtf
file corresponding to the BSgenome
and the BAM file.data.frame
with at least two columns, named
tx
and gene
. Additional columns providing, e.g., gene symbols can be
included.data.frame
with estimated transcript abundances, with three columns named
transcript
, TPM
and count
.```{bash, eval=FALSE}
bedtools genomecov -split -ibam
where you should replace the text in `<>` by the appropriate file names. The `<chrNameLength.txt>` is a text file with the length of each chromosome. If you created a genome index with STAR, the index directory contains such a file. The information can also be extracted from the header of the BAM file. The workflow has mainly been tested with reference files from Ensembl, and may need to be adjusted to work with reference files in other formats. The general procedure for estimating junction coverage compatibility (JCC) scores for a set of genes is outlined in the figure below. Please refer to the paper [@soneson2018jcc] for a more detailed description and motivation of the different steps. <img src="study_design.png" alt="drawing" width="400px"/> In the rest of this vignette, we will illustrate each step using a small example data set provided with the package. # Fit fragment bias model The first step in the process is to fit a fragment bias model using the `r Biocpkg("alpine")` package. The `fitAlpineBiasModel()` function provides a wrapper containing the necessary steps. Note that the parameters of this function may need to be adapted to be appropriate for a specific data set. In particular, since the example data set only contains a small number of reads and genes, we set `minCount` and `minLength` to small values to make sure that some genes will be retained and used to fit the fragment bias model. However, in a real data set these arguments can often be set to higher values. Also, the `BSgenome` object (here, `Hsapiens` from the `BSgenome.Hsapiens.NCBI.GRCh38` package) should be replaced with the appropriate object for your organism. We refer to the help file and the `r Biocpkg("alpine")` package for more detailed explanations of each parameter. ```r ## Load necessary packages suppressPackageStartupMessages({ library(jcc) library(dplyr) library(BSgenome.Hsapiens.NCBI.GRCh38) })
## Load gtf file and bam file with aligned reads gtf <- system.file("extdata/Homo_sapiens.GRCh38.90.chr22.gtf.gz", package = "jcc") bam <- system.file("extdata/reads.chr22.bam", package = "jcc") ## Fit fragment bias model biasMod <- fitAlpineBiasModel(gtf = gtf, bam = bam, organism = "Homo_sapiens", genome = Hsapiens, genomeVersion = "GRCh38", version = 90, minLength = 230, maxLength = 7000, minCount = 10, maxCount = 10000, subsample = TRUE, nbrSubsample = 30, seed = 1, minSize = NULL, maxSize = 220, verbose = TRUE)
The output of fitAlpineBiasModel
is a list with three objects:
biasModel
: the fitted fragment bias model from r Biocpkg("alpine")
.exonsByTx
: a GRangesList
, grouping exons by transcript.transcripts
: a GRanges
object with all reference transcripts. names(biasMod) biasMod$exonsByTx biasMod$transcripts
Next, we use the fitted fragment bias model to predict the coverage profile for
each transcript in a set of specified genes.
The genes
argument can be set to NULL
to predict the coverage profiles for
all transcripts in the annotation catalog.
Note, however, that this is computationally demanding if the number of
transcripts is large.
The nCores
argument can be increased to predict coverage profiles for several
transcripts in parallel.
## Load transcript-to-gene conversion table tx2gene <- readRDS(system.file("extdata/tx2gene.sub.rds", package = "jcc")) head(tx2gene) ## Predict transcript coverage profiles predCovProfiles <- predictTxCoverage(biasModel = biasMod$biasModel, exonsByTx = biasMod$exonsByTx, bam = bam, tx2gene = tx2gene, genome = Hsapiens, genes = c("ENSG00000070371", "ENSG00000093010"), nCores = 1, verbose = TRUE) length(predCovProfiles)
The output of predictTxCoverage
is a list with one element per reference
transcript annotated to any of the selected genes.
Each element, in turn, is a list with five components:
predCov
: an Rle
with the predicted coverage at each position in the
transcript.strand
: the strand of the transcript.junctionCov
: a GRanges
object with the location and predicted coverage of
each junction in the transcript.aveFragLength
: the estimated average fragment length.note
: either 'covOK', 'covError' or 'covNA'.
If 'covOK', r Biocpkg("alpine")
was able to predict a coverage profile.
If 'covError' or 'covNA', the coverage profile could not be predicted, most
likely because the transcript is shorter than the fragment length or because no
reads in the BAM file overlapped the transcript.
In both these cases, we impose a uniform coverage profile for the transcript. predCovProfiles[2]
Based on the predicted transcript coverage profiles obtained above, and a set of estimated transcript abundances, we next determine the number of reads predicted to map across each annotated splice junction.
## Read transcript abundance estimates txQuants <- readRDS(system.file("extdata/quant.sub.rds", package = "jcc")) ## Scale predicted coverages based on the estimated abundances txsc <- scaleTxCoverages(txCoverageProfiles = predCovProfiles, txQuants = txQuants, tx2gene = tx2gene, strandSpecific = TRUE, methodName = "Salmon", verbose = TRUE)
scaleTxCoverages
returns a list with two elements:
junctionPredCovs
: A data.frame
with the predicted coverage for each
junction, scaled by the estimated transcript abundances and summarized across
all transcripts.txQuants
: A data.frame
with estimated transcript abundances, including
information about whether or not the coverage profile could be predicted
('covOK') or if a uniform coverage was imposed ('covError' or 'covNA').names(txsc) head(txsc$junctionPredCovs) head(txsc$txQuants)
In order to calculate JCC scores, we need to match the predicted junction
coverages obtained above with the observed coverages obtained by aligning the
reads to the genome.
Assuming the reads were aligned to the genome with STAR, we have a text file
(whose name ends with SJ.out.tab
) containing the number of reads spanning each
annotated junction.
The code below shows how to read this file, add column names and combine these
observed junction coverages with the predicted ones obtained above.
If the reads were aligned to the genome with another aligner than STAR, the
observed junction coverages may need to be determined separately and formatted
in the appropriate way.
## Read the SJ.out.tab file from STAR, add column names and encode the strand ## information as +/- jcov <- read.delim(system.file("extdata/sub.SJ.out.tab", package = "jcc"), header = FALSE, as.is = TRUE) %>% setNames(c("seqnames", "start", "end", "strand", "motif", "annot", "uniqreads", "mmreads", "maxoverhang")) %>% dplyr::mutate(strand = replace(strand, strand == 1, "+")) %>% dplyr::mutate(strand = replace(strand, strand == 2, "-")) %>% dplyr::select(seqnames, start, end, strand, uniqreads, mmreads) %>% dplyr::mutate(seqnames = as.character(seqnames)) head(jcov) ## Combine the observed and predicted junction coverages combCov <- combineCoverages(junctionCounts = jcov, junctionPredCovs = txsc$junctionPredCovs, txQuants = txsc$txQuants)
The output of combineCoverages
is a list with three elements:
junctionCovs
: A data.frame
with the observed and predicted coverages for
each junction, aggregated across all transcripts. The predCov
column contains
the predicted number of spanning reads, and the uniqreads
column will be used
to represent the observed number of junction-spanning reads.txQuants
: A data.frame
with estimated transcript abundances, including
information about whether or not the coverage profile could be predicted
('covOK') or if a uniform coverage was imposed ('covError' or 'covNA').geneQuants
: A data.frame
with estimated gene abundances as well as the
total number uniquely mapping and multi-mapping reads spanning any of the
junctions in the gene.names(combCov) head(combCov$junctionCovs) head(combCov$txQuants) head(combCov$geneQuants)
Finally, we calculate scaled predicted junction coverages and summarize the differences between the observed and predicted junction coverages for each gene by the JCC (junction coverage compatibility) scores.
## Calculate a JCC score for each gene jcc <- calculateJCCScores(junctionCovs = combCov$junctionCovs, geneQuants = combCov$geneQuants) head(jcc$junctionCovs) head(jcc$geneScores)
calculateJCCScores
returns a list with two elements:
junctionCovs
: A data.frame
with observed and predicted junction coverages,
as well as scaled predicted coverages (scaledCov
column) and JCC scores for
the corresponding gene (jccscore
).geneScores
: A data.frame
with gene-level abundances, extended with the
calculated JCC scores (jccscore
column).To investigate the agreement between the observed and predicted junction
coverages for a given gene, we can plot the uniqreads
and scaledCov
columns
from the junction coverage data frame generated in the previous step:
selGene <- "ENSG00000070371" ggplot2::ggplot(as.data.frame(jcc$junctionCovs) %>% dplyr::filter(gene == selGene), ggplot2::aes(x = uniqreads, y = scaledCov)) + ggplot2::geom_abline(intercept = 0, slope = 1) + ggplot2::geom_point(size = 5) + ggplot2::theme_bw() + ggplot2::xlab("Number of uniquely mapped reads spanning junction") + ggplot2::ylab("Scaled predicted junction coverage") + ggplot2::ggtitle(unique(jcc$junctionCovs %>% dplyr::filter(gene == selGene) %>% dplyr::pull(methodscore)))
The package also contains a function to generate a multi-panel plot, showing (i) the observed genome coverage profile in the region of the gene, (ii) the estimated transcript abundances, and (iii) the association between the observed and predicted junction coverages.
bwFile <- system.file("extdata/reads.chr22.bw", package = "jcc") plotGeneSummary(gtf = gtf, junctionCovs = jcc$junctionCovs, useGene = "ENSG00000070371", bwFile = bwFile, txQuants = combCov$txQuants, txCol = "transcript_id", geneCol = "gene_id", exonCol = "exon_id")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.