library(snapcount) ## Track time spent on making the vignette startTime <- Sys.time()
snapcount is directed at current and potential r Biocpkg('recount')
users who want the further ability to quickly filter RNA-seq derived coverage data to just the regions of the human genome and the samples they care about.
This avoids having to download a whole study's worth of data across the entire genome.
While it offers the same datasets that r Biocpkg('recount')
contains, it enables querying via a particular gene or genomic coordinate range entered by the user and only downloading the results from that query.
It then builds a RangedSummarizedExperiment (RSE) object on the fly, based on just the rows/columns in those results.
This query functionality is supported for genes, exons, and exon-exon junctions.
In addition, a set of aggregation functions are available for exon-exon junction data to allow for further filtering and summarizing.
R
is an open-source statistical environment which can be easily modified to enhance its functionality via packages.
R
can be installed on any operating system from CRAN after which you can install snapcount
by using the following commands in your R
session:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("snapcount") ## Check that you have a valid Bioconductor installation BiocManager::valid()
snapcount is based on other packages and in particular in those that have implemented the infrastructure needed for dealing with RNA-seq data. That is, packages like r Biocpkg('GenomicFeatures')
and r Biocpkg('recount')
that allow you to import the data. A snapcount user is not expected to deal with those packages directly but will need to be familiar with r Biocpkg('SummarizedExperiment')
to understand the results snapcount generates. This vignette is based partially on the recount2 quick start guide.
If you are asking yourself the question "Where do I start using Bioconductor?" you might be interested in this blog post.
As package developers, we try to explain clearly how to use our packages and in which order to use the functions. But R
and Bioconductor
have a steep learning curve so it is critical to learn where to ask for help. The blog post quoted above mentions some but we would like to highlight the Bioconductor support site as the main resource for getting help: remember to use the snapcount
tag and check the older posts. Other alternatives are available such as creating GitHub issues and tweeting. However, please note that if you want to receive help you should adhere to the posting guidelines. It is particularly critical that you provide a small reproducible example and your session information so package developers can track down the source of the error.
snapcount is an interface to the Snaptron webservices and data. For questions regarding Snaptron itself and/or its data, please use the Gitter channel.
We hope that snapcount will be useful for your research. Please use the following information to cite the package specifically, thank you!
Christopher Wilks, Rone Charles, Ben Langmead (2019). snapcount: R/Bioconductor Package for interfacing with Snaptron for rapid quering of expression counts. https://github.com/langmead-lab/snapcount
The Snaptron project itself is published separately and can be cited with the following:
Wilks, C, Gaddipati, P, Nellore, A, Langmead, B (2018). Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples. Bioinformatics, 34, 1:114-116.
snapcount makes it easy to query the Snaptron web services, with results presented as r Biocpkg('RangedSummarizedExperiment')
objects.
You can query measurements for genes, exons, splice junctions and coverage vectors from the RNA-seq samples indexed in Snaptron.
Samples are organized into compilations (e.g. srav2
) that altogether contain the same studies and summaries in the r Biocpkg('recount')
resource. Queries can be filtered to narrow the focus to particular genes or genomic intervals, to events with certain prevalence, to events that do or don't appear in gene annotation, or to samples with particular metadata.
snapcount complements the r Biocpkg('recount')
package, which also allows for searching by coordinates or by HUGO gene names.
In general, r Biocpkg('recount')
works best when you are interested in all the genes, exons, or splice junctions in a study, whereas snapcount is best for queries over a particular subset of genes or intervals across all or most of the samples in recount2/Snaptron. The more specific your query, the faster and easier it will be to use snapcount.
All the RNA-seq samples in the recount and Snaptron resources were analyzed using the Rail-RNA aligner. Rail produces spliced alignments that in turn produce coverage and junction-level summaries that are further processed to obtain the data in recount and Snaptron.
Similar to the r Biocpkg('recount')
all coverage counts in Snaptron/snapcount are stored/retrieved as raw, un-normalized counts.
A basic query returns a RangedSummarizedExperiment (RSE) object with one or more features (genes, exons, or splice junctions) as rowRanges. Raw coverage counts are returned as the counts assay in the RSE Full sample metadata is returned as the colData of the RSE
Basic queries include:
The Gencode v25 annotation defines what genes and exons can be queried (as in r Biocpkg('recount')
).
For splice junctions, both annotated and novel junctions are queried at the same time unless one is explicitly filtered.
Metadata columns will vary by compilation (e.g. TCGA vs. GTEx).
Please be cautious with metadata. We strive to make it as complete and usable as possible, but it can still be incomplete, incorrect or inconsistently formatted (e.g. "age" in the srav2
compilation).
Start by querying for all junctions within the region of gene CD99:
##Query coverage for gene, exon, and annotated junctions across all #in the region of the CD99 gene #from GTEx v6 sample compilation #CD99 is chosen for its size sb <- QueryBuilder(compilation="gtex", regions="CD99") cd99.gene <- query_gene(sb) dim(cd99.gene) head(cd99.gene)
Now query for junctions:
##Query all exon-exon splice junctions within the region of gene CD99 cd99.jx.all <- query_jx(sb) dim(cd99.jx.all) cd99.jx.all
Now query for junctions and filter by sample type: Brain:
#now subfilter by sample tissue #GTEx samples that are labeled with tissue type "Brain" sb <- set_column_filters(sb, SMTS == "Brain") cd99.jx.all <- query_jx(sb) dim(cd99.jx.all) head(cd99.jx.all)
Same query/filter again but for exons:
cd99.exon <- query_exon(sb) dim(cd99.exon) head(cd99.exon)
Now query junctions but further filter for only those which are fully annotated:
###Only query junctions which are fully annotated---both left and #right splice sites are found together in one or more of the #Snaptron sourced annotations sb <- set_row_filters(sb, annotated == 1) cd99.jx <- query_jx(sb) dim(cd99.jx) head(cd99.jx)
Now check an example of the metadata stored in the RSE, InsertSize:
##Metadata is stored directly in the RSE object. #For example the library insert size can be retrieved #across all runs in the RSE head(cd99.jx.all$InsertSize)
"High-level" queries answer more complex questions about splicing patterns. Results of high level queries are data frames unless otherwise specified.
PSI measures how often an exon is included in the surrounding transcript.
This query considers only the common case
where a cassette exon has two inclusion junctions (junctions that join the exon to its up- and downstream neighbors) and one exclusion junction (that goes between the neighbors while skipping the exon).
Normal PSI values range from 0 (never included) to 1 (always included).
The value -1 is returned for a
sample where either one or the other inclusion groups had 0 coverage or the total raw coverage
across groups was less than a specified minimum count (min_count
defaults to 20).
#Build new query against GTEx #left inclusion query lq <- QueryBuilder(compilation="gtex", regions="chr19:45297955-45298142") lq <- set_row_filters(lq, strand == "+") lq <- set_coordinate_modifier(lq, Coordinates$Exact) #right inclusion query rq <- QueryBuilder(compilation="gtex", regions="chr19:45298223-45299810") rq <- set_row_filters(rq, strand == "+") rq <- set_coordinate_modifier(rq, Coordinates$Exact) #exclusion query ex <- QueryBuilder(compilation="gtex", regions="chr19:45297955-45299810") ex <- set_row_filters(ex, strand == "+") ex <- set_coordinate_modifier(ex, Coordinates$Exact) psi <- percent_spliced_in(list(lq), list(rq), list(ex)) #order by psi descending psi <- psi[order(-psi),] head(psi)
Junction Inclusion Ratio (JIR) measures the relative prevalence of two splicing patterns. It is similar to, but more general than the PSI query. The query defines two sets of junctions to be compared. The sets may or may not have some junctions in common. JIR ranges from -1.0 to 1.0, where 0 says that the total number of reads supporting the junctions in set A equals the number of reads supporting set-B junctions. JIR approaches 1 when there are many more reads supporting the set-B junctions compared to the set-A junctions. JIR approaches -1 when set-A junctions have many more supporting reads than set-B junctions. The list of results is presented in descending order by JIR.
#groupA A <- QueryBuilder(compilation="srav2", regions="chr2:29446395-30142858") A <- set_row_filters(A, strand == "-") A <- set_coordinate_modifier(A, Coordinates$Within) #groupB B <- QueryBuilder(compilation="srav2", regions="chr2:29416789-29446394") B <- set_row_filters(B, strand == "-") B <- set_coordinate_modifier(B, Coordinates$Within) jir <- junction_inclusion_ratio(list(A), list(B), group_names=c("groupA","groupB")) head(jir)
SSC reports the number of samples where one or more sets of junctions co-occur. We say two junctions co-occur in a sample of they both have at least one supporting read in that sample. The junctions in the set might describe a particular alternative splicing event, but they do not have to. The sets of junctions are defined using basic queries.
## We define the left/right splice junction supports of 3 #cassette exons for the following 3 genes: ## "left"/"right" is relative to the forward strand, #reference genome coordinates, #we use this terminology instead of 5' or 3' #since the gene may be on the reverse strand. ###GNB1 GNB1l <- QueryBuilder(compilation="gtex", regions="chr1:1879786-1879786") GNB1l <- set_row_filters(GNB1l, strand == "-") GNB1l <- set_coordinate_modifier(GNB1l, Coordinates$EndIsExactOrWithin) GNB1r <- QueryBuilder(compilation="gtex", regions="chr1:1879903-1879903") GNB1r <- set_row_filters(GNB1r, strand == "-") GNB1r <- set_coordinate_modifier(GNB1r, Coordinates$StartIsExactOrWithin) ###PIK3CD PIK3l <- QueryBuilder(compilation="gtex", regions="chr1:9664595-9664595") PIK3l <- set_row_filters(PIK3l, strand == "+") PIK3l <- set_coordinate_modifier(PIK3l, Coordinates$EndIsExactOrWithin) PIK3r <- QueryBuilder(compilation="gtex", regions="chr1:9664759-9664759") PIK3r <- set_row_filters(PIK3r, strand == "+") PIK3r <- set_coordinate_modifier(PIK3r, Coordinates$StartIsExactOrWithin) ###TAP2 T2l <- QueryBuilder(compilation="gtex", regions="chr6:32831148-32831148") T2l <- set_row_filters(T2l, strand == "-") T2l <- set_coordinate_modifier(T2l, Coordinates$EndIsExactOrWithin) T2r <- QueryBuilder(compilation="gtex", regions="chr6:32831182-32831182") T2r <- set_row_filters(T2r, strand == "-") T2r <- set_coordinate_modifier(T2r, Coordinates$StartIsExactOrWithin) ssc_func <- shared_sample_counts ssc <- ssc_func(list(GNB1l, GNB1r), list(PIK3l, PIK3r), list(T2l, T2r), group_names=c("validated","not validated","validated")) ssc
This query asks if a particular set of junctions seems to be associated with tissue type.
This query defaults to using the gtex
compilation, but it could work with any compilation having a tissue
metadata field (e.g. tcga
).
The main output is a list of all the samples in the compilation with either a 0 or a 1 in the shared column. 1 indicates that a junction from the query occurs in that sample (tissue type).
This output can be passed to a statistical test such as an F-test to assess the significance of tissue enrichment.
A <- QueryBuilder(compilation="gtex", regions="chr4:20763023-20763023") A <- set_row_filters(A, strand == "-") A <- set_coordinate_modifier(A, Coordinates$EndIsExactOrWithin) ts <- tissue_specificity(list(A)) head(ts)
The TS function also supports more than one set of junctions.
The main output is a list of all the samples in the compilation with either a 0 or a 1 in the shared column. 1 here indicates that at least one junction from all sets occurred in that sample, i.e. the sample is shared across the results of all the junction queries.
A use case of the 2-set TS function is to measure the significance of tissue enrichment across junctions which support a cassette exon.
A <- QueryBuilder(compilation="gtex", regions="chr4:20763023-20763023") A <- set_row_filters(A, strand == "-") A <- set_coordinate_modifier(A, Coordinates$EndIsExactOrWithin) B <- QueryBuilder(compilation="gtex", regions="chr4:20763098-20763098") B <- set_row_filters(B, strand == "-") B <- set_coordinate_modifier(B, Coordinates$StartIsExactOrWithin) ts <- tissue_specificity(list(A, B)) head(ts)
The junction union query provides the ability to get a single RSE object of non-redundant junctions across multiple compilations which share a common reference (e.g. HG38). Junctions which are the same (have the same chromosome coordinates and strand) are merged into a single row.
An example would be search for the same junctions across the human HG38-based compilations, GTEx and TCGA. This can be used to efficiently leverage GTEx as a substitute "normal" control, while searching for splicing phenomena specific to the tumor samples in TCGA.
In the following example we query for all junctions that have a specific 5' (donor) coordinate in both compilations, merging the results. Note, this query, due to the number of samples involved (approximately 18,000) takes several seconds to complete.
gtex <- QueryBuilder(compilation="gtex", regions="chr1:1879786-1879786") gtex <- set_row_filters(gtex, strand == "-") gtex <- set_coordinate_modifier(gtex, Coordinates$EndIsExactOrWithin) tcga <- QueryBuilder(compilation="tcga", regions="chr1:1879786-1879786") tcga <- set_row_filters(tcga, strand == "-") tcga <- set_coordinate_modifier(tcga, Coordinates$EndIsExactOrWithin) gtex_tcga_union <- junction_union(gtex,tcga) dim(gtex_tcga_union) head(gtex_tcga_union)
The junction intersection query provides similar functionality but restricting the output RSE object to only those junctions found in both compilations.
Using the same scenario as above, but this time the result set will be reduced.
gtex <- QueryBuilder(compilation="gtex", regions="chr1:1879786-1879786") gtex <- set_row_filters(gtex, strand == "-") gtex <- set_coordinate_modifier(gtex, Coordinates$EndIsExactOrWithin) tcga <- QueryBuilder(compilation="tcga", regions="chr1:1879786-1879786") tcga <- set_row_filters(tcga, strand == "-") tcga <- set_coordinate_modifier(tcga, Coordinates$EndIsExactOrWithin) gtex_tcga_intersection <- junction_intersection(gtex,tcga) dim(gtex_tcga_intersection) head(gtex_tcga_intersection)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.