knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The R package splice2neo provides multiple functions for the analysis of splice junctions. The main focus is on the association of splice junctions with somatic mutations. The package integrates the output of several tools that predict splicing effects from mutations. It annotates these mutation effects with the resulting splicing event (alternative 3’/5’ splice sites, exons skipping, intron retentions) and the resulting splice junctions in a standardized format. Expressed splice junctions detected by external tools from RNA-seq data can be parsed and integrated. Splice junctions can be filtered against canonical splice junctions and annotated with the altered transcript sequences, CDS, and resulting peptide sequences. Splice2neo was mainly developed to support the detection of neoantigen candidates from tumor-specific splice junctions. Neoantigens are tumor-specific mutated peptides recognized by T-cells and utilized in immunotherapies against tumors [@lang_identification_2022]. However, most functions from the splice2neo package are also useful for other types of splicing analysis.
In this vignette, we use use some toy example data to demonstrate how splice2neo can be used to analyze splice-junctions which are derived from mutation predictions as tumor-specific neoantigen candidates [@lang_prediction_2023].
The R package splice2neo is not yet on CRAN or Bioconductor. Therefore, you need to install splice2neo from github.
if (!requireNamespace("remotes", quietly = TRUE)) { install.packages("remotes") } remotes::install_github("TRON-Bioinformatics/splice2neo")
Integrating splice2neo functions and detection rules based on splice effect scores and RNA-seq support facilitates the identification of mutation-associated splice junctions which are tumor-specific and can encode neoantigen candidates. The following workflow explains how to use splice2neo to predict neoantigen candidates from mutation-associated splice junctions using example data.
library(splice2neo) library(dplyr)
A database of transcripts is required for transcript annotation with splice2neo. Here, we show how to build the database from a GTF file that is a subset of GENCODE annotations, but users can build these databases also from other resources.
# use gtf file of choice and transform into transcript database gtf_file <- system.file("extdata/example/gencode.v34lift37.annotation.subset.gtf.gz", package="splice2neo") # parse GTF file as txdb object txdb <- GenomicFeatures::makeTxDbFromGFF(gtf_file)
# optionally save database for later re-use # saveDb(txdb, file = "/path/to/transripts/txdb.sqlite")
# Build transcript database transcripts <- GenomicFeatures::exonsBy(txdb, by = c("tx"), use.names = TRUE) transcripts_gr <- GenomicFeatures::transcripts(txdb, columns = c("gene_id", "tx_id", "tx_name")) # Build cds database cds <- GenomicFeatures::cdsBy(txdb, by = c("tx"), use.name = TRUE)
We can retrieve the canonical splice junctions between all adjacent exons from a list of transcripts with canonical_junctions()
.
ref_junc <- canonical_junctions(transcripts) str(ref_junc)
In splice2neo, splice junctions are represented by a junc_id
, which is defined by the genomic coordinates of the exon positions (1-based) that are combined through the junction and the transcription strand of the transcript: chr:pos1-pos2:strand
.
This format allows direct conversion to GRangs
objects from the GenomicRanges package.
gr <- GenomicRanges::GRanges(ref_junc) gr
A given GRanges object with ranges of splice junctions can be converted into a vector of `junc_id``s:
junc_id <- as.character(gr) head(junc_id)
Splice2neo transforms splice events identified in RNA-seq into a standardized junction format. We show here how to parse the tools LeafCutter and SplAdder.
The output of the tool LeafCutter [@li_annotation-free_2018] can be transformed using leafcutter_transform()
:
leafcutter_path <- system.file("extdata/example/leafcutter", package="splice2neo") leafcutter <- leafcutter_transform(path = leafcutter_path) head(leafcutter)
The output of the tool SplAdder [@kahles_spladder_2016] can be transformed using spladder_transform()
:
spladder_path <- system.file("extdata/example/spladder", package="splice2neo") spladder <- spladder_transform(path = spladder_path) head(spladder)
Now we want to get a single data.frame
of all splice junctions identified by RNA-seq tools and information by which tool they were identified.
rna_junctions <- generate_combined_dataset(list("spladder_juncs" = spladder, "leafcutter_juncs" = leafcutter)) head(rna_junctions)
Deep learning tools can predict the effect of a mutation on the disruption or creation of canonical splicing sequence motifs. This is typically provided as gain or loss of splicing acceptor or donor sites. Splice2neo can transform and annotate such mutation effects into a unified splice junction format.
Currently, the transformation of predictions by SpliceAI [@jaganathan_predicting_2019], MMSplice[@cheng_mmsplice_2019] and Pangolin[@zeng_predicting_2022] are supported.
Here, we show how to annotate results from SpliceAI and MMSplice. The annotation of Pangolin results works analogously with SpliceAI.
Effect predictions by SpliceAI can be read with parse_spliceai()
spliceai_file <- system.file("extdata/example/spliceAI.vcf", package="splice2neo") # parse SpliceAI results spliceai <- parse_spliceai(spliceai_file) head(spliceai)
Next, SpliceAi effects can be transformed into a standardized effect dataframe with format_spliceai()
. Here, we can optionally provide a data.frame
(gene_table
) which relates gene symbols with GENCODE gene ids. If gene_table
is provided, the formatted data.frame
contains the GENCODE gene ids. While this is advised to better select relevant transcripts, it is not required.
gene_table <- readr::read_tsv(system.file("extdata/example/gene_table.tsv", package = "splice2neo")) head(gene_table)
# format SpliceAI results splicai_formatted <- format_spliceai(spliceai, gene_table = gene_table) head(splicai_formatted)
Predicted SpliceAI effects can be annotated with all potential resulting junctions and transcripts with annotate_mut_effect()
. If the input was formatted using gene_table
, the user may want to choose gene_mapping = TRUE
which removes unlikely events from the results. Otherwise, the default gene_mapping = FALSE
should be used.
The resulting dataframe contains all potential alternative transcripts per row, meaning that there can be multiple annotated transcripts per junction and also multiple junctions per mutation effect. At this point, no filtering based on effect size has been applied meaning that the dataframe contains unlikely events.
# annotate SpliceAI junctions with all potential junctions and transcripts spliceai_annotated <- annotate_mut_effect( effect_df = splicai_formatted, transcripts = transcripts, transcripts_gr = transcripts_gr, gene_mapping = TRUE ) spliceai_annotated %>% dplyr::select(mut_id, junc_id, gene_id, tx_id, effect, score) %>% head()
It can happen that multiple effects lead to the same altered transcripts (defined by mut_id
, junc_id
, tx_id
). The function unique_mut_junc()
can be used to filter for unique altered transcripts by keeping the effect with the highest effect score.
# unique altered transcripts spliceai_annotated_unique <- unique_mut_junc(spliceai_annotated) nrow(spliceai_annotated_unique) < nrow(spliceai_annotated)
We can read the effect predictions by MMSplice with parse_mmsplice()
mmsplice_file <- system.file("extdata/example/MMsplice.csv", package="splice2neo") # parse MMSplice results mmsplice <- parse_mmsplice(mmsplice_file) head(mmsplice)
The MMSplice data can be annotated with all potential splice junctions using annotate_mmsplice()
mmsplice_annotated <- mmsplice %>% annotate_mmsplice(transcripts = transcripts) head(mmsplice_annotated)
Junctions predicted from MMSplice results can contain duplicated altered transcripts for exon inclusion events. We use the function unique_junc_mmsplice()
to filter for unique altered transcripts by keeping the effect with the highest effect score.
# unique altered transcripts mmsplice_annotated_unique <- unique_junc_mmsplice(mmsplice_annotated) nrow(mmsplice_annotated_unique) < nrow(mmsplice_annotated)
Next, we combine splice junctions from SpliceAI and MMSplice into one dataframe using combine_mut_junc()
. This dataframe will contain unique altered transcripts (defined by mut_id
, junc_id
, tx_id
) in rows. Tool specific information are defined in columns starting with the respective tool name.
mut_junctions <- combine_mut_junc(list( "spliceai" = spliceai_annotated_unique, "mmsplice" = mmsplice_annotated_unique )) mut_junctions %>% dplyr::select(mut_id, junc_id, tx_id, spliceai_event_type, spliceai_score) %>% head()
To exclude junctions expressed in healthy tissues, we here want to filter predicted junctions for non-canonical splice junctions. We can use for instance exon-exon junctions from GENCODE and junctions found in healthy tisssue samples. Splice2neo also provides the function bed_to_junc()
which allows to transform a bed file into a list of splice junctions.
# this is just an example file with a few canonical junctions --> incomplete canonical_juncs_file <- system.file("extdata/example/test_canonical.bed", package="splice2neo") canonical_junctios <- bed_to_junc(bed_file = canonical_juncs_file, type = "exon-exon")
Annotate mutation-retrieved splice junction whether they are canonical splice junctions.
# annotated canonical junctions mut_junctions <- mut_junctions %>% mutate(is_canonical = is_canonical(junc_id, ref_junc = canonical_junctios, exons_gr = transcripts)) mut_junctions %>% count(is_canonical) # remove canonical junctions noncan_mut_junctions <- mut_junctions %>% filter(!is_canonical)
Annotate whether mutation-retrieved splice junction were also detected by a RNA-seq tool.
noncan_mut_junctions <- noncan_mut_junctions %>% mutate(is_in_rnaseq = is_in_rnaseq(junc_id, rna_juncs = rna_junctions$junc_id))
Splice2neo can annotate the resulting transcript sequence for each altered transcript with add_context_seq()
.
Here, we set the size
parameter to 400 to extract the sequence of +/-200 nucleotides around the splice junctions as context sequence.
noncan_mut_junctions <- noncan_mut_junctions %>% add_context_seq( size = 400, bsg = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19, transcripts = transcripts ) noncan_mut_junctions %>% dplyr::select(mut_id, junc_id, tx_id, cts_id, cts_seq) %>% head()
Splice2neo can annotate the resulting peptide sequence for each altered transcript with add_peptide()
. This is relevant to analyze which altered transcript could qualify as neoantigens. We can define by the parameter flanking_size
the number of wild-type amino acids flanking the junction or novel sequence caused by the splice junction to the left and to the right. If the junction leads to a reading frame shift, sequences are flanked by wild-type amino acids to the left and are annotated until the first stop codon.
noncan_mut_junctions <- noncan_mut_junctions %>% add_peptide(flanking_size = 13, bsg = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19, cds = cds) noncan_mut_junctions %>% dplyr::select(mut_id, junc_id, tx_id, junc_in_orf, peptide_context, frame_shift) %>% head()
Relevant intron retention events are particular challenging to predict. Splice2neo provides the function exon_in_intron()
to annotate whether the exon of another transcript is in the genomic region of a predicted retained introns. We do not want to consider them in the downstream analysis as it is difficult to differentiate if the read support from RNA-seq relates to the retained intron or to the exon of the other transcript.
# annotate exons in introns of other transcript noncan_mut_junctions <- noncan_mut_junctions %>% exon_in_intron(transcripts = transcripts)
Altered transcript from predicted splice junctions can be re-quantified in RNA-seq in a targeted manner with EasyQuant (available at https://github.com/TRON-Bioinformatics/easyquant). EasyQuant maps all RNA-seq reads to to a custom reference build from the context sequence around splice junctions. We can generate the input for EasyQuant using the splice2neo function transform_for_requant()
.
EasyQuant should be run in interval mode for the re-quantification of splice junction and it is advised to set the BP_distannce
parameter to 10 (see EasyQuant README).
# transform for EasyQuant easyquant_input <- noncan_mut_junctions %>% transform_for_requant() head(easyquant_input)
After running EasyQuant (which is not shown here), we can import the EasyQuant results with the splice2neo function map_requant()
.
# path to easyquant folder easyquant_path <- system.file("extdata/example/easyquant", package="splice2neo") # merge easyquant results requant_noncan_mut_junctions <- map_requant(path_to_easyquant_folder = easyquant_path, junc_tib = noncan_mut_junctions)
Here, we are interested in different result columns depending on the type of splice event.
In case of alternative splice sites or exon skipping events, want use the number of junction reads that map on the splice junction
(junc_interval_start
) and/or the number of spanning reads that bridge the splice junction (span_interval_start
).
# re-quantification results requant_noncan_mut_junctions %>% filter(spliceai_event_type != "intron retention") %>% dplyr::select(mut_id, junc_id, tx_id, junc_interval_start, span_interval_start) %>% head()
In case of intron retentions, we may want to consider the following columns:
within_interval
: Number of reads that map to the intron of interest coverage_perc
: Relative read coverage of the intron of interest coverage_median
: Median read coverage of the intron of interest coverage_mean
: Mean read coverage of the intron of interest. This value can be misleading by skewed read distribution and the user may rather want to use the median coverage# re-quantification results requant_noncan_mut_junctions %>% filter(spliceai_event_type == "intron retention") %>% dplyr::select( mut_id, junc_id, tx_id, within_interval, coverage_perc, coverage_median, coverage_mean ) %>% head()
# alternative to only import EasyQuant results without merging requantification_results <- read_requant(path_folder = easyquant_path) head(requantification_results)
In the analysis described above we identified mutation-retrieved splice junctions supported by RNA-seq. We have developed a stringent detection rule to predict tumor-specific targets from mutation-associated splice targets based on the mutation effect and the RNA-seq support (See @lang_prediction_2023). Currently, we suggest to focus on events from alternative splice sites and exon skipping and define such splice junctions as targets that have a |splice effect score| $\geq$ 0.35 and are found by LeafCutter/SplAdder or have at least 3 junction reads.
# exclude intron retention potential_targets <- requant_noncan_mut_junctions %>% filter(spliceai_event_type != "intron retention") potential_targets %>% select(junc_id, tx_id, spliceai_score, mmsplice_delta_logit_psi, is_in_rnaseq, junc_interval_start)
# apply detection rule targets <- potential_targets %>% filter(spliceai_score >= 0.35 | mmsplice_delta_logit_psi < -0.35) %>% filter(is_in_rnaseq | junc_interval_start >= 3)
The data in targets
can contain several altered transcripts per junctions which may lead to several neoantigen candidate sequences per target splice junction.
# apply detection rule targets %>% distinct(junc_id, peptide_context) %>% count(junc_id, name = "# neoantigen candidates") %>% arrange(-`# neoantigen candidates`)
Here, we have a list of splice junctions and want to know all potential affected transcripts.
Given a data.frame
of splice junctions (with a column junc_id
), the function add_tx()
annotates all possible transcripts which overlap
with the splice junctions.
junc_df <- dplyr::tibble( junc_id = toy_junc_id[c(1, 6, 10)] ) junc_df <- junc_df %>% add_tx(toy_transcripts) junc_df
This can lead to large data sets containing
many highly unlikely junction transcript combinations.
We can select a subset of transcripts per junction that are more likely
to be affected by a junction with choose_tx()
. Please note that this
function may loose relevant or keep irrelevant junction-transcripts in
particular in regions with multiple isoforms with distinct splicing
pattern.
selected_junc_df <- junc_df %>% choose_tx() selected_junc_df
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.