The trellis package was developed to identify somatic structural variants in tumor-only analyses of whole genome sequencing data. The types of somatic structural alterations identified by trellis include deletions, amplifications, intra- and inter-chromosomal rearrangements, inversions, and in-frame fusions. High copy amplicons are linked using the mate information from read pair analyses, simplifying the interpretation of the likely drivers. Additional packages required for this vignette are listed below, including the r Githubpkg("rscharpf/svfilters.hg19")
package that contains various sequence filters for structural variant analyses and the r Rpackage("svbams")
package that contains subsampled data used to illustrate various aspects of data processing.
library(Rsamtools) library(magrittr) library(Rsamtools) library(ggplot2) library(svbams) library(svfilters.hg19) library(tidyverse) library(gridExtra) library(trellis) library(ggnet) library(graph) library(Rgraphviz) library(network) library(sna) library(kableExtra) library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.refGene)
We assume that a collection of sorted and indexed BAM files are available with PCR duplicates marked. For this vignette, we provide a heavily subsampled bam file in the svbams
package that has already been indexed and sorted.
This section provides code for extracting all read pairs with aberrant orientation or spacing with respect to the reference genome (improper read pairs) and for computing read-depth in non-overlapping bins along the genome. These two aspects of preprocessing can be performed in parallel and are ideally suited for submission as batch jobs on a large computing cluster with one core dedicated to the processing of each sample.
The following code chunk extract all improperly paired reads in a bam file, returning a GAlignmentPairs
object where the first
and last
slots store the alignment of the first read and its mate. The time required to extract these improperly paired reads depends on the size of the bam file.
extdata <- system.file("extdata", package="svbams") bamfile <- file.path(extdata, "cgov44t_revised.bam") what <- c("flag", "mrnm", "mpos", "mapq") iparams <- improperAlignmentParams(what=what) improper_rp <- getImproperAlignmentPairs(bamfile, param=iparams, build="hg19")
To compute normalized read depth along the genome, we first define a set of non-overlapping 1kb bins that have a average mappability of at least 0.75. These bins are provided for builds hg18 and hg19 of the reference genome in the \R{} package sv.filters.<build>
. For computational speed, we focus on one part of chromosome 15. The function binnedCounts
is a wrapper for the countBam
function provided in the r Biocpkg("Rsamtools")
package.
data(bins1kb, package="svfilters.hg19") bins <- keepSeqlevels(bins1kb, "chr15", pruning.mode="coarse") flags <- scanBamFlag(isDuplicate=FALSE, isPaired=TRUE, isUnmappedQuery=FALSE, hasUnmappedMate=FALSE, isProperPair=TRUE) bviews <- BamViews(bamPaths=bamfile, bamRanges=bins) bins$cnt <- binnedCounts(bviews)
Bins with zero counts are a consequence of the artificially small bam file included in the svbams
package and we exclude these bins from further analysis.
bins <- bins[ bins$cnt > 0 ]
Next, we log-transform the raw counts and correct for GC content by loess. As a random subset of the genome is selected to model GC biases, we suggest setting a seed prior to GC correction.
bins$std_cnt <- binNormalize(bins) set.seed(123) bins$log_ratio <- binGCCorrect(bins)
If one or more normal controls is available, the normalized bin-counts can be further corrected to reduce autocorrelation. Since much of the bin-level technical variation is batch-specific, we advise against the use of historical controls for additional normalization. Below, we segment the $\log_2$ ratios by circular binary segmentation (CBS). See ?DNAcopy::segment
for a complete list of parameters available in the SegmentParam
object. Note, segmentBins
assumes a variable called log_ratio
in the GRanges
object. The object returned by segmentBins
is a GRanges
object with segment means of log-normalized coverage in the seg.mean
column.
g <- segmentBins(bins, param=SegmentParam()) g
As the above steps were restricted to a very small region of the genome, the log ratios obtained from the preceding normalization and GC correction are not interpretable as fold changes from the modal genome ploidy. Below, we load log ratios and the segmented data obtained from processing the full genome and then restrict to chromosomes 5, 8, and 15 for illustrating downstream analyses.
data(bins1kb, package="svfilters.hg19") ddir <- system.file("extdata", package="svbams", mustWork=TRUE) lr <- readRDS(file.path(ddir, "preprocessed_coverage.rds"))/1000 seqlevels(bins1kb, pruning.mode="coarse") <- paste0("chr", c(1:22, "X")) bins1kb$log_ratio <- lr bins <- keepSeqlevels(bins1kb, c("chr5", "chr8", "chr15"), pruning.mode="coarse") path <- system.file("extdata", package="svbams") segs <- readRDS(file.path(path, "cgov44t_segments.rds")) seqlevels(segs, pruning.mode="coarse") <- seqlevels(bins)
The data for chromosomes 5, 8, and 15:
segs.df <- as_tibble(segs) as_tibble(bins) %>% filter(!is.na(log_ratio)) %>% ggplot + geom_point(aes(start/1e6, log_ratio), shape = ".", color="gray") + xlab("Coordinate") + ylab("log2 normalized counts") + coord_cartesian(ylim=c(-4, 2.5)) + facet_grid(~seqnames, space="free", scales="free_x") + xlab("") + theme_bw() + geom_segment(data=segs.df, aes(x=start/1e6, xend=end/1e6, y=seg.mean, yend=seg.mean), color="black")
Candidate somatic deletions are identified from the improperly paired reads and the normalized read depth. Below, we create an object of class DeletionParm
containing a list of parameters needed for the deletion analysis. In addition, we delineate a list of the candidate hemizygous and homozygous deletions from the segmented data. For each candidate regions, we extract properly paired reads and improperly paired reads within 2kb of the breakpoints.
dp <- DeletionParam(remove_hemizygous=FALSE) dp del.gr <- IRanges::reduce(segs[segs$seg.mean < hemizygousThr(dp)], min.gapwidth=2000) proper_rp <- properReadPairs(bamfile, gr=del.gr, dp) improper_rp <- keepSeqlevels(improper_rp, seqlevels(segs), pruning.mode="coarse") read_pairs <- list(proper_del=proper_rp, improper=improper_rp)
The bin-level summaries ($\log_2$ ratios), the segmentation data, and read pair data are collected in a single object by the function preprocessData
.
pdata <- preprocessData(bam.file=bamfile, genome="hg19", bins=bins, segments=segs, read_pairs=read_pairs)
The sv_deletions
function creates an object of class StructuralVariant
that contains the deletion classification of each segment. Possible classifications include homozygous deletion (homozygous
), homozygous deletion supported by improperly paired reads (homozygous+
), hemizygous deletion (hemizygous
), and hemizygous deletion supported by improperly paired reads (hemizygous+
). For identifying somatic deletions without a matched normal, we typically exclude hemizygous deletions not supported by improperly paired reads.
if(FALSE){ temp <- list(pdata=pdata, dp=dp) saveRDS(temp, file="../tests/testthat/temp.rds") } deletions <- sv_deletions(preprocess=pdata, param=dp) variant(deletions) calls(deletions)
The improperly-paired reads supporting the homozygous+
call can be extracted as a GAlignmentPairs
object by the improper
function (Figure \@ref(fig:deletionplot)).
improper(deletions[[4]])
roi <- variant(deletions)[4] roi2 <- roi + 200e3 bins.del <- subsetByOverlaps(bins1kb, roi2) ylim <- c(-3, 2) df <- tibble(logr=bins.del$log_ratio, start=start(bins.del)) df$logr <- trellis::threshold(df$logr, ylim, amount=0.4) brks <- pretty(df$start, n=8) region <- subsetByOverlaps(segs, roi) region <- region[region$seg.mean < -1] region <- as.tibble(region) segs.df <- subsetByOverlaps(segs, roi2) %>% as.tibble xlim <- c(start(roi2), end(roi2)) C <- ggplot(df, aes(start, logr)) + geom_point(size=0.7, color="gray50") + scale_x_continuous(expand=c(0,0), breaks=brks, labels=brks/1e6)+ scale_y_continuous(expand=c(0,0)) + geom_segment(data=segs.df, aes(x=start, xend=end, y=seg.mean, yend=seg.mean), size=1) + coord_cartesian(xlim=xlim, ylim=ylim) + ylab(expression(log[2]~ratio)) + geom_rect(data=region, aes(xmin=start, xmax=end, ymin=-Inf, ymax=+Inf), fill="steelblue", color="transparent", alpha=0.3, inherit.aes=FALSE) + theme(axis.text=element_text(size=10), axis.line=element_line(color="black"), panel.background=element_rect(fill="white")) + xlab("Mb") + annotate("text", x=xlim[1] + 15e3, y=-8, label="chr15", size=3) rps <- thinReadPairs(deletions[4]) rps <- trellis:::meltReadPairs(rps) colors <- c("#0072B2", "#009E73") p <- ggplot(rps, aes(ymin=readpair-0.2, ymax=readpair+0.2, xmin=start/1e6, xmax=end/1e6, color=read, fill=read, group=readpair)) + geom_rect() + xlim(c(min(rps$start), max(rps$end))/1e6) + geom_line(aes(x=start/1e6, y=readpair), size=0.5) + ylab("Read pair index") + scale_x_continuous(breaks=scales::pretty_breaks(5)) + geom_rect(data=region, aes(xmin=start/1e6, xmax=end/1e6, ymin=-Inf, ymax=+Inf), fill="steelblue", color="transparent", alpha=0.2, inherit.aes=FALSE) + scale_color_manual(values=colors) + scale_fill_manual(values=colors) + xlab("Mb") + theme(panel.background=element_rect(fill="white"), axis.line=element_line(color="black")) + geom_vline(xintercept= c(start(roi), end(roi))/1e6, linetype="dashed", color="gray") p <- p + guides(fill=FALSE, color=FALSE) grid.arrange(C, p, ncol=1)
Trellis identifies high copy amplifications and then links the individual amplicons by improperly paired reads previously extracted from the bam file. The sv_amplicons2
function constructs an AmpliconGraph
where the nodes are the individual amplicons and an edge between two amplicons indicates multiple paired reads that link the two amplicons in the cancer genome (Figure \@ref(fig:amplicongraph)). The default parameter settings require at least 5 paired reads to support an edge. See ?ampliconParams
for customing these settings.
ag <- sv_amplicons2(pdata, params=ampliconParams()) ag
The ampliconRanges
accessor can be used to extract the genomic intervals (a GRanges
object) of the individual amplicons. The mcols
of the GRanges
object has the HUGO gene symbols of the genes spanned by the intervals (column hgnc
), the amplicon group (column groups
), and the names of known drivers (column driver
). Note, the genes in hgnc
are at the level of the amplicon but the listed drivers are at the level of the amplicon group. The amplicon group for this cancer had the known drivers FGFR4 and MYC.
granges(ampliconRanges(ag)) unique(ampliconRanges(ag)$driver)
n.passengers <- strsplit(ampliconRanges(ag)$hgnc, ", ") %>% unlist %>% "["(! . %in% c("MYC", "FGFR4")) %>% "["(!is.na(.)) %>% length
The amplicon groups simplify the interpretation of the likely drivers as the r n.passengers
other genes in this network are likely passengers.
plot_amplicons <- function (ag, colors){ if (length(ag) == 0) { message("No amplicons -- nothing to plot") return(invisible()) } dark_colors <- c("#332288", "#661100", "#882255", "#AA4499") ar <- nodes(ag) chroms <- sapply(strsplit(ar, ":"), "[", 1) sl <- unique(chroms) L <- length(sl) L <- ifelse(L > 12, 12, L) sl <- factor(chroms, levels = sl) ##color_nodes <- col_list[[L]][as.integer(sl)] color_nodes <- colors[as.integer(sl)] g1 <- graph(ag) color_nodes <- setNames(color_nodes, nodes(g1)) nodenames <- setNames(nodes(g1), nodes(g1)) text_col <- setNames(rep("black", length(nodes(g1))), nodes(g1)) text_col[color_nodes %in% dark_colors] <- "gray90" nodeRenderInfo(g1) <- list(label = nodenames, fill = color_nodes, textCol = text_col) nodeAttrs <- list(fillcolor = color_nodes) attrs <- list(node = list(shape = "rectangle", fixedsize = FALSE), graph = list(rankdir = "LR")) graph_object <- layoutGraph(g1, attrs = attrs, nodeAttrs = nodeAttrs) graph_object } colors <- c("#4477AA", "#CC6677") B <- plot_amplicons(ag, colors) ## adjacency matrix B1 <- as(B, "graphAM") am <- B1@adjMat net <- network(am, directed=FALSE) chroms <- sapply(strsplit(colnames(am), ":"), "[", 1) ar <- ampliconRanges(ag) data(transcripts, package="svfilters.hg19") hits <- findOverlaps(ar, transcripts, maxgap=5000) cancer.con <- split(transcripts$cancer_connection[subjectHits(hits)], queryHits(hits)) is.driver <- sapply(cancer.con, any) is.driver2 <- rep(FALSE, ncol(am)) is.driver2[as.integer(names(is.driver))] <- is.driver net %v% "chrom" <- chroms net %v% "driver" <- is.driver2 B <- ggnet2(net, color="chrom", palette="Dark2", shape="driver", size="degree") + ##legend.size=1) + guides(size=FALSE, shape=FALSE, color=guide_legend(title="", override.aes=list(size=5))) print(B)
The function findCandidates2
identifies clusters of reads belonging to improper pairs and links these clusters using the mate-pair information of the reads. Parameters for identifying improperly paired read clusters are specified in the RearrangementParams
function. With 30x coverage and a clonal ovarian cancer cell line, we set this value to 5. For the identification of likely somatic variants without a matched normal, we required the read pairs to be at least 10kb apart with respect to the reference genome. Using these parameters, we identify two candidate rearrangements. The data supporting these rearrangements are encapsulated in a RearrangementList
object called rlist
.
rparam <- RearrangementParams(min_number_tags_per_cluster=5, rp_separation=10e3) minNumberTagsPerCluster(rparam) rpSeparation(rparam) %>% "/"(1000) %>% paste0("kb") rlist <- findCandidates2(pdata, rparam) rlist
Each improperly paired read flanking a candidate rearrangement is typed according to the orientation and spacing of the paired reads and the type of rearrangement (inversion, translocation, etc.) is classified according to the modal type. To extract all the improperly paired reads supporting the first rearrangement, we use the function improper
, and the "[[" method defined for subsetting a RearrangementList
:
improper(rlist[[1]])
The rearranged read pair clusters link potentially distant regions within a chromosome (intra-chromosomal) or between chromosomes (inter-chromosomal). The regions linked in the cancer genome to create the novel adjacency can be accessed by the linkedBins
accessor:
linkedBins(rlist)
The linked.to
value in the mcols
of the linked bins is also a GRanges
object.
Below, we use helper functions reduceGenomeFilters
and rFilters
to assemble data needed to exclude possible germline rearrangements as well as somatic copy number alterations identified above. The GRangesList
of filters includes somatic deletions and amplifications identified from the read-depth analyses, outliers in read depth coverage, germline CNVs, and germline rearrangements. For rearrangements, the two regions that are joined in the cancer genome to create a novel adjacency are stored in a GRanges
object. The outliers, germline CNVs, and germline rearrangements were identified from a set of 10 lymphoblastoid cell lines and are provided by the svfilters.hg19
package in the germline_filters
object.
data(germline_filters, package="svfilters.hg19") genome_filters <- reduceGenomeFilters(germline_filters, seqlevels(bins1kb)) rf <- rFilters(amplicons=ampliconRanges(ag), deletions=variant(deletions), rear=germline_rear, germline=genome_filters) rdat <- rearrangementData(rlist=rlist, read_pairs=list(improper=improper_rp), filters=rf) rlist <- filterRear(rdat, rparam)
The rearangements identified thus far are required to have a physical separation of at least rp_separation
with respect to the reference genome, must be supported by at least min_number_tags_per_cluster
paired reads, at least prop_modal_rearrangement
of the read pairs must be consistent with the modal rearrangement type, and the size of the read clusters must be at least min_cluster_size
basepairs. In this section, we realign all improperly paired reads supporting a candidate rearrangement (mapped-mapped) with BLAT and extract unmapped reads with aligned mate near candidate sequence junctions (mapped-unmapped) that may correspond to split reads. Since split reads contain the actual sequence junction, these reads can usually determine the exact breakpoint within a few basepairs.
For each rearrangement identified in the above analyses, we have saved the set of improperly paired reads supporting the new sequence junction. Because both reads belonging to the improper pairs have been mapped to the reference genome, this section describes the realignment of 'mapped-mapped' pairs. In the following code chunk, we extract the tag sequence of all improperly paired reads in a RearrangementList
and write these sequences to file in fasta format. Note, the MAX
argument to getSequenceOfReads
below indicates the maximum number of sequences for a specific rearrangement to extract from the BAM file. For example, the first rearrangement in the rlist
object had r length(improper(rlist[[1]]))
improperly paired reads spanning the sequence junction. Setting the MAX
parameter to 25 (MAX=25
), only 25 of the 50 read pairs are randomly selected for re-alignment by BLAT. Again, we suggest setting a seed for reproducibility.
set.seed(123) tags <- getSequenceOfReads(rlist, bamfile, MAX=25L, build = "hg19") dir <- tempdir() fa.file <- file.path(dir, paste0(basename(bamfile), ".fa")) writeToFasta(tags, fa.file)
The unevaluated code below is a system
call to run the command-line version of BLAT. In addition to requiring installation of BLAT, we must also have a copy of the appropriate reference genome available.
blatpath <- "~/bin/x86_64/blat" refgenome <- "~/Dropbox/reference_genome/hg19.fa" outfile <- tempfile() cmd <- paste(blatpath, refgenome, fa.file, outfile) system(cmd) file.copy(outfile, "../../svalignments/inst/extdata/blat_alignment.txt")
Here, we read the previously saved BLAT alignments for this data. We require that each read have only one near perfect match in the genome (score > 90) and, if there is a near-perfect match, the near-perfect match must be consistent with the whole genome aligner (Figure \@ref("blatscores")).
extdata <- system.file("extdata", package="svbams") blat.file <- file.path(extdata, "blat_alignment.txt") blat_aln <- readBlat(blat.file) records <- annotateBlatRecords(blat_aln, tags) blatScores(records, tags, id="CGOV44T", min.tags=5, prop.pass=0.8)
ggplot(records, aes(Qname, match)) + geom_jitter(width=0.05, aes(color=is_overlap), size=0.5) + geom_hline(yintercept=90, linetype="dashed") + scale_color_manual(values=c("gray", "black")) + facet_wrap(~rid, ncol=1, scales="free_x") + theme(axis.text.x=element_text(angle=90, size=5), panel.background=element_rect(fill="white")) + guides(color=guide_legend(title="Concordance of BLAT\nand original algnment")) + ylab("BLAT score")
In addition to using BLAT to confirm the whole genome aligner for improperly paired reads supporting a rearrangement, we also use BLAT to identify split read alignments that directly span the sequence junction. Here, we query the BAM file for all read pairs in which a read was mapped near the putative rearrangement but its mate was not aligned (mapped-unmapped). If the unmapped mate spans the sequence junction, we will assess whether BLAT aligns a subsequence of the read to one cluster and the complement of the subsequence to the second cluster. Such reads further improve the specificity of the rearrangement and establish basepair resolution of the sequence junction. First, we construct a GRanges
object of all the linked read clusters identified in our rearrangement analyses. This GRanges
object will be used to query the BAM file for read pairs in which only one read of a pair was aligned near the putative rearrangement. Again, we write the sequence of the unmapped reads to a fasta file and call BLAT by a system call.
query <- uncouple(linkedBins(rlist)) unmapped <- unmapped_read(bamfile, query, yield_size=200000) length(unmapped) dir <- tempdir() mapped_unmapped.fa <- file.path(dir, "mapped-unmapped.fa") writeUnmappedToFasta(unmapped, mapped_unmapped.fa)
The object unmapped
is a GRanges
object of 87 reads that were not mapped by ELAND to the reference genome and that have a mate mapped to one of the intervals in query
. Again, we use a system call in the following unevaluated code chunk to realign the unmapped reads:
outfile <- tempfile() cmd <- paste(blatpath, refgenome, mapped_unmapped.fa, outfile) system(cmd) file.copy(outfile, "../../svalignments/inst/extdata/blat_unmapped.txt")
TODO: create a script in data-raw/ that creates the above blat_unmapped.txt object.
unmap.file <- file.path(extdata, "blat_unmapped.txt") blat_unmap <- readBlat(unmap.file)
Recall that our QC analysis of the mapped-mapped read pairs focused on whether the original whole genome alignment was the only high-scoring BLAT alignment. Here, our goal is to assess whether BLAT splits the alignment of reads that are initially unmapped by the whole genome aligner and have a mate that is aligned to the approximate location of the rearrangement. The locations in the first rearrangement (with respect to the reference genome) that we think are joined in the somatic genome are given by the linked bins of the RearrangementList
object. We want to assess whether BLAT aligns part of a subsequence of a read to the interval given by
granges(linkedBins(rlist)[1])
and the complement of the subsequence to the interval given by
linkedTo(rlist)[1]
We use the function rearrangedReads
to identify BLAT records that correspond to split reads supporting a rearrangement. Depending on the size of the sequenced DNA fragments, the improperly paired reads can only approximate the location of a sequence junction (likely to within 100 basepairs). Finally, we check that each rearrangement is supported by one or more reads that map to both sides of the sequence junction using the function is_valid_splits
and we assign the split reads to the RearrangementList
object using the splitReads<-
method.
split_reads <- rearrangedReads(linkedBins(rlist), blat_unmap, 500) elementNROWS(split_reads) split_reads splitReads(rlist) <- split_reads is_valid <- is_valid_splits(rlist, maxgap=50) at_least_one <- lengths(splitReads(rlist)) >= 1 rlist2 <- rlist[ is_valid & at_least_one ]
n.split <- elementNROWS(splitReads(rlist2))
The above BLAT analysis identifies r n.split[1]
and r n.split[2]
split reads for the two rearrangements named 1-2
and 3-4
, respectively. Each of these rearrangements have two possible 5-prime to 3-prime orientations. The function fiveTo3List
places the linked bins in their 5-prime to 3-prime orientation. Note, in the code below that rlist2
is now twice the length of the original rlist
object as each rearrangement has been placed in two possible orientations. Because some of the split reads may be filtered when evaluating the orientation of the sequence junction, we again filter rearrangements that are not supported by at least one split read.
rlist3 <- fiveTo3List(rlist2, build="hg19") rlist3 <- rlist3[ is_valid_splits(rlist3, maxgap=50) ] rlist3
To visualize the improperly paired reads and the split reads supporting a rearrangement, we first collect the supporting reads in a tibble
. The function rearDataFrameList
takes a RearrangementList
as input and extracts the supporting reads belonging to the first two elements of the list that correspond to the two possible 5-prime to 3-prime orientations. Next, we use ggRearrange
, a wrapper to ggplot
, to visualize the supporting reads. Because sequence junctions do not overlap a transcript, we have arbitarily labeled the two genomic regions that are joined in the somatic genome as noncoding1
and noncoding2
.
df <- rearDataFrame(rlist3[[1]], build="hg19") grobs <- ggRearrange(df) grobs[["arranged.grobs"]]
The second rearrangement in the original rlist
object now corresponds to elements 3 and 4 of the rlist3
object. Again, we call rearDataFrame
and ggRearrange
to organize and then plot the supporting reads.
df2 <- rearDataFrame(rlist3[[3]], build="hg19") ggRearrange(df2)[["arranged.grobs"]]
We begin by selecting only the rearrangements in which both ends of a sequence junction are within 5kb of a known transcript.
near.coding <- seqJunctionNearTx(rlist=rlist3, build='hg19') rlist4 <- rlist3[ near.coding ] length(rlist4)
Next, we make the definition of the sequence junctions more precise.
jxns <- seqJunctions_Rlist(rlist4) jxns
In addition, we require the 3-prime genomic region of the junction to lie strictly within a transcript, while the 5-prime genomic region forming the junction can occur either in the promoter or within the 5-prime transcript. Operationally, we define the promoter as the genomic region 5kb upstream of the transcription start site of the 5-prime gene. We refer to the remaining junctions as coding_jxns
. As the coding junctions are named by the rearrangement ids, we can use these names to subset the rlist
object, excluding those rearrangements that do not meet the above criteria.
coding_jxns <- codingJunctions(jxns, "hg19") coding_jxns rlist4 <- rlist4[ names(coding_jxns) ]
Next, we use the fuseCDS_Rlist
function to extract for each rearrangement the full CDS of the fused sequence in the somatic genome (fusions
), the partial CDS of the 5-prime (tum.5p
) and 3-prime (tum.3p
) transcripts that are involved in the fusion, and the full CDS of the 5-prime (ref.5p
) and 3-prime transcripts in the reference genome. Note, we use a single bracket [
when subsetting the RearrangementList
so that the resulting object is still an instance of RearrangementList
(albeit a length-one list).
cds.list <- fuseCDS_Rlist(rlist4, coding_jxns) cds.list
For each fusion, we translate the CDS in each of the three possible reading frames and partition the amino acid sequence of the fusion protein into the sequences derived from the 5-prime and 3-prime transcripts. By partioning the amino acid sequence of the fusion into its 5-prime and 3-prime parts, we can compare the 5-prime partition to the amino acid sequence of the 5-prime reference protein and the 3-prime partition to the amino acid sequence of the 3-prime reference protein. The function translateCDS
translates the CDS in the 3 possible reading frames.
proteins <- translateCDS(cds.list) proteins
The amino acid sequence obtained by translating the CDS involved in the 5-prime and 3-prime transcripts are represented as AAStringSet
objects. Since a single gene can have multiple transcripts, we translate every possible combination of 5-prime and 3-prime transcripts in each of the 3 frames. Here, we show the amino acid sequences for the two possible combinations of transcripts involving TLN2 and TPM1:
proteins$fusion[[1]]
We can also list the amino acid sequences derived from the 5-prime gene and the 3-prime gene:
proteins$tum5p[[1]] proteins$tum3p[[1]]
The function partitionAASequence
reorganizes the list of amino acid sequences by frame to facilitate downstream computation.
partition.fusion <- partitionAASequence(proteins) partition.fusion
For a fusion to be in-frame, we require that the amino acid sequence derived from the 5- and 3-prime transcripts to be subsequences of the full amino acid sequence of the reference genome that was translated in the same frame. In addition, we require that there be no premature stop codons. The following code accomplishes these two tasks, with the ref.frames
object below containing the full amino acid sequence of the transcripts associated with IKAROS and ERBB4.
nostop.list <- noPrematureStop(partition.fusion) nostop.list ref.frames <- organizeReferenceByFrame(proteins) inframe.list <- inFrameList(fusion.frames=partition.fusion, ref.frames=ref.frames) inframe.list
Each element of the nostop.list
and inframe.list
lists are an object of class LogicalList
, each evaluated from a different reading frame. The LogicalList
objects are named by the rearrangement identifier. The helper function inFrameNoStop
combines these to lists to create a single list that is TRUE
for rearrangements that are in-frame and have no premature stop codons.
inframe.nostop <- inFrameNoStop(nostop.list, inframe.list) inframe.nostop
Finally, validFusions
returns a list object containing only the fusions that are in-frame and have no premature stops. For each such fusion, the amino acid sequence, CDS of the fusion, partial CDS of the 5-prime and 3-prime transcripts, and the full CDS of the reference transcripts are available.
valid.fusions <- validFusions(partition.fusion, cds.list, inframe.nostop, ref.frames) valid.fusions
The above data can be summarized in tabular format using the fusionTable2
function:
tab <- fusionTable2(valid.fusions) tab %>% kable("html", format.args=list(big.mark=",")) %>% kable_styling(bootstrap_options=c("striped", "hover"))
Note, this table provides both the genomic coordinates of the fusion (junction.5prime, junction.3prime), but also the portion of the amino acid sequence that was retained by the fusion (aa.5prime.start, aa.5prime.end, aa.3prime.start, aa.3prime.end). In the following code chunk, we construct amino acid ranges for the genes involved in the fusion and the protein domains for these genes in the reference genome.
tumor_aa_ranges <- aa_granges(tab) extdata2 <- system.file("extdata", package="svbams") up <- readRDS(file.path(extdata2, "uniprot.rds")) up2 <- uniprotFeatures(up, tab, strwrap.width=20) domain_aa_ranges <- GRanges(up2$hugo, IRanges(up2$start, up2$end), chromosome=up2$seqnames, description=up2$description, short.desc=up2$short.desc, aa_len=up2$aa_len) domains <- subsetByOverlaps(domain_aa_ranges, tumor_aa_ranges, type="within") domains
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.