library(BiocStyle)
The concept of mutational signatures was introduced in a series of papers by Ludmil Alexandrov et al. @Alex2013 and [@Alex_CellRep2013]. A computational framework was published [@Alex_package2012] with the purpose to detect a limited number of mutational processes which then describe the whole set of SNVs (single nucleotide variants) in a cohort of cancer samples. The general approach @Alex2013 is as follows:
Of course the whole concept only leads to a reduction in complexity if $l < n$, i.e. if the number of signatures is smaller than the number of features, as indicated in the above figure. Note that the NMF itself solves the above problem for a given number of signatures $l$. The framework of Ludmil Alexandrov et al. @Alex2013 performs not only the NMF decomposition itself, but also identifies the appropriate number of signatures by an iterative procedure.
Another software, the Bioconductor package r Biocpkg("SomaticSignatures")
to
perform analyses of mutational signatures is available [@Gehring_article2015]
which allows the matrix decomposition to be performed by NMF and alternatively
by PCA (principal component analysis). Both methods have in common that they
can be used for discovery, i.e. for the extraction of new signatures.
However, they only work well if the analyzed data set has a minimum size, i.e.
a minimum number of samples and minimum numbers of counts per feature per
sample.
The package YAPSA introduced here is complementary to these existing software packages. It is designed for a supervised analysis of mutational signatures, i.e. an analysis with already known signatures $W$, and with much lower requirements on statistical power of the input data.
In a context where mutational signatures $W$ are already known (because they were decribed and published as in @Alex2013 or they are available in a database as under http://cancer.sanger.ac.uk/cosmic/signatures), we might want to just find the exposures $H$ for these known signatures in the mutational catalogue $V$ of a given cohort. Mathematically, this is a different and potentially simpler task.
The YAPSA-package (Yet Another Package for Signature Analysis) presented
here provides the function LCD
(linear combination decomposition)
to perform this task. The advantage of this method is that there are no
constraints on the cohort size, so LCD
can be run for as little as one
sample and thus be used e.g. for signature analysis in personalized oncology.
In contrast to NMF, LCD
is very fast and requires very little computational
resources. The YAPSA package provides additional functions for signature
analysis, e.g. for stratifying the mutational catalogue to determine signature
exposures in different strata, part of which will be discussed later in this
vignette.
In the following, we will denote the columns of $V$ by $V_{(\cdot j)}$, which
corresponds to the mutational catalogue of sample $j$. Analogously we denote
the columns of $H$ by $H_{(\cdot j)}$, which is the exposure vector of sample
$j$. Then LCD
is designed to solve the optimization problem:
(@LCD_formula) $$ \begin{aligned} \min_{H_{(\cdot j)} \in \mathbb{R}^l}||W \cdot H_{(\cdot j)} - V_{(\cdot j)}|| \quad \forall j \in {1...m} \ \textrm{under the constraint of non-negativity:} \quad H_{(ij)} >= 0 \quad \forall i \in {1...l} \quad \forall j \in {1...m} \end{aligned} $$
Remember that $j$ is the index over samples, $m$ is the number of samples,
$i$ is the index over signatures and $l$ is the number of signatures. LCD
uses the function lsei
from the package r CRANpkg("lsei")
[@lsei_package2015]. Note that the optimization
procedure is carried out for every $V_{(\cdot j)}$, i.e. for every column of
$V$ separately. Of course $W$ is constant, i.e. the same for every
$V_{(\cdot j)}$.
This procedure is highly sensitive: as soon as a signature has a contribution
or an exposure in at least one sample of a cohort, it will be reported (within
the floating point precision of the operating system). This might blur the
picture and counteracts the initial purpose of complexity reduction. Therefore
there is a function LCD_complex_cutoff
. This function takes as a second
argument a cutoff (a value between zero and one). In the analysis, it will keep
only those signatures which have a cumulative (over the cohort) normalized
exposure greater than this cutoff. In fact it runs the LCD-procedure twice:
once to find initial exposures, summing over the cohort and excluding the ones
with too low a contribution as described just above, and a second time doing
the analysis only with the signatures left over. Beside the exposures H
corresponding to this reduced set of signatures, the function
LCD_complex_cutoff
also returns the reduced set of signatures itself.
Another R package for the supervised analysis of mutational signatures is
available: r CRANpkg("deconstructSigs")
[@Rosenthal_2016]. One difference
between LCD_complex_cutoff
as described here in YAPSA
and the corresponding
function whichSignatures
in r CRANpkg("deconstructSigs")
is that
LCD_complex_cutoff
accepts different cutoffs and signature-specific cutoffs
(accounting for potentially different detectability of different signatures),
whereas in whichSignatures
in r CRANpkg("deconstructSigs")
a general fixed
cutoff is set to be 0.06. Furthermore, YAPSA has a function for a stratified
analysis as described in the following paragraph.
For some questions it is useful to assign the SNVs detected in the samples of a
cohort to categories. In the following these categories have to be exclusive,
i.e. one SNV must be in one and only one category. The categories will be
called strata in the following and the procedure stratification. The
number of strata will be denoted by $s$. For example one could evaluate whether
an SNV falls into a region of high, intermediate or low mutation density by
applying meaningful cutoffs on intermutation distance. Following the above
convention, there are three strata: high, intermediate and low mutation
density. If we have already performed an analysis of mutational signatures for
the whole mutational catalogue of the cohort, we have identified a set of
signatures of interest and the corresponding exposures. We now could ask the
question if some of the signatures are enriched or depleted in one or the other
stratum, yielding a strata-specific signature enrichment and depletion pattern.
The function SMC
(Stratification of the Mutational Catalogue) solves the
stratified optimization problem:
(@SMC_formula) $$ \begin{aligned} \min_{H_{(\cdot j)}^k \in \mathbb{R}^l}||W \cdot H_{(\cdot j)}^{k} - V_{(\cdot j)}^{k}|| \quad \forall j,k \ \textrm{under the constraint of non-negativity:} \quad H_{(ij)}^{k} >= 0 \quad \forall i,j,k \ \textrm{and the additional contraint:} \quad \sum_{k=1}^{s} H^{k} = H \ \textrm{where H is defined by the optimization:} \quad \min_{H_{(\cdot j)} \in \mathbb{R}^l}||W \cdot H_{(\cdot j)} - V_{(\cdot j)}|| \quad \forall j \ \textrm{also under the constraint of non-negativity:} \quad H_{(ij)} >= 0 \quad \forall i,j \quad \textrm{and} \quad V = \sum_{k=1}^{s} V^{k} \end{aligned} $$
Remember that $j$ is the index over samples, $m$ is the number of samples, $i$ is the index over signatures, $l$ is the number of signatures, $k$ is the index over strata and $s$ is the number of strata. Note that the last two lines of equation (@SMC_formula) correspond to equation (@LCD_formula). The very last part of equation (@SMC_formula) reflects the additivity of the stratified mutational catalogues $V^{k}$ which is due to the fact that by definition the sets of SNVs they were constructed from (i.e. the strata) are exclusive.
The SMC-procedure can also be applied when an NMF analysis has been performed and the exposures $\widetilde{H}$ of this NMF analysis should be used as input and constraint for the SMC. It then solves the task:
(@SMC_NMF_input_formula) $$ \begin{aligned} \min_{H_{(\cdot j)}^k \in \mathbb{R}^l}||W \cdot H_{(\cdot j)}^{k} - V_{(\cdot j)}^{k}|| \quad \forall j,k \ \textrm{under the constraint of non-negativity:} \quad H_{(ij)}^{k} >= 0 \quad \forall i,j,k \ \textrm{and the additional contraint:} \quad \sum_{k=1}^{s} H^{k} = \widetilde{H} \ \end{aligned} $$
Applying SMC
that way, the initial LCD decomposition of the unstratified
mutational catalogue is omitted and it's result replaced by the exposures
extracted by the NMF analysis.
We will now apply some functions of the YAPSA package to Whole Genome Sequencing datasets published in Alexandrov et al. [-@Alex2013] First we have to load this data and get an overview (first subsection). Then we will load data on published signatures (second subsection). Only in the third subsection we will actually start using the YAPSA functions.
library(YAPSA) library(knitr) opts_chunk$set(echo=TRUE) opts_chunk$set(fig.show='asis')
In the following, we will load and get an overview of the data used in the analysis by Alexandrov et al. @Alex2013
data("lymphoma_Nature2013_raw")
This created a data frame with r dim(lymphoma_Nature2013_raw_df)[1]
rows. It
is equivalent to executing the R code
lymphoma_Nature2013_ftp_path <- paste0( "ftp://ftp.sanger.ac.uk/pub/cancer/AlexandrovEtAl/", "somatic_mutation_data/Lymphoma B-cell/", "Lymphoma B-cell_clean_somatic_mutations_", "for_signature_analysis.txt") lymphoma_Nature2013_raw_df <- read.csv(file=lymphoma_Nature2013_ftp_path, header=FALSE,sep="\t")
The format is inspired by the vcf format with one line per called variant. Note that the files provided at that url have no header information, therefore we have to add some. We will also slightly adapt the data structure:
names(lymphoma_Nature2013_raw_df) <- c("PID","TYPE","CHROM","START", "STOP","REF","ALT","FLAG") lymphoma_Nature2013_df <- subset(lymphoma_Nature2013_raw_df,TYPE=="subs", select=c(CHROM,START,REF,ALT,PID)) names(lymphoma_Nature2013_df)[2] <- "POS" kable(head(lymphoma_Nature2013_df))
Here, we have selected only the variants characterized as subs
(those are the
single nucleotide variants we are interested in for the mutational signatures
analysis, small indels are filtered out by this step), so we are left with
r dim(lymphoma_Nature2013_df)[1]
variants or rows. Note that there are r length(unique(lymphoma_Nature2013_df$PID))
different samples:
unique(lymphoma_Nature2013_df$PID)
For convenience later on, we annotate subgroup information to every variant (indirectly through the sample it occurs in). For reasons of simplicity, we also restrict the analysis to the Whole Genome Sequencing (WGS) datasets:
lymphoma_Nature2013_df$SUBGROUP <- "unknown" DLBCL_ind <- grep("^DLBCL.*",lymphoma_Nature2013_df$PID) lymphoma_Nature2013_df$SUBGROUP[DLBCL_ind] <- "DLBCL_other" MMML_ind <- grep("^41[0-9]+$",lymphoma_Nature2013_df$PID) lymphoma_Nature2013_df <- lymphoma_Nature2013_df[MMML_ind,] data(lymphoma_PID) for(my_PID in rownames(lymphoma_PID_df)) { PID_ind <- which(as.character(lymphoma_Nature2013_df$PID)==my_PID) lymphoma_Nature2013_df$SUBGROUP[PID_ind] <- lymphoma_PID_df$subgroup[which(rownames(lymphoma_PID_df)==my_PID)] } lymphoma_Nature2013_df$SUBGROUP <- factor(lymphoma_Nature2013_df$SUBGROUP) unique(lymphoma_Nature2013_df$SUBGROUP)
Rainfall plots provide a quick overview of the mutational load of a sample. To this end we have to compute the intermutational distances. But first we still do some reformatting...
lymphoma_Nature2013_df <- translate_to_hg19(lymphoma_Nature2013_df,"CHROM") lymphoma_Nature2013_df$change <- attribute_nucleotide_exchanges(lymphoma_Nature2013_df) lymphoma_Nature2013_df <- lymphoma_Nature2013_df[order(lymphoma_Nature2013_df$PID, lymphoma_Nature2013_df$CHROM, lymphoma_Nature2013_df$POS),] lymphoma_Nature2013_df <- annotate_intermut_dist_cohort(lymphoma_Nature2013_df, in_PID.field="PID") data("exchange_colour_vector") lymphoma_Nature2013_df$col <- exchange_colour_vector[lymphoma_Nature2013_df$change]
Now we can select one sample and make the rainfall plot. The plot function used
here relies on the package r Biocpkg("gtrellis")
by Zuguang Gu
[@gtrellis2015].
choice_PID <- "4121361" PID_df <- subset(lymphoma_Nature2013_df,PID==choice_PID) trellis_rainfall_plot(PID_df,in_point_size=unit(0.5,"mm"))
This shows a rainfall plot typical for a lymphoma sample with clusters of increased mutation density e.g. at the immunoglobulin loci. \newpage
As stated above, one of the functions in the YAPSA package (LCD
) is
designed to do mutational signatures analysis with known signatures. There are
(at least) two possible sources for signature data: i) the ones published
initially by Alexandrov et al. @Alex2013, and ii) an updated and curated
current set of mutational signatures is maintained by Ludmil Alexandrov at http://cancer.sanger.ac.uk/cosmic/signatures. The following three subsections
describe how you can load the data from these resources. Alternatively, you can
bypass the three following subsections because the signature datasets are also
included in this package:
data(sigs)
However, the curated set of signatures might change in the future, therefore it is recommended to rebuild it from the above mentioned website as described in the following subsections.
We first load the (older) set of signatures as published in Alexandrov et al.
Alex_signatures_path <- paste0("ftp://ftp.sanger.ac.uk/pub/cancer/", "AlexandrovEtAl/signatures.txt") AlexInitialArtif_sig_df <- read.csv(Alex_signatures_path,header=TRUE,sep="\t") kable(AlexInitialArtif_sig_df[c(1:9),c(1:4)])
We will now reformat the data frame:
Alex_rownames <- paste(AlexInitialArtif_sig_df[,1], AlexInitialArtif_sig_df[,2],sep=" ") select_ind <- grep("Signature",names(AlexInitialArtif_sig_df)) AlexInitialArtif_sig_df <- AlexInitialArtif_sig_df[,select_ind] number_of_Alex_sigs <- dim(AlexInitialArtif_sig_df)[2] names(AlexInitialArtif_sig_df) <- gsub("Signature\\.","A", names(AlexInitialArtif_sig_df)) rownames(AlexInitialArtif_sig_df) <- Alex_rownames kable(AlexInitialArtif_sig_df[c(1:9),c(1:6)], caption="Exemplary data from the initial Alexandrov signatures.")
This results in a data frame for signatures, containing r number_of_Alex_sigs
signatures as column vectors. It is worth noting that in the initial
publication, only a subset of these r number_of_Alex_sigs
signatures were
validated by an orthogonal sequencing technology. So we can filter down:
AlexInitialValid_sig_df <- AlexInitialArtif_sig_df[,grep("^A[0-9]+", names(AlexInitialArtif_sig_df))] number_of_Alex_validated_sigs <- dim(AlexInitialValid_sig_df)[2]
We are left with r number_of_Alex_validated_sigs
signatures.
An updated and curated set of mutational signatures is maintained by Ludmil Alexandrov at http://cancer.sanger.ac.uk/cosmic/signatures. We will use this set for the following analysis:
Alex_COSMIC_signatures_path <- paste0("http://cancer.sanger.ac.uk/cancergenome/", "assets/signatures_probabilities.txt") AlexCosmicValid_sig_df <- read.csv(Alex_COSMIC_signatures_path, header=TRUE,sep="\t") Alex_COSMIC_rownames <- paste(AlexCosmicValid_sig_df[,1], AlexCosmicValid_sig_df[,2],sep=" ") COSMIC_select_ind <- grep("Signature",names(AlexCosmicValid_sig_df)) AlexCosmicValid_sig_df <- AlexCosmicValid_sig_df[,COSMIC_select_ind] number_of_Alex_COSMIC_sigs <- dim(AlexCosmicValid_sig_df)[2] names(AlexCosmicValid_sig_df) <- gsub("Signature\\.","AC", names(AlexCosmicValid_sig_df)) rownames(AlexCosmicValid_sig_df) <- Alex_COSMIC_rownames kable(AlexCosmicValid_sig_df[c(1:9),c(1:6)], caption="Exemplary data from the updated Alexandrov signatures.")
This results in a data frame containing r number_of_Alex_COSMIC_sigs
signatures as column vectors. For reasons of convenience and comparability with
the initial signatures, we reorder the features. To this end, we adhere to the
convention chosen in the initial publication by Alexandrov et al. @Alex2013
for the initial signatures.
COSMIC_order_ind <- match(Alex_rownames,Alex_COSMIC_rownames) AlexCosmicValid_sig_df <- AlexCosmicValid_sig_df[COSMIC_order_ind,] kable(AlexCosmicValid_sig_df[c(1:9),c(1:6)], caption=paste0("Exemplary data from the updated Alexandrov ", "signatures, rows reordered."))
Note that the order of the features, i.e. nucleotide exchanges in their trinucleotide content, is changed from the fifth line on as indicated by the row names.
For every set of signatures, the functions in the YAPSA package require an
additional data frame containing meta information about the signatures. In that
data frame you can specify the order in which the signatures are going to be
plotted and the colours asserted to the different signatures. In the following
subsection we will set up such a data frame. However, the respective data
frames are also stored in the package. If loaded by data(sigs)
the following
code block can be bypassed.
signature_colour_vector <- c("darkgreen","green","pink","goldenrod", "lightblue","blue","orangered","yellow", "orange","brown","purple","red", "darkblue","magenta","maroon","yellowgreen", "violet","lightgreen","sienna4","deeppink", "darkorchid","seagreen","grey10","grey30", "grey50","grey70","grey90") bio_process_vector <- c("spontaneous deamination","spontaneous deamination", "APOBEC","BRCA1_2","Smoking","unknown", "defect DNA MMR","UV light exposure","unknown", "IG hypermutation","POL E mutations","temozolomide", "unknown","APOBEC","unknown","unknown","unknown", "unknown","unknown","unknown","unknown","unknown", "nonvalidated","nonvalidated","nonvalidated", "nonvalidated","nonvalidated") AlexInitialArtif_sigInd_df <- data.frame(sig=colnames(AlexInitialArtif_sig_df)) AlexInitialArtif_sigInd_df$index <- seq_len(dim(AlexInitialArtif_sigInd_df)[1]) AlexInitialArtif_sigInd_df$colour <- signature_colour_vector AlexInitialArtif_sigInd_df$process <- bio_process_vector COSMIC_signature_colour_vector <- c("green","pink","goldenrod", "lightblue","blue","orangered","yellow", "orange","brown","purple","red", "darkblue","magenta","maroon", "yellowgreen","violet","lightgreen", "sienna4","deeppink","darkorchid", "seagreen","grey","darkgrey", "black","yellow4","coral2","chocolate2", "navyblue","plum","springgreen") COSMIC_bio_process_vector <- c("spontaneous deamination","APOBEC", "defect DNA DSB repair hom. recomb.", "tobacco mutatgens, benzo(a)pyrene", "unknown", "defect DNA MMR, found in MSI tumors", "UV light exposure","unknown","POL eta and SHM", "altered POL E", "alkylating agents, temozolomide", "unknown","APOBEC","unknown", "defect DNA MMR","unknown","unknown", "unknown","unknown", "associated w. small indels at repeats", "unknown","aristocholic acid","unknown", "aflatoxin","unknown","defect DNA MMR", "unknown","unknown","tobacco chewing","unknown") AlexCosmicValid_sigInd_df <- data.frame(sig=colnames(AlexCosmicValid_sig_df)) AlexCosmicValid_sigInd_df$index <- seq_len(dim(AlexCosmicValid_sigInd_df)[1]) AlexCosmicValid_sigInd_df$colour <- COSMIC_signature_colour_vector AlexCosmicValid_sigInd_df$process <- COSMIC_bio_process_vector
Now we can start using the functions from the YAPSA package. We will start with
a mutational signatures analysis using known signatures (the ones we loaded in
the above paragraph). For this, we will use the functions LCD
and
LCD_complex_cutoff
.
This section uses functions which are to a large extent wrappers for functions in the package SomaticSignatures by Julian Gehring [@Gehring_article2015].
library(BSgenome.Hsapiens.UCSC.hg19)
word_length <- 3 lymphomaNature2013_mutCat_list <- create_mutation_catalogue_from_df( lymphoma_Nature2013_df, this_seqnames.field = "CHROM", this_start.field = "POS", this_end.field = "POS", this_PID.field = "PID", this_subgroup.field = "SUBGROUP", this_refGenome = BSgenome.Hsapiens.UCSC.hg19, this_wordLength = word_length)
The function create_mutation_catalogue_from_df
returns a list object with
several entries. We will use the one called matrix
.
names(lymphomaNature2013_mutCat_list) lymphomaNature2013_mutCat_df <- as.data.frame( lymphomaNature2013_mutCat_list$matrix) kable(lymphomaNature2013_mutCat_df[c(1:9),c(5:10)])
\newpage
The LCD
function performs the decomposition of a mutational catalogue into a
priori known signatures and the respective exposures to these signatures as
described in the second section of this vignette. We use the "new" signatures
from the COSMIC website.
current_sig_df <- AlexCosmicValid_sig_df current_sigInd_df <- AlexCosmicValid_sigInd_df lymphomaNature2013_COSMICExposures_df <- LCD(lymphomaNature2013_mutCat_df,current_sig_df)
Some adaptation (extracting and reformatting the information which sample belongs to which subgroup):
COSMIC_subgroups_df <- make_subgroups_df(lymphoma_Nature2013_df, lymphomaNature2013_COSMICExposures_df)
The resulting signature exposures can be plotted using custom plotting functions. First as absolute exposures:
exposures_barplot( in_exposures_df = lymphomaNature2013_COSMICExposures_df, in_subgroups_df = COSMIC_subgroups_df)
Here, as no colour information was given to the plotting function
exposures_barplot
, the identified signatures are coloured in a rainbow
palette. If you want to assign colours to the signatures, this is possible via
a data structure of type sigInd_df
.
exposures_barplot( in_exposures_df = lymphomaNature2013_COSMICExposures_df, in_signatures_ind_df = current_sigInd_df, in_subgroups_df = COSMIC_subgroups_df)
This figure has a colour coding which suits our needs, but there is one slight
inconsistency: colour codes are assigned to all r dim(current_sig_df)[2]
provided signatures, even though some of them might not have any contributions
in this cohort:
rowSums(lymphomaNature2013_COSMICExposures_df)
This can be overcome by using LCD_complex_cutoff
. Is requires an additional
parameter: in_cutoff_vector
; this is already the more general framework which
will be explained in more detail in the following section.
zero_cutoff_vector <- rep(0,dim(current_sig_df)[2]) CosmicValid_cutoffZero_LCDlist <- LCD_complex_cutoff( in_mutation_catalogue_df = lymphomaNature2013_mutCat_df, in_signatures_df = current_sig_df, in_cutoff_vector = zero_cutoff_vector, in_sig_ind_df = current_sigInd_df)
We can re-create the subgroup information (even though this is identical to the already determined one):
COSMIC_subgroups_df <- make_subgroups_df(lymphoma_Nature2013_df, CosmicValid_cutoffZero_LCDlist$exposures)
And if we plot this, we obtain:
exposures_barplot( in_exposures_df = CosmicValid_cutoffZero_LCDlist$exposures, in_signatures_ind_df = CosmicValid_cutoffZero_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df)
Note that this time, only the
r dim(CosmicValid_cutoffZero_LCDlist$exposures)[1]
signatures which actually
have a contribution to this cohort are displayed in
the legend.
Of course, also relative exposures may be plotted:
exposures_barplot( in_exposures_df = CosmicValid_cutoffZero_LCDlist$norm_exposures, in_signatures_ind_df = CosmicValid_cutoffZero_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df)
Now let's rerun the analysis with a cutoff to discard signatures with insufficient cohort-wide contribution.
my_cutoff <- 0.06
The cutoff of r my_cutoff
means that a signature is kept if it's exposure
represents at least r my_cutoff*100
% of all SNVs in the cohort. We will use
the function LCD_complex_cutoff
instead of LCD
.
general_cutoff_vector <- rep(my_cutoff,dim(current_sig_df)[2]) CosmicValid_cutoffGen_LCDlist <- LCD_complex_cutoff( in_mutation_catalogue_df = lymphomaNature2013_mutCat_df, in_signatures_df = current_sig_df, in_cutoff_vector = general_cutoff_vector, in_sig_ind_df = current_sigInd_df)
At the chosen cutoff of r my_cutoff
, we are left with
r dim(CosmicValid_cutoffGen_LCDlist$out_sig_ind_df)[1]
signatures. We can
look at these signatures in detail and their attributed biological processes:
kable(CosmicValid_cutoffGen_LCDlist$out_sig_ind_df, row.names=FALSE, caption=paste0("Signatures with cohort-wide exposures > ",my_cutoff))
Again we can plot absolute exposures:
exposures_barplot( in_exposures_df = CosmicValid_cutoffGen_LCDlist$exposures, in_signatures_ind_df = CosmicValid_cutoffGen_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df)
And relative exposures:
exposures_barplot( in_exposures_df = CosmicValid_cutoffGen_LCDlist$norm_exposures, in_signatures_ind_df = CosmicValid_cutoffGen_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df)
When using LCD_complex_cutoff
, we have to supply a vector of cutoffs with as
many entries as there are signatures. In the analysis carried out above, these
were all equal, but this is not a necessary condition. Indeed it may make sense
to provide different cutoffs for different signatures.
specific_cutoff_vector <- general_cutoff_vector specific_cutoff_vector[c(1,5)] <- 0 specific_cutoff_vector
In this example, the cutoff for signatures AC1 and AC5 is thus set to 0,
whereas the cutoffs for all other signatures remains at 0.06. Running the
function LCD_complex_cutoff
is completely analogous:
CosmicValid_cutoffSpec_LCDlist <- LCD_complex_cutoff( in_mutation_catalogue_df = lymphomaNature2013_mutCat_df, in_signatures_df = current_sig_df, in_cutoff_vector = specific_cutoff_vector, in_sig_ind_df = current_sigInd_df)
Plotting absolute exposures for visualization:
exposures_barplot( in_exposures_df = CosmicValid_cutoffSpec_LCDlist$exposures, in_signatures_ind_df = CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df)
And relative exposures:
exposures_barplot( in_exposures_df = CosmicValid_cutoffSpec_LCDlist$norm_exposures, in_signatures_ind_df = CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df, in_subgroups_df = COSMIC_subgroups_df)
Note that the signatures extracted with the signature-specific cutoffs are the same in the example displayed here. Depending on the analyzed cohort and the choice of cutoffs, the extracted signatures may vary considerably.
To identify groups of samples which were exposed to similar mutational
processes, the exposure vectors of the samples can be compared. The YAPSA
package provides a custom function for this task: complex_heatmap_exposures
,
which uses the package r Biocpkg("ComplexHeatmap")
by Zuguang Gu
[@ComplexHeatmap2015]. It produces output as follows:
complex_heatmap_exposures(CosmicValid_cutoffGen_LCDlist$norm_exposures, COSMIC_subgroups_df, CosmicValid_cutoffGen_LCDlist$out_sig_ind_df, in_data_type="norm exposures", in_subgroup_colour_column="col", in_method="manhattan", in_subgroup_column="subgroup")
If you are interested only in the clustering and not in the heatmap
information, you could also use hclust_exposures
:
hclust_list <- hclust_exposures(CosmicValid_cutoffGen_LCDlist$norm_exposures, COSMIC_subgroups_df, in_method="manhattan", in_subgroup_column="subgroup")
The dendrogram produced by either the function complex_heatmap_exposures
or
the function hclust_exposures
can be cut to yield signature exposure specific
subgroups of the PIDs.
cluster_vector <- cutree(hclust_list$hclust,k=4) COSMIC_subgroups_df$cluster <- cluster_vector subgroup_colour_vector <- rainbow(length(unique(COSMIC_subgroups_df$cluster))) COSMIC_subgroups_df$cluster_col <- subgroup_colour_vector[factor(COSMIC_subgroups_df$cluster)] complex_heatmap_exposures(CosmicValid_cutoffGen_LCDlist$norm_exposures, COSMIC_subgroups_df, CosmicValid_cutoffGen_LCDlist$out_sig_ind_df, in_data_type="norm exposures", in_subgroup_colour_column="cluster_col", in_method="manhattan", in_subgroup_column="cluster")
We will now use the intermutational distances computed above. We set cutoffs for the intermutational distance at 1000 and 100000 bp, leading to three strata. We annotate to every variant in which stratum it falls.
lymphoma_Nature2013_df$density_cat <- cut(lymphoma_Nature2013_df$dist, c(0,1001,100001,Inf), right=FALSE, labels=c("high","intermediate", "background"))
The following table shows the distribution of variants over strata:
temp_df <- data.frame(table(lymphoma_Nature2013_df$density_cat)) names(temp_df) <- c("Stratum","Cohort-wide counts") kable(temp_df, caption=paste0("Strata for the SMC of mutation density"))
We now have everything at hand to carry out a stratified signature analysis:
strata_order_ind <- c(1,3,2) mut_density_list <- run_SMC(lymphoma_Nature2013_df, CosmicValid_cutoffGen_LCDlist$signatures, CosmicValid_cutoffGen_LCDlist$out_sig_ind_df, COSMIC_subgroups_df, column_name="density_cat", refGenome=BSgenome.Hsapiens.UCSC.hg19, cohort_method_flag="norm_PIDs", in_strata_order_ind=strata_order_ind)
This produces a multi-panel figure with
r length(unique(lymphoma_Nature2013_df$density_cat))+1
rows of plots. The
first row visualizes the signature distribution over the whole cohort without
stratification, followed by one row of plots per stratum. Hence in our example
we have four rows of graphs with three (exclusive) strata as input. Each row
consists of three plots. The left plots show absolute exposures in the
respective stratum as stacked barplots on a per sample basis. The middle plots
show relative exposures in the respective stratum on a per sample basis as
stacked barplots. The right plots shows cohort-wide averages of the relative
exposures in the respective stratum. The error bars indicate the standard error
of the mean (SEM).
To test for statistical significance of potential differences in the signature exposures (i.e. signature enrichment and depletion patterns) between the different strata, we can use the Kruskal-Wallis test, as the data is grouped into (potentially more than two) categories and might not follow a normal distribution. As we are testing the different signatures on the same stratification, we have to correct for multiple testing. In order to control the false discovery rate (FDR), the Benjamini-Hochberg correction is appropriate.
stat_mut_density_list <- stat_test_SMC(mut_density_list,in_flag="norm") kable(stat_mut_density_list$kruskal_df, caption=paste0("Results of Kruskal tests for cohort-wide exposures over", " strata per signature without and with correction for ", "multiple testing."))
In the following paragraph we perform post-hoc tests for those signatures where the Kruskal-Wallis test, as evaluated above, has given a significant result. \newpage
significance_level <- 0.05 for(i in seq_len(dim(stat_mut_density_list$kruskal_df)[1])){ if(stat_mut_density_list$kruskal_df$Kruskal_p_val_BH[i]<significance_level){ print(paste0("Signature: ",rownames(stat_mut_density_list$kruskal_df)[i])) print(stat_mut_density_list$kruskal_posthoc_list[[i]]) } }
From this analysis, we can see that a distinct signature enrichment and depletion pattern emerges:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.