knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, out.width = "80%", fig.align = "center", crop = NULL # suppress "The magick package is required to crop" issue ) library(BiocStyle)
r Biocpkg("monaLisa")
is a collection of functions for working with
biological sequences
and motifs that represent the binding preferences of transcription factors or
nucleic acid binding proteins.
For example, r Biocpkg("monaLisa")
can be used to conveniently find motif
hits in sequences (see section \@ref(findhits)), or to identify motifs that
are likely associated with observed experimental data. Such analyses are
supposed to provide potential answers to the question "Which transcription
factors are the drivers of my observed changes in
expression/methylation/accessibility?".
Several other approaches have been described that also address this problem,
among them REDUCE [@reduce], AME [@ame] and ISMARA [@ismara]. In
r Biocpkg("monaLisa")
, we aim to provide a flexible implementation that
integrates well with other Bioconductor resources, makes use of the sequence
composition correction developed for Homer [@homer] or stability selection
[@StabSel] and provides several alternative ways to study the relationship
between experimental measurements and sequence motifs.
You can use known motifs from collections of transcription factor binding
specificities such as r Biocpkg("JASPAR2020")
, also available from
Bioconductor. Genomic regions could be for example promoters, enhancers or
accessible regions for which experimental data is available.
Two independent approaches are implemented to identify interesting motifs:
In the binned motif enrichment analysis (calcBinnedMotifEnrR
, see section
\@ref(binned)), genomic regions are grouped into bins according to a numerical
value assigned to each region, such as the change in expression, accessibility
or methylation. Motif enrichments are then calculated for each bin, normalizing
for differences in sequence composition in a very similar way as originally done
by Homer [@homer]. As a special case,
the approach can also be used to do a simple two set comparison (foreground
against background sequences, see section \@ref(binary)) or to determine motif
enrichments in a single set of sequences compared to a suitably matched genomic
background set (see section \@ref(vsgenome)). The binned motif enrichment
approach was first introduced in @ginno2018 and subsequently applied in
e.g. @barisic2019. To see more details on how calcBinnedMotifEnrR
resembles
Homer
, check the function help page. We recommend using this function to do
the binned motif enrichment analysis, since it corrects for sequence composition
differences similarly to Homer
, but is implemented more efficiently.
calcBinnedMotifEnrHomer
implements the same analysis using Homer and therefore
requires a local installation of
Homer, and
calcBinnedKmerEnr
(see section \@ref(binnedkmers)) implements the analysis for
k-mers instead of motifs, to study sequence enrichments without the requirement
of known motifs.
Randomized Lasso stability selection (randLassoStabSel
, see
r Biocpkg("monaLisa", vignette="selecting_motifs_with_randLassoStabSel.html", label="the stability selection vignette")
in r Biocpkg("monaLisa")
) uses a robust regression approach (stability
selection, @StabSel) to predict what transcription factors can explain
experimental measurements, for example changes in chromatin accessibility
between two conditions. Also this approach allows to correct for sequence
composition. In addition, similar motifs have to "compete" with each other to be
selected.
For both approaches, functions that allow visualization of obtained results are provided.
If you prefer to jump right in, you can continue with section \@ref(quick) that shows a quick hypothetical example of how to run a binned motif enrichment analysis. If you prefer to actually compute enrichments on real data, you can find below a detailed example for a binned motif enrichment analysis (section \@ref(binned)). The special cases of analyzing just two sets of sequences (binary motif enrichment analysis) or a single set of sequences (comparing it to a suitable background sampled from the genome) are illustrated in section \@ref(nobins).
r Biocpkg("monaLisa")
can be installed from Bioconductor via the
r CRANpkg("BiocManager")
package:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("monaLisa")
The quick example below, which we do not run, illustrates how a binned motif
enrichment analysis can be performed in r Biocpkg("monaLisa")
. We assume that
you already have a set of peaks. The sequences of the peak regions are stored in
a Biostrings::DNAStringSet
object (peak_seqs
), and additionally each peak is
associated with a numeric value (e.g., the change of methylation between two
conditions, stored in the peak_change
vector), that will be used to bin the
regions before finding motifs enriched in each bin.
# load package library(monaLisa) # bin regions # - peak_change is a numerical vector # - peak_change needs to be created by the user to run this code peak_bins <- bin(x = peak_change, binmode = "equalN", nElement = 400) # calculate motif enrichments # - peak_seqs is a DNAStringSet, pwms is a PWMatrixList # - peak_seqs and pwms need to be created by the user to run this code se <- calcBinnedMotifEnrR(seqs = peak_seqs, bins = peak_bins, pwmL = pwms)
The returned se
is a SummarizedExperiment
with assays
negLog10P,
negLog10Padj, pearsonResid, expForegroundWgt, log2enr,
sumForegroundWgtWithHits and sumBackgroundWgtWithHits, each containing a
matrix with motifs (rows) by bins (columns). The values are:
background
argument (by default: sequences in
all other bins). p.adjust.method
argument, by default:
Benjamini and Hochberg, 1995 (p.adjust(..., method="fdr")
). pseudocount.log2enr
argument. In addition, rowData(se)
and colData(se)
give information about the used
motifs and bins, respectively. In metadata(se)
you can find information
about parameter values.
This section illustrates the use of r Biocpkg("monaLisa")
to analyze regions
or sequences with associated numerical values (here: changes of DNA
methylation), grouped into several bins according to these values. The
special cases of just two sets of sequences (binary motif enrichment analysis)
or a single set of sequences (comparing it to a suitable background sampled from
the genome) are illustrated in section \@ref(nobins).
This example is based on experimental data from an in vitro differentiation system, in which mouse embryonic stem (ES) cells are differentiated into neuronal progenitors (NP). In an earlier study [@LMRs], we have analyzed the genome-wide CpG methylation patterns in these cell types and identified so called low methylated regions (LMRs), that have reduced methylation levels and correspond to regions bound by transcription factors.
We also developed a tool that systematically identifies such regions from genome-wide methylation data [@MethylSeekR]. Interestingly, a change in methylation of LMRs is indicative of altered transcription factor binding. We will therefore use these regions to identify transcription factor motifs that are enriched or depleted in LMRs that change their methylation between ES and NP cell states.
We start by loading the needed packages:
library(GenomicRanges) library(SummarizedExperiment) library(JASPAR2020) library(TFBSTools) library(BSgenome.Mmusculus.UCSC.mm10) library(monaLisa) library(ComplexHeatmap) library(circlize)
r Biocpkg("monaLisa")
provides a file with genomic coordinates (mouse mm10
assembly) of LMRs, with the respective changes of methylation. We load this
GRanges
object into R
.
lmrfile <- system.file("extdata", "LMRsESNPmerged.gr.rds", package = "monaLisa") lmr <- readRDS(lmrfile) lmr
Alternatively, the user may also start the analysis with genomic regions
contained in a bed
file, or directly with sequences in a FASTA
file.
The following example code illustrates how to do this, but should not be
run if you are following the examples in this vignette.
# starting from a bed file # import as `GRanges` using `rtracklayer::import` # remark: if the bed file also contains scores (5th column), these will be # also be imported and available in the "score" metadata column, # in this example in `lmr$score` lmr <- rtracklayer::import(con = "file.bed", format = "bed") # starting from sequences in a FASTA file # import as `DNAStringSet` using `Biostrings::readDNAStringSet` # remark: contrary to the coordinates in a `GRanges` object like `lmr` above, # the sequences in `lmrseqs` can be directly used as input to # monaLisa::calcBinnedMotifEnrR (no need to extract sequences from # the genome, just skip that step below) lmrseqs <- Biostrings::readDNAStringSet(filepath = "myfile.fa", format = "fasta")
We can see there are r length(lmr)
LMRs, most of which gain methylation
between ES and NP stages:
hist(lmr$deltaMeth, 100, col = "gray", main = "", xlab = "Change of methylation (NP - ES)", ylab = "Number of LMRs")
In order to keep the computation time reasonable, we'll select 10,000 of the LMRs randomly:
set.seed(1) lmrsel <- lmr[ sample(x = length(lmr), size = 10000, replace = FALSE) ]
Now let's bin our LMRs by how much they change methylation, using the bin
function from r Biocpkg("monaLisa")
. We are not interested in small changes of
methylation, say less than 0.3, so we'll use the minAbsX
argument to create a
no-change bin in [-0.3, 0.3). The remaining LMRs are put into bins of 800
each:
bins <- bin(x = lmrsel$deltaMeth, binmode = "equalN", nElement = 800, minAbsX = 0.3) table(bins)
Generally speaking, we recommend a minimum of ~100 sequences per bin as fewer sequences may lead to small motif counts and thus either small or unstable enrichments.
We can see which bin has been set to be the zero bin using getZeroBin
, or set
it to a different bin using setZeroBin
:
# find the index of the level representing the zero bin levels(bins) getZeroBin(bins)
Because of the asymmetry of methylation changes, there is only a single bin with LMRs that lost methylation and many that gained:
plotBinDensity(lmrsel$deltaMeth, bins, legend = "topleft")
Note that the bin breaks around the no-change bin are not exactly -0.3 to 0.3.
They have been adjusted to have the required 800 LMRs per bin below and above
it. r Biocpkg("monaLisa")
will give a warning if the adjusted bin breaks are
strongly deviating from the requested minAbsX
value, and bin(..., model =
"breaks")
can be used in cases where exactly defined bin boundaries are
required.
Next we prepare the motif enrichment analysis. We first need known motifs
representing transcription factor binding site preferences. We extract all
vertebrate motifs from the r Biocpkg("JASPAR2020")
package as positional
weight matrices (PWMs):
pwms <- getMatrixSet(JASPAR2020, opts = list(matrixtype = "PWM", tax_group = "vertebrates"))
Furthermore, we need the sequences corresponding to our LMRs. As sequences in one bin are compared to the sequences in other bins, we would not want differences of sequence lengths or composition between bins that might bias our motif enrichment results.
In general, we would recommend to use regions of similar or even equal lengths
to avoid a length bias, for example by using a fixed-size region around the
midpoint of each region of interest using GenomicRanges::resize
. In addition,
the resized regions may have to be constrained to the chromosome boundaries
using trim:
summary(width(lmrsel)) lmrsel <- trim(resize(lmrsel, width = median(width(lmrsel)), fix = "center")) summary(width(lmrsel))
We can now directly extract the corresponding sequences from the
r Biocpkg("BSgenome.Mmusculus.UCSC.mm10")
package (assuming you have started
the analysis with genomic regions - if you already have sequences, just skip
this step)
lmrseqs <- getSeq(BSgenome.Mmusculus.UCSC.mm10, lmrsel)
and check for differences in sequence composition between bins using
the plotBinDiagnostics
function. "GCfrac" will plot the
distributions of the fraction of G+C bases, and "dinucfreq" creates a
heatmap of average di-nucleotide frequencies in each bin, relative to
the overall average.
plotBinDiagnostics(seqs = lmrseqs, bins = bins, aspect = "GCfrac") plotBinDiagnostics(seqs = lmrseqs, bins = bins, aspect = "dinucfreq")
From these plots, we can see that LMRs with lower methylation in NP cells
compared to ES cells (bin r levels(bins)[1]
) tend to be GC-poorer than
LMRs in other bins. A strong bias of this kind could give rise to false
positives in that bin, e.g. enrichments of AT-rich motifs.
At this point in the analysis, it is difficult to decide if this bias should be
addressed here (for example by subsampling sequences of more comparable GC
composition), or if the bias can be ignored because the built-in sequence
composition correction in calcBinnedMotifEnrR
will be able to
account for it. Our recommendation would be to take a mental note at this point
and remember that sequences in the r levels(bins)[1]
bin tend to be GC-poorer.
Later, we should check if AT-rich motifs are specifically enriched in that bin,
and if that is the case, we should critically assess if that result is robust
and can be reproduced in an analysis that uses more balanced sequences in all
bins, or an analysis with background = "genome"
. The show_motif_GC
and
show_seqlogo
arguments of plotMotifHeatmaps
can help to visually
identify motif sequence composition in an enrichment result (see below).
Finally, we run the binned motif enrichment analysis.
This step will take a while, and typically you would use the BPPARAM
argument to run it with parallelization using n
cores as follows:
calcBinnedMotifEnrR(..., BPPARAM = BiocParallel::MulticoreParam(n)))
.
For this example however, you can skip over the next step and just load the
pre-computed results as shown further below.
se <- calcBinnedMotifEnrR(seqs = lmrseqs, bins = bins, pwmL = pwms)
In case you did not run the above code, let's now read in the results:
se <- readRDS(system.file("extdata", "results.binned_motif_enrichment_LMRs.rds", package = "monaLisa"))
se
is a SummarizedExperiment
object which nicely keeps motifs, bins and
corresponding metadata together:
# summary se dim(se) # motifs-by-bins # motif info rowData(se) head(rownames(se)) # bin info colData(se) head(colnames(se)) # assays: the motif enrichment results assayNames(se) assay(se, "log2enr")[1:5, 1:3]
We can plot the results using the plotMotifHeatmaps
function, e.g. selecting
all transcription factor motifs that have a $-log_{10} FDR$ of at least 4.0 in
any bin (corresponding to an $FDR < 10^{-4}$). FDR values are stored in the
negLog10Padj
assay:
# select strongly enriched motifs sel <- apply(assay(se, "negLog10Padj"), 1, function(x) max(abs(x), 0, na.rm = TRUE)) > 4.0 sum(sel) seSel <- se[sel, ] # plot plotMotifHeatmaps(x = seSel, which.plots = c("log2enr", "negLog10Padj"), width = 2.0, cluster = TRUE, maxEnr = 2, maxSig = 10, show_motif_GC = TRUE)
In order to select only motifs with significant enrichments in a specific bin, or in any bin except the "zero" bin, you could use:
# significantly enriched in bin 8 levels(bins)[8] sel.bin8 <- assay(se, "negLog10Padj")[, 8] > 4.0 sum(sel.bin8, na.rm = TRUE) # significantly enriched in any "non-zero" bin getZeroBin(bins) sel.nonZero <- apply( assay(se, "negLog10Padj")[, -getZeroBin(bins), drop = FALSE], 1, function(x) max(abs(x), 0, na.rm = TRUE)) > 4.0 sum(sel.nonZero)
Setting cluster = TRUE
in plotMotifHeatmaps
has re-ordered the rows using
hierarchical clustering of the pearsonResid
assay. As many transcription
factor binding motifs are similar to each other, it is also helpful to show the
enrichment heatmap clustered by motif similarity. To this end, we first
calculate all pairwise motif similarities (measured as the maximum Pearson
correlation of all possible shifted alignments). This can be quickly calculated
for the few selected motifs using the motifSimilarity
function. For many
motifs, this step may take a while, and it may be useful to parallelize it using
the BPPARAM
argument (e.g. to run on n
parallel threads using the multi-core
backend, you can use: motifSimilarity(..., BPPARAM =
BiocParallel::MulticoreParam(n))
).
SimMatSel <- motifSimilarity(rowData(seSel)$motif.pfm) range(SimMatSel)
The order of the TFs in the resulting matrix is consistent with the elements of
seSel
, and the maximal similarity between any pair of motifs is 1.0. By
subtracting these similarities from 1.0, we obtain distances that we use
to perform a hierarchical clustering with the stats::hclust
function. The
returned object (hcl
) is then passed to the cluster
argument of
plotMotifHeatmaps
to define the order of the rows in the heatmap. The plotting
of the dendrogram is controlled by the argument show_dendrogram
, and we also
display the motifs as sequence logos using show_seqlogo
:
# create hclust object, similarity defined by 1 - Pearson correlation hcl <- hclust(as.dist(1 - SimMatSel), method = "average") plotMotifHeatmaps(x = seSel, which.plots = c("log2enr", "negLog10Padj"), width = 1.8, cluster = hcl, maxEnr = 2, maxSig = 10, show_dendrogram = TRUE, show_seqlogo = TRUE, width.seqlogo = 1.2)
We have seen above that sequences in the r levels(bins)[1]
bin (first column
from the left in the heatmap) were GC-poorer than the sequences in other bins.
While some of the enriched motifs in that bin are not GC-poor (for example RARA,
NR2F1 and similar motifs), other more weakly enriched motifs are clearly AT-rich
(for example HOX family motifs). To verify that these are not false positive
results, the motif analysis should be repeated after sequences have been
subsampled in each bin to have similar GC composition in all bins, or with
calcBinnedMotifEnrR(..., background = "genome")
. The latter is illustrated in
section \@ref(vsgenome).
Homer
and motif objects in R
{#motifConvert}r Biocpkg("monaLisa")
provides two functions for performing binned motif
enrichment analysis (calcBinnedMotifEnrR
and calcBinnedMotifEnrHomer
).
calcBinnedMotifEnrR
implements the binned motif enrichment analysis in R
,
similarly to Homer
, and does not require the user to have the Homer
tool
pre-installed. For more information on that function and how it resembles the
Homer
tool see the function documentation.
A simple way to represent a DNA sequence motif that assumes independence of
positions in the motif is a matrix with four rows (for the bases A, C, G and T)
and n
columns for the n
positions in the motif. The values in that matrix
can represent the sequence preferences of a binding protein in several
different ways:
R
, PFMs are often represented using
TFBSTools::PFMatrix
(single motif) or TFBSTools::PFMatrixList
(set of
motifs) objects. This is the rawest way to represent a sequence motif and can be
converted into any other representation.Homer
. A PPM
can only be converted back to a PFM by knowing or assuming how many binding
site sequences were observed (see argument n
in homerToPFMatrixList
). TFBSTools::toPWM
for details). This is a useful representation for scanning
sequences for motif matches. In R
, PWMs are often represented using
TFBSTools::PWMatrix
(single motif) or TFBSTools::PWMatrixList
(set of
motifs).calcBinnedMotifEnrR
takes PWMs as a TFBSTools::PWMatrixList
object to scan
for motif hits. calcBinnedMotifEnrHomer
on the other hand takes a motif text
file with PPMs, and requires the user to have Homer
installed to use it for
the binned motif enrichment analysis. Here, we show how one can get motif PFMs
from r Biocpkg("JASPAR2020")
and convert them to a Homer
-compatible text
file with PPMs (dumpJaspar
) and vice versa (homerToPFMatrixList
), and how
to convert a TFBSTools::PFMatrixList
to a TFBSTools::PWMatrixList
for use
with calcBinnedMotifEnrR
or findMotifHits
:
# get PFMs from JASPAR2020 package (vertebrate subset) pfms <- getMatrixSet(JASPAR2020, opts = list(matrixtype = "PFM", tax_group = "vertebrates")) # convert PFMs to PWMs pwms <- toPWM(pfms) # convert JASPAR2020 PFMs (vertebrate subset) to Homer motif file tmp <- tempfile() convert <- dumpJaspar(filename = tmp, pkg = "JASPAR2020", pseudocount = 0, opts = list(tax_group = "vertebrates")) # convert Homer motif file to PFMatrixList pfms_ret <- homerToPFMatrixList(filename = tmp, n = 100L) # compare the first PFM # - notice the different magnitude of counts (controlled by `n`) # - notice that with the default (recommended) value of `pseudocount = 1.0`, # there would be no zero values in pfms_ret matrices, making # pfms and pfms_ret even more different as.matrix(pfms[[1]]) as.matrix(pfms_ret[[1]]) # compare position probability matrices with the original PFM round(sweep(x = as.matrix(pfms[[1]]), MARGIN = 2, STATS = colSums(as.matrix(pfms[[1]])), FUN = "/"), 3) round(sweep(x = as.matrix(pfms_ret[[1]]), MARGIN = 2, STATS = colSums(as.matrix(pfms_ret[[1]])), FUN = "/"), 3)
In some cases, we are interested in identifying enriched motifs between just two
sets of sequences (binary motif enrichment), for example between ATAC peaks with
increased and decreased accessibility. Numerical values that could be used for
grouping the regions in multiple bins may not be available. Or we may be
interested in analyzing just a single set of sequences (for example a set of
ChIP-seq peaks), relative to some neutral background. In this section, we show
how such binary or single-set motif enrichment analyses can be performed using
r Biocpkg("monaLisa")
.
The binary motif enrichment analysis is a simple special case of the general binned motif analysis described in section \@ref(binned), where the two sets to be compared are defining the two bins.
Let's re-use the DNA methylation data from section \@ref(binned) and assume that
we just want to compare the sequences that don't show large changes in their
methylation levels (lmr.unchanged
, changes smaller than 5%) to those that
gain more than 60% methylation (lmr.up
):
lmr.unchanged <- lmrsel[abs(lmrsel$deltaMeth) < 0.05] length(lmr.unchanged) lmr.up <- lmrsel[lmrsel$deltaMeth > 0.6] length(lmr.up)
As before, we need a single sequence object (lmrseqs2
, which is a
DNAStringSet
) that we obtain by combining these two groups into a single
GRanges
object (lmrsel2
) and extract the corresponding sequences from the
genome (lmrseqs2
). If you already have two sequence objects, they can be just
concatenated using lmrseqs2 <- c(seqs.group1, seqs.group2)
.
# combine the two sets or genomic regions lmrsel2 <- c(lmr.unchanged, lmr.up) # extract sequences from the genome lmrseqs2 <- getSeq(BSgenome.Mmusculus.UCSC.mm10, lmrsel2)
Finally, we manually create a binning factor (bins2
) that defines the group
membership for each element in lmrseqs2
:
# define binning vector bins2 <- rep(c("unchanged", "up"), c(length(lmr.unchanged), length(lmr.up))) bins2 <- factor(bins2) table(bins2)
Now we can run the binned motif enrichment analysis. To keep the calculation
time short, we will just run it on the motifs that we had selected above in
seSel
:
se2 <- calcBinnedMotifEnrR(seqs = lmrseqs2, bins = bins2, pwmL = pwms[rownames(seSel)]) se2
We visualize the results for motifs that are enriched in one of the
two groups with an adjusted p value of less than $10^{-4}$ (the order of the
columns in the heatmap is defined by the order of the factor levels in bins2
,
given by levels(bins2)
and can also be obtained from colnames(se2)
; here it
is r levels(bins2)
):
sel2 <- apply(assay(se2, "negLog10Padj"), 1, function(x) max(abs(x), 0, na.rm = TRUE)) > 4.0 sum(sel2) plotMotifHeatmaps(x = se2[sel2,], which.plots = c("log2enr", "negLog10Padj"), width = 1.8, cluster = TRUE, maxEnr = 2, maxSig = 10, show_seqlogo = TRUE)
Motif enrichments can also be obtained from a single set of genomic regions or sequences (foreground set), by comparing it to a suitable background set. A suitable background set could be for example sequences with a similar sequence composition that are randomly selected from the same genome, or sequences obtained by randomization of the foreground sequences by shuffling or permutation.
A noteworthy package in this context is r Biocpkg("nullranges")
that focuses
on the selection of such background ranges (representing the null hypothesis),
for example controlling for confounding covariates like GC composition. After
a suitable background set has been identified using r Biocpkg("nullranges")
,
a binary motif enrichment analysis as described in section \@ref(binary) can
be performed. Manually defining the background set is recommended to control for
covariates other than GC composition and to get access to the selected
background sequences, for example to verify if they are indeed similar to the
foreground sequences for those covariates.
A quick alternative with less flexibility in the background set definition is
available directly in r Biocpkg("monaLisa")
, by using
calcBinnedMotifEnrR(..., background = "genome")
. This will select the
background set by randomly sampling sequences from the genome (given by the
genome
argument, optionally restricted to the intervals defined in the
genome.regions
argument). For each foreground sequence, genome.oversample
background sequences of the same size (on average) are sampled. From these, one
per foreground sequence is selected trying to best match its G+C composition.
We apply this simple approach here to check if the motif enrichments identified
in section \@ref(binned) could be in part false positives due to the GC-poor
first bin (r levels(bins)[1]
, see above).
Let's first obtain the sequences from that bin (lmrseqs3
), and then run
calcBinnedMotifEnrR
comparing to a genome background. In order to
make the sampling reproducible, we are seeding the random number generator
inside the BPPARAM
object. Also, to speed up the calculation, we will only
include the motifs we had selected above in seSel
:
lmrseqs3 <- lmrseqs[bins == levels(bins)[1]] length(lmrseqs3) se3 <- calcBinnedMotifEnrR(seqs = lmrseqs3, pwmL = pwms[rownames(seSel)], background = "genome", genome = BSgenome.Mmusculus.UCSC.mm10, genome.regions = NULL, # sample from full genome genome.oversample = 2, BPPARAM = BiocParallel::SerialParam(RNGseed = 42), verbose = TRUE)
Note that we did not have to provide a bins
argument, and that the result will
only have a single column, corresponding to the single set of sequences that we
analyzed:
ncol(se3)
When we visualize motifs that are enriched with an adjusted p value of less than
$10^{-4}$, we still find AT-rich motifs significantly enriched, including the
HOX family motifs that were weakly enriched in seSel
but for which it was
unclear if their enrichment was driven by the AT-rich (GC-poor) sequences in
that specific bin. The fact that this motif family is still robustly identified
when using a GC-matched genomic background supports that it may be a real
biological signal.
sel3 <- assay(se3, "negLog10Padj")[, 1] > 4.0 sum(sel3) plotMotifHeatmaps(x = se3[sel3,], which.plots = c("log2enr", "negLog10Padj"), width = 1.8, maxEnr = 2, maxSig = 10, show_seqlogo = TRUE) # analyzed HOX motifs grep("HOX", rowData(se3)$motif.name, value = TRUE) # significant HOX motifs grep("HOX", rowData(se3)$motif.name[sel3], value = TRUE)
A comparison of log2 motif enrichments between the background = "otherBins"
and background = "genome"
analyses also supports this conclusion: The HOX
family motifs (shown in red) are similarly enriched in both analyses, while the
depletion of GC-rich KLF family motifs (shown in green) is less pronounced in
background = "genome"
and thus more sensitive to the used background. The
depletion of KLF family motifs may thus be an example of an incorrect result,
although note that the depletion was not significant in either of the two
analyses:
cols <- rep("gray", nrow(se3)) cols[grep("HOX", rowData(se3)$motif.name)] <- "#DF536B" cols[grep("KLF|Klf", rowData(se3)$motif.name)] <- "#61D04F" par(mar = c(5, 5, 2, 2) + .1, mgp = c(1.75, 0.5, 0), cex = 1.25) plot(assay(seSel, "log2enr")[,1], assay(se3, "log2enr")[,1], col = cols, pch = 20, asp = 1, xlab = "Versus other bins (log2 enr)", ylab = "Versus genome (log2 enr)") legend("topleft", c("HOX family","KLF family","other"), pch = 20, bty = "n", col = c("#DF536B", "#61D04F", "gray")) abline(a = 0, b = 1) abline(h = 0, lty = 3) abline(v = 0, lty = 3)
In some situations it may be beneficial to perform the enrichment analysis in a more 'unbiased' way, using k-mers rather than annotated motifs. Here, we will illustrate the process using the same LMR data set as used for the motif enrichment analysis above in section \@ref(binned). Similarly to the motif enrichment, this step takes a while to perform, and we can also skip over the next step and load the processed object directly.
sekm <- calcBinnedKmerEnr(seqs = lmrseqs, bins = bins, kmerLen = 6, includeRevComp = TRUE)
sekm <- readRDS(system.file( "extdata", "results.binned_6mer_enrichment_LMRs.rds", package = "monaLisa" ))
Just as for the motif enrichment analysis, the return value is a
SummarizedExperiment
object, with the same set of assays and annotations.
sekm
As for the motif enrichment, we can extract any k-mer that is enriched in any of the bins.
selkm <- apply(assay(sekm, "negLog10Padj"), 1, function(x) max(abs(x), 0, na.rm = TRUE)) > 4 sum(selkm) sekmSel <- sekm[selkm, ]
Next, let's compare the enriched k-mers to the motifs that were found earlier.
This can be done using the motifKmerSimilarity
function. By showing the
similarity between the enriched k-mers and motifs, we can see whether, e.g.,
strongly enriched k-mers do not seem to correspond to an annotated motif.
pfmSel <- rowData(seSel)$motif.pfm sims <- motifKmerSimilarity(x = pfmSel, kmers = rownames(sekmSel), includeRevComp = TRUE) dim(sims) maxwidth <- max(sapply(TFBSTools::Matrix(pfmSel), ncol)) seqlogoGrobs <- lapply(pfmSel, seqLogoGrob, xmax = maxwidth) hmSeqlogo <- rowAnnotation(logo = annoSeqlogo(seqlogoGrobs, which = "row"), annotation_width = unit(1.5, "inch"), show_annotation_name = FALSE ) Heatmap(sims, show_row_names = TRUE, row_names_gp = gpar(fontsize = 8), show_column_names = TRUE, column_names_gp = gpar(fontsize = 8), name = "Similarity", column_title = "Selected TFs and enriched k-mers", col = colorRamp2(c(0, 1), c("white", "red")), right_annotation = hmSeqlogo)
r Biocpkg("monaLisa")
to annotate genomic regions with predicted motifs{#findhits}As mentioned, r Biocpkg("monaLisa")
can also be used to scan sequences for
motifs. For a quick description of motif representations see
section \@ref(motifConvert). Here is an example (just on a few
sequences/motifs for illustration):
# get sequences of promoters as a DNAStringSet # (the `subject` of `findMotifHits` could also be a single DNAString, # or the name of a fasta file) library(TxDb.Mmusculus.UCSC.mm10.knownGene) gr <- trim(promoters(TxDb.Mmusculus.UCSC.mm10.knownGene, upstream = 1000, downstream = 500)[c(1, 4, 5, 10)]) library(BSgenome.Mmusculus.UCSC.mm10) seqs <- getSeq(BSgenome.Mmusculus.UCSC.mm10, gr) seqs # get motifs as a PWMatrixList # (the `query` of `findMotifHits` could also be a single PWMatrix, # or the name of a motif file) library(JASPAR2020) library(TFBSTools) pfms <- getMatrixByID(JASPAR2020, c("MA0885.1", "MA0099.3", "MA0033.2", "MA0037.3", "MA0158.1")) pwms <- toPWM(pfms) pwms name(pwms) # predict hits in sequences res <- findMotifHits(query = pwms, subject = seqs, min.score = 6.0, method = "matchPWM", BPPARAM = BiocParallel::SerialParam()) res # create hit matrix: # number of sites of each motif per sequence m <- table(factor(seqnames(res), levels = names(seqs)), factor(res$pwmname, levels = name(pwms))) m
The transformation of sequence and PWM names to factors with defined levels
in the creation of the hit matrix above is not strictly needed, but it ensures
that even sequences or motifs without any hits are reported in the matrix,
and that the order of sequences (rows) and motifs (columns) is identical to
the order in seqs
and pwms
.
The monaLisa logo uses a drawing that was obtained from http://vectorish.com/lisa-simpson.html under the Creative Commons attribution - non-commercial 3.0 license: https://creativecommons.org/licenses/by-nc/3.0/.
This vignette was built using:
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.