knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) suppressPackageStartupMessages({ library(GenomicRanges) library(ggplot2) library(magrittr) library(periodicDNA) }) BiocParallel::register(setUpBPPARAM(1), default = TRUE)
Short DNA sequence motifs provide key information for interpreting the instructions in DNA, for example by providing binding sites for proteins or altering the structure of the double-helix. A less studied but important feature of DNA sequence motifs is their periodicity. A famous example is the 10-bp periodicity of many k-mers in nucleosome positioning (reviewed in Travers et al. 2010 or in Struhl and Segal 2013).
periodicDNA provides a framework to quantify the periodicity of k-mers of interest in DNA sequences. For a chosen k-mer, periodicDNA can identify which periods are statistically enriched in a set of sequences, by using a randomized shuffling approach to compute an empirical p-value. It can also generate continuous linear tracks of k-mer periodicity strength over genomic loci.
To estimate the periodicity strength of a given k-mer in one or several sequences, periodicDNA performs the following steps:
knitr::include_graphics( "https://raw.githubusercontent.com/js2264/periodicDNA/master/man/figures/periodicityDNA_principle.png" )
The main goal of periodicDNA is to quantify periodicity of a given k-mer in a
set of sequences. For instance, one can assess the periodicity of TT
dinucleotides in sequences around TSSs of ubiquitous promoters using
getPeriodicity()
.
In the following example, getPeriodicity()
is directly ran using
a GRanges object, specifying from which genome this GRanges comes from.
library(ggplot2) library(magrittr) library(periodicDNA) # data(ce11_TSSs) periodicity_result <- getPeriodicity( ce11_TSSs[['Ubiq.']][1:500], genome = 'BSgenome.Celegans.UCSC.ce11', motif = 'TT', BPPARAM = setUpBPPARAM(1) )
The main output of getPeriodicity()
is a table of power spectral density
(PSD) values associated with discrete frequencies, computed using a
Fast Fourier Transform. For a given frequency, a high PSD score
indicates a high periodicity of the k-mer of interest.
In the previous example, TT dinucleotides in sequences around TSSs of ubiquitous promoters are highly periodic, with a periodicity of 10 bp.
head(periodicity_result$PSD) subset(periodicity_result$periodicityMetrics, Period == 10)
Graphical output of getPeriodicity()
can be obtained using the
plotPeriodicityResults()
function:
plotPeriodicityResults(periodicity_result)
The first plot shows the raw distribution of distances between pairs of 'TT' in the sequences of the provided GRanges. The second plot shows the decay-normalised distribution. Finally, the third plot shows the PSD scores of the 'TT' k-mer, measured from the normalised distribution.
periodicDNA provides an approach to compare the periodicity of a given k-mer
in a set of sequences compared to background. For a given k-mer at a period T
in a set of input sequences, the fold-change over background of its PSD
is estimated by iteratively shuffling the input sequences and estimating
the resulting PSD values.
Eventually, the log2 fold-change (l2FC) between the observed PSD and the
median of the PSD values measured after shuffling is computed as follows:
l2FC = log2(observed PSD / median(shuffled PSDs)).
periodicity_result <- getPeriodicity( ce11_TSSs[['Ubiq.']][1:500], genome = 'BSgenome.Celegans.UCSC.ce11', motif = 'TT', n_shuffling = 5 ) head(periodicity_result$periodicityMetrics) subset(periodicity_result$periodicityMetrics, Period == 10) plotPeriodicityResults(periodicity_result)
If n_shuffling >= 100
, an associated empirical p-value is calculated as
well (North et al 2002). This metric indicates, for each individual period T,
whether the observed PSD is significantly greater than the PSD values measured
after shuffling the input sequences. Note that empirical p-values are only an
estimation of the real p-value. Notably, small p-values are systematically
under-estimated (North et al 2002).
getPeriodicity()
can also be ran directly on a set of sequences of interest
as follows:
data(ce11_proms_seqs) periodicity_result <- getPeriodicity( ce11_proms_seqs, motif = 'TT', BPPARAM = setUpBPPARAM(1) ) subset(periodicity_result$periodicityMetrics, Period == 10)
The other aim of periodicDNA is to generate continuous linear tracks of
k-mer periodicity strength over genomic loci of interest.
getPeriodicityTrack()
can be used to generate suck genomic tracks. In the
following example,
WW_10bp_track <- getPeriodicityTrack( genome = 'BSgenome.Celegans.UCSC.ce11', granges = ce11_proms, motif = 'WW', period = 10, BPPARAM = setUpBPPARAM(1), bw_file = 'WW-10-bp-periodicity_over-proms.bw' )
When plotted over sets of ubiquitous, germline or somatic TSSs, the resulting track clearly shows increase of WW 10-bp periodicity above the ubiquitous and germline TSSs, whereas somatic TSSs do not show such increase.
data(ce11_TSSs) plotAggregateCoverage( WW_10bp_track, ce11_TSSs, xlab = 'Distance from TSS', ylab = '10-bp periodicity strength (forward proms.)' )
knitr::include_graphics( "https://raw.githubusercontent.com/js2264/periodicDNA/master/man/figures/TT-10bp-periodicity_tissue-spe-TSSs.png" )
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.