Once the experiment is generated, we can create an interactive html report containing basic summary statistics of the scMethrix
object with scMethrix_report
function.
The report can be accessed here: Initial reports
methrix::methrix_report(meth = meth, output_dir = getwd(), prefix = "initial")
In the report, we can check genome-wide and chromosome based statistics on coverage and methylation, in order to identify potential quality issues, for example:
samples with too high or too low genome-wide coverage or on a specific chromosome. In this case, there is also a possibility, that the sample is affected by copy number alterations. Both methylation levels and coverage therefore shows higher variation among cancer samples.
samples with altered beta level distribution.
A "bump" on the beta level density plot at 0.5 indicates the presence of single nucleotide polymorphisms (SNPs) -> see later.
To ensure the high quality of our dataset, the sites with very low coverage (untrustworthy methylation level) or high average coverage (technical problem) should not be used. We can mask these CpG sites. Please note, that the DSS we were using for Differential Methylation Region (DMR) calling, doesn't need the removal of the lowly covered sites, because it takes it into account by analysis. However, using a not too restrictive threshold won't interfere with DMR calling.
# Masking CpG sites with a coverage < 2 and an average coverage > 2 meth <- scMethrix::mask_by_coverage(meth, low_threshold = 2, avg_threshold = 2)
If the coverage matrix is absent, filtering can be done via sample count instead. Similar to mask_by_coverage
, low_threshold
masks be low sample count, whereas prop_threshold
will mask sites that appear in less than a certain proportion of all input samples.
# Masking CpG sites present in fewer than 2 samples meth <- scMethrix::mask_by_sample(meth, low_threshold = 2)
# Masking CpG sites present in fewer than 5% of samples meth <- scMethrix::mask_by_sample(meth, prop_threshold = 0.05)
CpG sites with a low inter-individual methylation variance can increase the risk of false discoveries and reduce statistical significance. We can mask the sites to prevent this, as well as decreasing computation time and data size. The function mask_by_variance
will calculate the variance of each CpG site and mask it if below a specified threshold.
# Masking sites with a variance < 0.05 meth <- scMethrix::mask_by_variance(meth, low_threshold = 0.05)
With SNPs, the C > T mutations can disrupt methylation calling. Therefore, it is essential to remove CpG sites that overlap with common variants, if we have variation in our study population. For example working with human, unpaired data, e.g. treated vs. untreated groups.
During filtering, we can select the population of interest and the minimum minor allele frequency (MAF) of the SNPs.
SNP filtering is currently implemented for hg19 and hg38 assemblies. For mm10 assemblies, if the same strain of mice is used, SNP filtering is not generally necessary due to high homozygosity between individuals. SNP data for mice can be downloaded from here: https://www.sanger.ac.uk/science/data/mouse-genomes-project.
if(!requireNamespace("GenomicScores")) { BiocManager::install("GenomicScores") } library(GenomicScores)
meth <- scMethrix::remove_snps(meth, keep = TRUE)
if (!requireNamespace("pheatmap", quietly = TRUE)) install.packages("pheatmap") snps <- meth[[2]] meth <- meth[[1]] pheatmap::pheatmap(get_matrix(order_by_sd(snps)[1:min(5000, length(snps)),]))
After masking is completed, we can remove those sites that are not covered by any of the samples.
meth <- scMethrix::remove_uncovered(meth)
Often, whole genome analysis using all samples is not necessary. This package includes multiple convenient options for querying - regions
, contigs
, or samples
- and can processed as either inclusive or exclusive. Multiple query types can be given in each command. Subset operations in scMethrix
make use of the fast binary search data.table
that is several orders faster than bsseq or other similar packages.
# Subsetting to include only two samples meth <- scMethrix::subset_methrix(meth, contigs = c("GSM2553093", "GSM2553095"), by="include") colnames(inc)
# Subsetting to exclude two samples meth <- scMethrix::subset_methrix(meth, contigs = c("GSM2553093", "GSM2553095"), by="exclude") colnames(meth)
# Subsetting to include only chr1 and chr2 meth <- scMethrix::subset_methrix(meth, contigs = c("chr1", "chr2")) meth GenomicRanges::seqinfo(SummarizedExperiment::rowRanges(meth))
# Subsetting to include only bases 1:10M in chr1 regions <- GenomicRanges::GRanges(seqnames = "chr1", ranges = IRanges(1,10000000)) meth <- scMethrix::subset_methrix(meth, regions = regions) meth GenomicRanges::seqlengths(SummarizedExperiment::rowRanges(meth))
After filtering, it worth running a report again, to see if any of the samples were so lowly covered that it warrants an action (e.g. removal of the sample)
Important: Don't forget to set the recal_stats
to TRUE
, since the object changed since reading in. The output directory has to be different from last time, to avoid using the intermediate files calculated during the last run.
methrix::methrix_report(meth = meth, output_dir = getwd(), recal_stats = TRUE, prefix="processed")
The report can be found here: processed_methrix_reports.html
There are additional possibilities to visualize the study. We can look at the density of coverage both sample- and group-wise:
scMethrix::plot_coverage(meth, type = "dens")
scMethrix::plot_coverage(meth, type = "dens", pheno = "Day", perGroup = TRUE)
We can visualize te beta value distribution as a violin plot or density plot. These plots (as well as the coverage plot) use the 25000 most variable CpG sites to ensure fast processing. plot_density
and plot_violin
has the option to use data only from a restricted region.
scMethrix::plot_density(meth)
scMethrix::plot_density(meth, ranges = GRanges("chr1:1-10000000"))
scMethrix::plot_violin(meth)
scMethrix
offers principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and t-distributed stochastic neighbor embedding (tSNE) to conduct and visualize dimensionality reduction. The resulting reduction is stored in the scMethrix object, and can be accessed via reducedDim().
First, the model is calculated, then a number of sites are selected for plotting, either random or based on variance (var
option). This ensures that the calculations remain feasible.
meth <- scMethrix::dim_red_scMethrix(meth,type="PCA",top_var = 10000, n_components = 20) plot_dim_red(meth,type="PCA")
meth <- scMethrix::dim_red_scMethrix(meth,type="UMAP",top_var = 10000, n_neighbors = 3) plot_dim_red(meth,type="UMAP")
meth <- scMethrix::dim_red_scMethrix(meth,type="tSNE",top_var = 10000, n_components = 20) plot_dim_red(meth,type="tSNE")
At visualization, we can provide the methrix
object to allow color or shape annotation of groups or samples.
scMethrix::plot_pca(meth, col_anno = "Day", shape_anno = "Replicate")
scMethrix
offers the possibility of region based filtering. With this option, selected regions (e.g. promoters) can be visualized. We will use the AnnotationHub
package to assess basic annotation categories.
if(!requireNamespace("AnnotationHub")) BiocManager::install("AnnotationHub", update = F)
library("AnnotationHub") ah = AnnotationHub() qhs = query(ah, c("RefSeq", "Mus musculus", "mm10")) genes = qhs[[1]] promotors = promoters(genes)
meth.subset <- subset_methrix(meth,regions = promotors) meth <- scMethrix::dim_red_scMethrix(meth.subset,type="PCA",top_var = 10000, n_components = 20) plot_dim_red(meth,type="PCA")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.