knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(tinytex.verbose = TRUE)
Methrix
provides set of function which allows easy importing of various flavors of bedgraphs generated by methylation callers, and many downstream analysis to be performed on large matrices.
This vignette describes basic usage of the package intended to process several large bedgraph files in R. In addition, a detailed exemplary complete data analysis with steps from reading in to annotation and differential methylation calling can be found in our WGBS best practices workflow
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") #Installing stable version from BioConductor BiocManager::install("methrix") #Installing developmental version from GitHub BiocManager::install("CompEpigen/methrix")
NOTE
Installation from BioConductor requires the BioC and R versions to be the newest. This arises from the restrictions imposed by BioConductor community which might cause package incompatibilities with the earlier versions of R (for e.g; R < 4.0). In that case installing from GitHub might be easier since it is much more merciful with regards to versions.
read_bedgraphs
function is a versatile bedgraph reader intended to import bedgraph files generated virtually by any sort of methylation calling program. It requires user to provide indices for chromosome names, start position and other required fields. There are also presets available to import bedgraphs
from most common programs such as Bismark
, MethylDackel
, and MethylcTools
.
#Load library library(methrix)
#Genome of your preference to work with if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") library(BiocManager) if(!requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) { BiocManager::install("BSgenome.Hsapiens.UCSC.hg19") } library(BSgenome.Hsapiens.UCSC.hg19)
#Example bedgraph files bdg_files <- list.files( path = system.file('extdata', package = 'methrix'), pattern = "*bedGraph\\.gz$", full.names = TRUE ) print(basename(bdg_files)) #Generate some sample annotation table sample_anno <- data.frame( row.names = gsub( pattern = "\\.bedGraph\\.gz$", replacement = "", x = basename(bdg_files) ), Condition = c("cancer", 'cancer', "normal", "normal"), Pair = c("pair1", "pair2", "pair1", "pair2"), stringsAsFactors = FALSE ) print(sample_anno)
We can import bedgraph files with the function read_bedgraphs
which reads in the bedgraphs, adds CpGs missing from the reference set, and creates a methylation/coverage matrices. Once the process is complete - it returns an object of class methrix
which in turn inherits SummarizedExperiment class. methrix
object contains 'methylation' and 'coverage' matrices (either in-memory or as on-disk HDF5 arrays) along with pheno-data and other basic info. This object can be passed to all downstream functions for various analysis.
#First extract genome wide CpGs from the desired reference genome hg19_cpgs <- suppressWarnings(methrix::extract_CPGs(ref_genome = "BSgenome.Hsapiens.UCSC.hg19"))
#Read the files meth <- methrix::read_bedgraphs( files = bdg_files, ref_cpgs = hg19_cpgs, chr_idx = 1, start_idx = 2, M_idx = 3, U_idx = 4, stranded = FALSE, zero_based = FALSE, collapse_strands = FALSE, coldata = sample_anno )
Note: Use the argument pipeline
if your bedgraphs are generated with "Bismark", "MethylDeckal", or "MethylcTools". This will automatically figure out the file formats for you, and you dont have to use the arguments chr_idx
start_idx
and so..
#Typing meth shows basic summary.
meth
Get basic summary statistics of the methrix
object with methrix_report
function which produces an interactive html report
methrix::methrix_report(meth = meth, output_dir = tempdir())
Click here for an example report.
Usual task in analysis involves removing uncovered CpGs. i.e, those loci which are not covered across all sample (in other words covered only in subset of samples resulting NA
for rest of the samples ).
meth = methrix::remove_uncovered(m = meth)
meth
One can also remove CpG sites overlaping with common SNPs based on minor allele frequencies.
if(!require(MafDb.1Kgenomes.phase3.hs37d5)) { BiocManager::install("MafDb.1Kgenomes.phase3.hs37d5")} if(!require(GenomicScores)) { BiocManager::install("GenomicScores")}
library(MafDb.1Kgenomes.phase3.hs37d5) library(GenomicScores) meth_snps_filtered <- methrix::remove_snps(m = meth)
#Example data bundled, same as the previously generated meth data("methrix_data") #Coverage matrix coverage_mat <- methrix::get_matrix(m = methrix_data, type = "C") head(coverage_mat)
#Methylation matrix meth_mat <- methrix::get_matrix(m = methrix_data, type = "M") head(meth_mat)
#If you prefer you can attach loci info to the matrix and output in GRanges format meth_mat_with_loci <- methrix::get_matrix(m = methrix_data, type = "M", add_loci = TRUE, in_granges = TRUE) meth_mat_with_loci
Furthermore if you prefer you can filter sites based on coverage conditions.
#e.g; Retain all loci which are covered at-least in two sample by 3 or more reads methrix::coverage_filter(m = methrix_data, cov_thr = 3, min_samples = 2)
Subset operations in methrix
make use of data.table
s fast binary search which is several orders faster than bsseq
or other similar packages.
#Retain sites only from chromosme chr21 methrix::subset_methrix(m = methrix_data, contigs = "chr21")
Regions can be data.table or GRanges format.
#e.g; Retain sites only in TP53 loci target_loci <- GenomicRanges::GRanges("chr21:27867971-27868103") print(target_loci) methrix::subset_methrix(m = methrix_data, regions = target_loci)
methrix::subset_methrix(m = methrix_data, samples = "C1") #Or you could use [] operator to subset by index methrix_data[,1]
meth_stats <- get_stats(m = methrix_data) print(meth_stats)
#Draw mean coverage per sample plot_stats(plot_dat = meth_stats, what = "C", stat = "mean") #Draw mean methylation per sample plot_stats(plot_dat = meth_stats, what = "M", stat = "mean")
mpca <- methrix_pca(m = methrix_data, do_plot = FALSE) #Plot PCA results plot_pca(pca_res = mpca, show_labels = TRUE) #Color code by an annotation plot_pca(pca_res = mpca, m = methrix_data, col_anno = "Condition")
#Violin plots methrix::plot_violin(m = methrix_data)
methrix::plot_coverage(m = methrix_data, type = "dens")
If you prefer to work with bsseq object, you can generate bsseq
object from methrix with the methrix2bsseq
.
if(!require(bsseq)) { BiocManager::install("bsseq")}
library(bsseq) bs_seq <- methrix::methrix2bsseq(m = methrix_data) bs_seq
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.