Consider alternative names for package and GitHub repo, e.g., bsbs
, bssim
. Should package name be CamelCase
, lowercase
, etc., e.g.:
MethSim
: Consistent with GenomicTuples
, MethylationTuples
, etc.methsim
: Simpler to type.The steps to partitioning a single methylome are:
GenomicRanges::GRanges
object with mcols
named T
and M
storing the total number of reads and number of methylated reads, respectively. This object must also contain the seqlengths
, which are most easily stored as a complete Seqinfo
object in the seqinfo
slot.MethylSeekR
authors take this as evidence of PMRs and recommend that these by masked.Steps 1-4 are basically the job of MethylSeekR
. Step 5, however, is specific to methsim
.
MethlySeekR
wayJust follow the MethylSeekR
vignette to achieve Steps 1-4. Step 5 is specific to methsim
.
MethylationTuples
wayUse the convenience functions provided by methsim
to work with existing MethylationTuples::MethtPat
objects to achieve Steps 1-4. Step 5 is specific to methsim
.
MethylSeekR
from 1-4.MethPat
objects.MethylationTuples::readMethtuple
->
MethylationTuples::filterOutVariants
-> as(MethPat, "MethylSeekRGR")
or MethylationTuples::readMethtuple
->
as(MethPat, "MethylSeekRGR")
->
lapply(list_of_msrgr, MethylSeekR::removeSNPs)
.MethylSeekR::plotAlphaDistributionOneChr
and
MethylSeekR::segmentPMDs
.MethylSeekR::calculateFDRs
and
MethylSeekR::segmentUMRsLMRs
.methsim::partitionMethylome
TODO: Run a MethylSeekRGR
object through the MethylSeekR
pipeline and get to writing the "Step 5" functionality, i.e., partitionMethylome
.
extractSimulateMethylomeParams()
It would be nice to have a function that creates a SimulateMethylomeParam
object. Input would be two MethPat
objects containing 1-tuples and 2-tuples, respectively. See processOneTuples()
and processTwoTuples()
below:
processOneTuples <- function(dataset, seqlevels, min_cov) { if (missing(seqlevels)) { stop("Must supply 'seqlevels'.") } # Read in data methpat <- readRDS(paste0("../processed_data/", dataset, "/", dataset, "_1_tuples_strand_collapsed.rds")) l_pm <- readRDS(paste0("rds/", dataset, "/PartitionedMethylome/", dataset, "_pm.rds")) # Compute beta-values and annotate by region type beta <- bplapply(names(l_pm), function(sn, methpat, l_pm, min_cov, seqlevels) { pm <- l_pm[[sn]] # Only want data on sample 'sn'. methpat <- methpat[, sn] # Retain only the relevant seqlevels. methpat <- keepSeqlevels(methpat, seqlevels) # Apply MethylationTuples::methLevel() val <- funByPM(FUN = MethylationTuples::methLevel, pm = pm, methpat = methpat, min_cov = min_cov) # Add information not returned by methLevel() val[, sample := sn] setnames(val, c("beta", "type", "sample")) setkeyv(val, c("sample", "type", "beta")) val }, methpat = methpat, l_pm = l_pm, min_cov = min_cov, seqlevels = seqlevels) beta <- rbindlist(beta) # Tabulate frequency of each beta-value by sample and type. beta[, .N, by = list(sample, type, beta)] } processTwoTuples <- function(dataset, seqlevels, min_cov) { if (missing(seqlevels)) { stop("Must supply 'seqlevels'.") } # Read in data methpat <- readRDS(paste0("../processed_data/", dataset, "/", dataset, "_2_tuples_strand_collapsed.rds")) l_pm <- readRDS(paste0("rds/", dataset, "/PartitionedMethylome/", dataset, "_pm.rds")) # Compute beta-values and annotate by region type lor <- bplapply(names(l_pm), function(sn, methpat, l_pm, min_cov, seqlevels, method, offset) { pm <- l_pm[[sn]] # Only want data on sample 'sn'. methpat <- methpat[, sn] # Retain only the relevant seqlevels. methpat <- keepSeqlevels(methpat, seqlevels) # Apply MethylationTuples::cometh() funByPM(MethylationTuples::cometh, pm = pm, methpat = methpat, min_cov = min_cov, method = method, offset = offset) }, methpat = methpat, l_pm = l_pm, min_cov = min_cov, seqlevels = seqlevels, method = "lor", offset = 0.5) lor <- rbindlist(lor) # NOTE: This ignores strand. lor_reduced <- lor[, IPD := pos2 - pos1][ , list(IPD, sample, type, statistic)][ , .N, by = list(sample, IPD, type, statistic)] setorder(lor_reduced, sample, IPD, type, -N) lor_reduced }
Each sample needs an underlying "true" methylome ("truth") from which the reads are simulated.
In order of increasing biological variability in the truth:
Technical replicates correspond to 1. Biological replicates fall somewhere between 2-3. I suspect that 4 doesn't allow for sufficient control over the process to be generally useful.
To add DMRs requires introducing (known) biological differences between experimental conditions. I think this is best done by altering the region-specific parameters of one experimental condition.
These are questions to explore once I am able to simulate a single sample's worth of data (in order of simplicity):
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.