In davetgerrard/GenomicLayers: Simple, sequence-based simulation of epi-genomes.

This vignette aims to model a series of epigenetic changes starting from seeding events by a transcription factor (RAP1) leading to the recruitment of a chromatin binding protein complex (Sir), which then spreads itself along the chromosome. The model will use the sequence of all nuclear chromosomes (~ 12 million bps) of Saccharomyces cerevisae (Brewer's yeast).

We will build two versions of the model differing in the sequence motif that RAP1 binds to DNA. These models are based on the following literature: [@lustig_tethered_1996] [@wahlin_saccharomyces_2000] [@rusche_establishment_2003]. More is known about this system and the models presented here could be extended to try to include greater complexity.

Specify Binding factors for RAP1 and Sir3

Here we are going to create two versions of the DNA-binding protein, RAP1 to be used in two slightly different models. Both will bind the DNA sequence GGTGT (or ACACC) but the second version (here 'bf.RAP1.4') will require two adjacent motifs separated by up to 3 bp. The variable length of the gap (0-3 bp) is achieved using a regular expression notation, which is accepted by some types of Biostrings searches. To use these in GenomicLayers, we need to use the createBindingFactor.regexp function to create the binding factor.

Additionally, we are going to create here another binding factor to represent Sir3, a key part of the Sir complex that is recruited by RAP1 after it has bound to DNA. We are ignoring the rest of the Sir complex here.

library(GenomicLayers)   # this is the first bit of code so we need to load the package 

# RAP1  
bf.RAP1.3 <- createBindingFactor.DNA_consensus("bf.RAP1.3", patternString="GGTGT", 
                                               mod.layers = "Sir3p_potential", mod.marks=1, stateWidth=5)

bf.RAP1.4 <- createBindingFactor.DNA_regexp("bf.RAP1.4", forRegExp="GGTGT(.{0,3})GGTGT",
                                            revRegExp="ACACC(.{0,3})ACACC", patternLength = 10,
                                            mod.layers = "Sir3p_potential", mod.marks=1, stateWidth=5, )

# Sir3p - binds to seeding sites 
bf.Sir3p_S <- createBindingFactor.layer_region("bf.Sir3p_S", type="layer_region", 
                                               patternLength = 5, patternString = "N",
                                               profile.layers = c("Sir3p_potential"),
                                               stateWidth=5, profile.marks = c(1), 
                                               mod.layers = "bound.Sir3p", mod.marks = 1)

Create a spreader binding factor

We also need something to spread Sir3p along the chromosome. We will create a binding factor that makes changes either up- or downstream of its own binding site. The offset.method parameter of the createBindingFactor functions will accept a function as an argument. We can use this to have the binding factor draw a variable offset value from a distribution. Before creating such a binding factor, we need to create the function to be passed to offset.method.

upDownFuncRnorm <- function(n, offset.mean, offset.sd)  {
  y <- round(rnorm(n, mean=offset.mean, sd=offset.sd))
  z <- sample(c(1, -1), length(y), replace=T)  # random vector of 1,-1  to negate half the values

  return(round(y*z))
}
# below is a simpler version that just returns randomly signed version of a fixed offset. This is not used in this vignette.
upDownFunc<- function(x)  {
  return(sample(c(x, -x), 1))
}

Now we have the functions to aid in spreading, we can create the binding factor that will make use of it. It will recognise marks on layer bound.Sir3p and add further marks to the same layer either upstream or downstream of any hits. The distance is derived from a normal distribution with mean 147 and standard deviation of 30. Values are equally likely to positive or negative.

bf.Sir3p.spread <-  createBindingFactor.layer_region("bf.Sir3p.spread", type="layer_region", 
                                                     patternLength = 5, 
                                                     profile.layers = c("bound.Sir3p"),
                                                     stateWidth=5, profile.marks = c(1), 
                                                     mod.layers = "bound.Sir3p", mod.marks = 1, offset = 100,
                                                     offset.method=upDownFuncRnorm, 
                                                     offset.params=list(offset.mean=147, offset.sd=30))

Create LayerSet object

Now that all the bindingFactors are specified, they require a target genome. This is handled as part of a layerSet object, which stores both the genome and a number of layers that track epigenetic states on the genome during the simulation.

Firstly, we load the genome and subset it to the set of chromosomes we want to use. This could be any subset of chromosomes, or even just one. Here we will use all nuclear chromosomes as we are interested in a protein that binds at telomeres and interacts with chromatin.

library(BSgenome.Scerevisiae.UCSC.sacCer3) 

genome <- BSgenome.Scerevisiae.UCSC.sacCer3

sequences_to_keep <- setdiff(seqnames(genome), "chrM")  

genomeNuc <- keepBSgenomeSequences(genome, sequences_to_keep)
genomeNuc    # this should now still be a useable BSgenome object but with no mitochondrial chromosome.

Once we have specified the genome subset, we need to decide names for layers we want to use. As we have already named the layers when we made the binding factors ("Sir3p_potential", "bound.Sir3p"), we should use the same names here.

Create the layerSet object to include the genome and all the layers we may need

Layers needed:

"Sir3p_potential": RAP1 binding/recruitment - can act as recruiter for other repressive modifiers.
"bound.Sir3p" : represents bound Sir complex. Can be recruited to "Sir3p_potential" or can be spread along the chromosome.

Now set up a LayerSet on the genome. This is a list where the first element LAYER.0 is the genome and GRanges list elements store position information for each layer. Two additional layers "H3K4me","H4K16ac" are specified here but not used in these simulations - they will be ignored as no binding factors have been specified to match or modify them.

layerGenomeNuc <- createLayerSet.BSgenome(genome=genomeNuc, 
                      layer.names=c("Sir3p_potential",  "bound.Sir3p", 
                        "H3K4me","H4K16ac"),  n.layers=4)

MODEL 1

The model is the combination of the layerSet (including the genome), all the binding factors, and a vector of abundance values which determine the number of sites across the genome that each bindingFactor can bind (and modify) during each iteration.

Combine all bfs into a list to pass to runLayerBinding. Order DOES matter. Then specify a vector of abundances.

bfSet.1 <- list(bf.RAP1.3=bf.RAP1.3, 
                bf.Sir3p_S=bf.Sir3p_S,
                bf.Sir3p.spread=bf.Sir3p.spread                

)   # binding factors must be in a list and named. Easiest to use each BFs name.

# specify abundances of bfs.
abs.1 <- structure(c(50,20,1000), names=c(names(bfSet.1)))   *  40  # multiply by 40 to scale up from chrIII to whole nuclear genome.
abs.1

Before running the simulation, it can be helpful to pre-calculate the binding sites for those factors that match to DNA. These caches can be re-used across simulations and are also a good way to check the binding factors are finding DNA binding sites as expected. Alternatively, you can skip this step and runLayerBinding will create these hit caches during the first iteration.

# take a copy of the subGenome object 
layerGenomeNuc.1 <- layerGenomeNuc
# cache the sequence specific parts of profiles
layerGenomeNuc.1 <- generateHitsCache(layerList=layerGenomeNuc.1, factorSet=bfSet.1[c('bf.RAP1.3' )], cache.layers="LAYER.0")

Finally we are ready to run the simulation. Here we run only for 10 iterations but we recommend running for 100 or more iterations and storing the output every 10 iterations. The image below shows the build up of marks on the two layers very close to one telomere of chrIII. The simulation is running across the whole genome so it is possible to look at any chromosomal region.

par(mar=c(5,9,5,2), mfrow=c(5,2))

for(i in 1:10) {
  print(i)
  layerGenomeNuc.1 <- runLayerBinding.BSgenome(layerList=layerGenomeNuc.1, 
                                               factorSet=bfSet.1, 
                                               bf.abundances = abs.1, 
                                               verbose=FALSE, collect.stats=TRUE)   # set verbose = TRUE to see more information about the matching.

  # the code below is commented out but shows how we can store the full model state every 'saveFreq' iterations.
  #if((i %% saveFreq) == 0) {   # export current model state.
  #    tempFileOut <- paste0(outFile, ".iter.", i, ".RData")
  #   print(paste("Saving to ", tempFileOut))
  #  save(layerGenomeNuc.1, file=tempFileOut)  #  use this to save model state periodically.

  #}

# here using an internal function to make a simple plot of some results near a telomere
    # because the layers here have long names.
  GenomicLayers:::plotLayers(layerSet = layerGenomeNuc.1, chrom="chrIII", xlim=c(1, 5000), layerNames=c('Sir3p_potential','bound.Sir3p'))

}

Check the output from this model

By default, runLayerBinding also generates a useful table of coverage after each iteration. This can be useful to plot trends as the simulation proceeds.

layerGenomeNuc.1$history

MODEL 2

Model 2 is exactly as model 1 except that RAP1 is substituted for a version that requires two tandem motifs gapped by 0-3 bps. We need to re-define the list of bindingFactors, their abundances, and reset the simulation to starting conditions.

bfSet.2 <- list(bf.RAP1.4=bf.RAP1.4, 
                bf.Sir3p_S=bf.Sir3p_S,
                bf.Sir3p.spread=bf.Sir3p.spread                

)   # binding factors must be in a list and named. Easiest to use each BFs name.

# specify abundances of bfs.
abs.2 <- structure(c(50,20,1000), names=c(names(bfSet.2)))   *  40  # multiply by 40 to scale up from chrIII to whole nuclear genome.

Then build the cache again. Here we can re-use the unused layerGenomeNuc object we made earlier.

# take a copy of the subGenome object 
layerGenomeNuc.2 <- layerGenomeNuc
# cache the sequence specific parts of profiles
layerGenomeNuc.2 <- generateHitsCache(layerList=layerGenomeNuc.2, factorSet=bfSet.2[c('bf.RAP1.4' )], cache.layers="LAYER.0")

Now we can run model 2.

par(mar=c(5,9,5,2), mfrow=c(5,2))

for(i in 1:10) {
  print(i)
  layerGenomeNuc.2 <- runLayerBinding.BSgenome(layerList=layerGenomeNuc.2, 
                                               factorSet=bfSet.2, 
                                               bf.abundances = abs.2, 
                                               verbose=FALSE, collect.stats=TRUE) 

  # here using an internal function to make a simple plot of some results near a telomere
  # because the layers here have long names.
  GenomicLayers:::plotLayers(layerSet = layerGenomeNuc.2, chrom="chrIII", xlim=c(1, 5000), layerNames=c('Sir3p_potential','bound.Sir3p'))


}

Hopefully you can see that for the chrIII telomere, there are fewer initial seeding sights and spreading is more important. You could try using the plotLayers function to look at the situation in other genomic regions.

Running many versions of the same simulation

GenomicLayers is even more useful if the simulation is run many times and then results are combined across runs. This can be used to generate scores for every bp of the genomes - i.e. in how many simulations did a particular nucleotide gain a particular mark. In practice, we use a high performance compute cluster to run many simulations quickly. We then use a package called plotGardener to show the coverage plots after many sets of results have been combined.

To repeat a simulation many times, it is useful to put it into a script and feed in parameters such as the number of iterations and the frequency with which to save the results. For this, I recommend the getopt package. See below for how we use this. To make a general script of running one of the above models, use the getopt library

# Load the getopt library
library(getopt)

# Define command-line options
spec = matrix(c(
  'verbose', 'v', 2, "integer",
  'help'   , 'h', 0, "logical",
  'out.file' , 'o', 2, "character", 
  'iterations', 'i', 2, "integer",
  'saveFreq',  's', 2, "integer"
), byrow=TRUE, ncol=4)

# Parse command-line options
opt <- getopt(spec)

# Set default values for options if they are not provided
if ( is.null(opt$verbose)) { opt$verbose <- FALSE }

# Output file
outFile <- opt$out.file
if (is.null(opt$out.file)) { outFile <- "outputfile" }

iters <- opt$iterations
if (is.null(opt$iterations)) { iters <- 100 }

saveFreq <- opt$saveFreq
if (is.null(opt$saveFreq)) { saveFreq <- iters }

# Print the values for verification
print(paste("Out file stem:", opt$out.file))