The procedure we use for simulating signals of polymerase density uses composites of empirical signal from exemplar regions. To do this, first we must curate a set of genes where virtually all polymerase density in the region can be confidently attributed to a single annotation. Included with this package are a curated set of archetypes that can be loaded as follows.
library(nascentRNASim) ta <- load_archetype_set(id = "celastrol_dukler2017")
Archetypes can be viewed as follows:
view_archetype(ta, 1)
Because we simulate empirically from known profiles, the length of transcripts we can simulate from are limited to the lengths we already have profiles for. This means that we must simulate transcript annotations based on our archetype set so that the lengths of the transcripts match. However, we do simulate such that we match the properties of an existing template as closely as possible. In order to do this we require a template which can be any existing GRanges object. In this case we use a pre-cached txdb of hg38 and sample a thousand loci for simulation.
cache_set_dir() # Only download a txdb if you need a list of transcripts # download_txdb("hsapiens", ensembl_version = "98") txdb <- suppressMessages(load_txdb("homo_sapiens-grch37.p13-ensembl_genes_99.txdb")) template <- transcripts(txdb) sim_anno <- simulate_annotations(ta = ta, n = 100, template = template, seed = 5, genome = "hg19")
Then we simulate abundances for the annotations. In order to do that you must specify a
function that generates a vector of random numbers greater than or equal to zero. It also
must have as an argument n which specifies the number of observations to draw (A number
of the random number generators from the stats package fits all these requirements; eg.
rlnorm, rpois, rgamma, etc.). Additionally, you may design your own custom function to
pass here as long as it meets these criteria. Additional arguments to the sampling_dist
function can be passed with their keywords. There is also a zero_point_mass
argument
that sets the fraction of genes with zero abundance.
annotations <- simulate_abundances(annotations = sim_anno, sampling_dist = rgtex)
Then we resample from the archetypes to simulate 5' read density
sim_dat <- simulate_data(ta, annotations = annotations, show_progress = T)
Once we've simulated data, the annotations and data it can be exported to a directory along with the archetype object used to generate it.
tmp_dir <- tempdir() export_simulation(annotations = annotations, data = sim_dat, ta = ta, directory = tmp_dir, simulation_id = "test")
No automated way to do this currently. Generally we recommend filtering a list of
transcripts by a crude metric fo per transcript abundance to get only high abundance
candidates, then manually curating those that would be good archetypes. Here are the
rough criteria we used in curating the celastrol_dukler2017
archetypes:
Once a set of archetypes has been curated you can store a reduced lightweight bigwig that contains only the archetype regions plus some flanking signal. This makes it easier to package up for other people to use.
# store_reduced_archetype_bw(bed_path, bigwig_plus, bigwig_minus, # out_dir = "~/Downloads", flank_length = 2e4)
transcript_archetypes
objectid_path <- system.file("extdata", "archetype_sets", "celastrol_dukler2017", package = "nascentRNASim") # Load archetype annotations bed_path <- file.path(id_path, paste0(id, "_archetypes.bed")) bed <- rtracklayer::import.bed(bed_path) # Load bigwigs bwp_path <- file.path(id_path, paste0(id, "_plus.bw")) bwm_path <- file.path(id_path, paste0(id, "_minus.bw")) # Create transcript_archetype object, see documentation for details ta <- transcript_archetypes(transcripts = bed, bigwig_plus = bwp_path, bigwig_minus = bwm_path, flank = 6e3, abundance_filter = 0.05, mask = c(1e3, 1e3))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.