simMAE: Simulate Mean Absolute Error (MAE) in estimating isoform...
In casper: Characterization of Alternative Splicing based on Paired-End Reads

Description Usage Arguments Details Value References See Also Examples

Simulate several future RNA-seq data under various experimental settings (sequencing depth, read length, insert sizes), estimate isoform expression and assess the MAE incurred in the estimation process. The function is a wrapper combining functions simReads and calcExp.

simMAE(nsim, islandid, nreads, readLength, fragLength, burnin=1000, pc, distr, readLength.pilot=readLength, eset.pilot, usePilot=FALSE, retTxsError=FALSE, genomeDB, mc.cores=1, mc.cores.int=1, verbose=FALSE, writeBam=FALSE, bamFile=NULL)

`nsim`	Number of RNA-seq datasets to generate (often as little as `nsim=10` suffice)
`islandid`	When specified this argument indicates to run the simulations only for gene islands with identifiers in `islandid`. When not specified genome-wide simulations are performed.
`nreads`	Vector indicating the target number of read pairs for each experimental setting. The actual number of reads differs from `nreads` to account for non-mappability and random read yield (see details)
`readLength`	Vector indicating the read length in each experimental setting
`fragLength`	Vector indicating the mean insert size in each experimental setting
`burnin`	Number of MCMC burn-in samples (passed on to `calcExp`)
`pc`	Observed path counts in pilot data. When not specified, these are simulated from `eset.pilot`
`distr`	Estimated read start and insert size distributions in pilot data
`readLength.pilot`	Read length in pilot data
`eset.pilot`	ExpressionSet with pilot data expression in log2-RPKM, used to simulate `pc` when not specified by the user. See details
`usePilot`	By default `casper` assumes that the pilot data is from a related experiment rather than the current tissue of interest (`usePilot=FALSE`). Hence, the pilot data is used to simulate new RNA-seq data but not to estimate its expression. However, in some cases we may be interested in re-sequencing the pilot sample at deeper length, in which case one would want to combine the pilot data with the new data to obtain more precise estimates. This can be achieved by setting `usePilot=TRUE`
`retTxsError`	If `retTxsError=TRUE`, `simMAE` returns posterior expected MAE for each individual isoform. This option is not available when `eset.pilot` is specified instead of `pc`. Else the output is a `data.frame` with overall MAE across all isoforms
`genomeDB`	`annotatedGenome` object, as returned by `procGenome`
`mc.cores`	Number of cores to use in the expression estimation step, passed on to `calcExp`
`mc.cores.int`	Number of cores to simulate RNA-seq datasets in parallel
`verbose`	Set `verbose=TRUE` to print progress information
`writeBam`	Set to `TRUE` to write simulated reads to a .bam file
`bamFile`	Name of the .bam file

simMAE simulates nsim datasets under each experimental setting defined by nreads, readLength, fragLength. For each dataset the following steps are performed:

1. The number of reads is nreads * readYield * pmapped, where readYield= runif(1,0.8,1.2) accounts for deviations in read yield and pmapped= runif(1,0.6,0.9)*pmappable is the proportion of mapped reads (60%-90% of the mappable reads according to the piecewise-linear power law of Li et al (2014))

2. True expression levels pi are generated from their posterior distribution given the pilot data.

3. Conditional on pi, RNA-seq data are generated and expression estimates pihat are obtained using calcExp

4. The mean absolute estimation error sum(abs(pihat-pi)) across all isoforms is computed

Ideally simMAE should use pilot data from a relevant related experiment to simulate what future data may look like for the current experiment of interest. The recommened way to do this is to download a .bam file from such a related experiment and processing it in casper with function wrapKnown, as then both gene and isoform expression can be estimated accurately. The object output by wrapKnown is a list with elements named 'pc', 'distr' which can be given as input to simMAE.

As an alternative to specifying pc, simMAE allows setting eset.pilot as pilot data. Gene and isoform expression are then simulated as follows:

1. The number of reads per gene is generated from a Multinomial distribution with success probabilities proportional to 2^exprs{eset.pilot}.

2. Relative isoform expression within each gene are generated from a symmetric Dirichlet distribution with parameter 1/Ig, where Ig is the number of isoforms in gene g.

We emphasize that relative isoform expressions are not trained from the pilot data, and that while the distribution of gene expression levels resembles that in eset.pilot, no attempt is made to match gene identifiers and hence the results for individual genes should not be trusted (hence this option is only available when retTxsError==FALSE.

If retTxsError==TRUE, simMAE returns posterior expected MAE for each individual isoform. Else the output is a data.frame with overall MAE across all isoforms

Stephan-Otto Attolini C., Pena V., Rossell D. Bayesian designs for personalized alternative splicing RNA-seq studies (2014)

Li, W. and Freudenberg, J. and Miramontes, P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics, 15, 2 (2014)

wrapKnown,simReads,calcExp