R/generateSyntheticData.R

Defines functions summarizeSyntheticDataSet computeFactorLengths generateLengths getNegativeBinomialDispersion getNegativeBinomialMean get_factor getNegativeBinomialParameters simulateData computeAval computeMval generateSyntheticData

Documented in computeFactorLengths generateLengths generateSyntheticData getNegativeBinomialDispersion getNegativeBinomialMean getNegativeBinomialParameters simulateData summarizeSyntheticDataSet

#' Generate synthetic count data sets
#'
#' Generate synthetic count data sets, following the simulation strategy detailed in Soneson and Delorenzi (2013).
#'
#' In the comparison function, only results obtained for data sets with the same value of the \code{dataset} parameter will be compared. Hence, it is important to give the same value of this parameter e.g. to different replicates generated with the same simulation settings.
#'
#' For more detailed information regarding the different types of outliers, see Soneson and Delorenzi (2013).
#'
#' Mean and dispersion parameters (if \code{relmeans} and/or \code{dispersions} is set to \code{"auto"}) are sampled from values estimated from the data sets by Pickrell et al (2010) and Cheung et al (2010). The data sets were downloaded from the ReCount web page (Frazee et al (2011)) and processed as detailed by Soneson and Delorenzi (2013).
#'
#' To get the actual mean value for the Negative Binomial distribution used for the simulation of counts for a given sample, take the column \code{truemeans.S1} (or \code{truemeans.S2}, if the sample is in condition S2) of the \code{variable.annotations} slot, divide by the sum of the same column and multiply with the base sequencing depth (provided in the \code{info.parameters} list) and the depth factor for the sample (given in the \code{sample.annotations} data frame). Thus, if you have a vector of mean values that you want to provide as the \code{relmeans} argument and make sure to use it 'as-is' in the simulation (for condition S1), make sure to set the \code{seqdepth} argument to the sum of the values in the \code{relmeans} vector, and to set \code{minfact} and \code{maxfact} equal to 1.
#'
#' When the \code{tree} argument is provided (not \code{NULL}),
#' then the "phylogenetic Poisson log-Normal" model is used for the simulations,
#' possibly with varying gene lengths across species
#' (both \code{lengths.relmeans} and \code{lengths.dispersions} must be specified
#' or set to \code{"auto"}.)
#' Phylogenetic simulations use the \code{\link[phylolm]{rTrait}} function
#' from package \code{phylolm}.
#'
#' @param dataset A name or identifier for the data set/simulation settings.
#' @param n.vars The initial number of genes in the simulated data set. Based on the filtering conditions (\code{filter.threshold.total} and \code{filter.threshold.mediancpm}), the number of genes in the final data set may be lower than this number.
#' @param samples.per.cond The number of samples in each of the two conditions.
#' @param n.diffexp The number of genes simulated to be differentially expressed between the two conditions.
#' @param repl.id A replicate ID for the specific simulation instance. Useful for example when generating multiple count matrices with the same simulation settings.
#' @param seqdepth The base sequencing depth (total number of mapped reads). This number is multiplied by a value drawn uniformly between \code{minfact} and \code{maxfact} for each sample to generate data with different actual sequencing depths.
#' @param minfact,maxfact The minimum and maximum for the uniform distribution used to generate factors that are multiplied with \code{seqdepth} to generate individual sequencing depths for the simulated samples.
#' @param relmeans A vector of mean values to use in the simulation of data from the Negative Binomial distribution, or \code{"auto"}. Note that these values may be scaled in order to comply with the given sequencing depth. With the default value (\code{"auto"}), the mean values are sampled from values estimated from the Pickrell and Cheung data sets. If \code{relmeans} is a vector, the provided values will be used as mean values in the simulation for the samples in the first condition. The mean values for the samples in the second condition are generated by combining the \code{relmeans} and \code{effect.size} arguments.
#' @param dispersions A vector or matrix of dispersions to use in the simulation of data from the Negative Binomial distribution, or \code{"auto"}. With the default value (\code{"auto"}), the dispersion values are sampled from values estimated from the Pickrell and Cheung data sets. If both \code{relmeans} and \code{dispersions} are set to \code{"auto"}, the means and dispersion values are sampled in pairs from the values in these data sets. If \code{dispersions} is a single vector, the provided dispersions will be used for simulating data from both conditions. If it is a matrix with two columns, the values in the first column are used for condition 1, and the values in the second column are used for condition 2.
#' @param fraction.upregulated The fraction of the differentially expressed genes that is upregulated in condition 2 compared to condition 1.
#' @param between.group.diffdisp Whether or not the dispersion should be allowed to be different between the conditions. Only applicable if \code{dispersions} is \code{"auto"}.
#' @param filter.threshold.total The filter threshold on the total count for a gene across all samples. All genes for which the total count across all samples is less than the threshold will be filtered out.
#' @param filter.threshold.mediancpm The filter threshold on the median count per million (cpm) for a gene across all samples. All genes for which the median cpm across all samples is less than the threshold will be filtered out.
#' @param fraction.non.overdispersed The fraction of the genes that should be simulated according to a Poisson distribution, without overdispersion. The non-overdispersed genes will be divided proportionally between the upregulated, downregulated and non-differentially expressed genes.
#' @param random.outlier.high.prob The fraction of 'random' outliers with unusually high counts.
#' @param random.outlier.low.prob The fraction of 'random' outliers with unusually low counts.
#' @param single.outlier.high.prob The fraction of 'single' outliers with unusually high counts.
#' @param single.outlier.low.prob The fraction of 'single' outliers with unusually low counts.
#' @param effect.size The strength of the differential expression, i.e., the effect size, between the two conditions. If this is a single number, the effect sizes will be obtained by simulating numbers from an exponential distribution (with rate 1) and adding the results to the \code{effect.size}. For genes that are upregulated in the second condition, the mean in the first condition is multiplied by the effect size. For genes that are downregulated in the second condition, the mean in the first condition is divided by the effect size. It is also possible to provide a vector of effect sizes (one for each gene), which will be used as provided. In this case, the \code{fraction.upregulated} and \code{n.diffexp} arguments will be ignored and the values will be derived from the \code{effect.size} vector.
#' @param output.file If not \code{NULL}, the path to the file where the data object should be saved. The extension should be \code{.rds}, if not it will be changed.
#' @param tree a dated phylogenetic tree of class \code{\link[ape]{phylo}} with `samples.per.cond * 2` species.
#' @param prop.var.tree the proportion of the common variance explained by the tree for each gene. It can be a scalar, in which case the same parameter is used for all genes. Otherwise it needs to be a vector with length \code{n.vars}. Default to 1.
#' @param model.process the process to be used for phylogenetic simulations. One of "BM" or "OU", default to "BM".
#' @param selection.strength if the process is "OU", the selection strength parameter.
#' @param id.condition A named vector, indicating which species is in each condition. Default to first `samples.per.cond` species in condition `1` and others in condition `2`.
#' @param id.species A factor giving the species for each sample. If a tree is used, should be a named vector with names matching the taxa of the tree. Default to \code{rep(1, 2*samples.per.cond)}, i.e. all the samples come from the same species.
#' @param check.id.species Should the species vector be checked against the tree lengths (if provided) ? If TRUE, the function checks that all the samples that share a factor value in \code{id.species} that their distance on the tree is zero, i.e. that they are on the same tip of the tree. Default to TRUE.
#' @param lengths.relmeans An optional vector of mean values to use in the simulation of lengths from the Negative Binomial distribution. Should be of length n.vars. Default to \code{NULL}: the lengths are not taken into account for the simulation. If set to \code{"auto"}, the mean length values are sampled from values estimated from the Stern & Crandall (2018) data set.
#' @param lengths.dispersions An optional vector of dispersions to use in the simulation of data from the Negative Binomial distribution. Should be of length n.vars. Default to \code{NULL}: the lengths are not taken into account for the simulation. If set to \code{"auto"}, the dispersion length values are sampled from values estimated from the Stern & Crandall (2018) data set.
#' @param lengths.phylo If TRUE, the lengths are simulated according to a phylogenetic Poisson Log-Normal model on the tree, with a BM process. If FALSE, they are simulated according to an iid negative binomial distribution. In both cases, \code{lengths.relmeans} and \code{lengths.dispersions} are used. Default to TRUE if a tree is provided.
#'
#' @return A \code{\link{compData}} object. If \code{output.file} is not \code{NULL}, the object is saved in the given \code{output.file} (which should have an \code{.rds} extension).
#' @export
#' @author Charlotte Soneson
#' @examples
#' ## RNA-Seq data
#' mydata.obj <- generateSyntheticData(dataset = "mydata", n.vars = 1000,
#'                                     samples.per.cond = 5, n.diffexp = 100)
#'
#' ## Inter-species RNA-Seq data
#' library(ape)
#' tree <- read.tree(text = "(((A1:0,A2:0,A3:0):1,B1:1):1,((C1:0,C2:0):1.5,(D1:0,D2:0):1.5):0.5);")
#' id.species <- factor(c("A", "A", "A", "B", "C", "C", "D", "D"))
#' names(id.species) <- tree$tip.label
#' mydata.obj <- generateSyntheticData(dataset = "mydata", n.vars = 1000,
#'                                     samples.per.cond = 4, n.diffexp = 100,
#'                                     tree = tree,
#'                                     id.species = id.species,
#'                                     lengths.relmeans = "auto",
#'                                     lengths.dispersions = "auto")
#' @references
#' Soneson C and Delorenzi M (2013): A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 14:91
#'
#' Cheung VG, Nayak RR, Wang IX, Elwyn S, Cousins SM, Morley M and Spielman RS (2010): Polymorphic cis- and trans-regulation of human gene expression. PLoS Biology 8(9):e1000480
#'
#' Frazee AC, Langmead B and Leek JT (2011): ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics 12:449
#'
#' Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y and Pritchard JK (2010): Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768-772
#'
#' Robles JA, Qureshi SE, Stephen SJ, Wilson SR, Burden CJ and Taylor JM (2012): Efficient experimental design and analysis strategies for the detection of differential expression using RNA-sequencing. BMC Genomics 13:484
#'
#' Stern DB and Crandall KA (2018): The Evolution of Gene Expression Underlying Vision Loss in Cave Animals. Molecular Biology and Evolution. 35:2005–2014.
#' @import sm
#'
generateSyntheticData <- function(dataset, n.vars, samples.per.cond, n.diffexp, repl.id = 1,
                                  seqdepth = 1e7, minfact = 0.7, maxfact = 1.4,
                                  relmeans = "auto", dispersions = "auto",
                                  fraction.upregulated = 1, between.group.diffdisp = FALSE,
                                  filter.threshold.total = 1, filter.threshold.mediancpm = 0,
                                  fraction.non.overdispersed = 0, random.outlier.high.prob = 0,
                                  random.outlier.low.prob = 0, single.outlier.high.prob = 0,
                                  single.outlier.low.prob = 0, effect.size = 1.5,
                                  output.file = NULL,
                                  tree = NULL, prop.var.tree = 1.0,
                                  model.process = c("BM", "OU"), selection.strength = 0,
                                  id.condition = NULL,
                                  id.species = as.factor(rep(1, 2 * samples.per.cond)),
                                  check.id.species = TRUE,
                                  lengths.relmeans = NULL, lengths.dispersions = NULL, lengths.phylo = TRUE) {

  ## Check output file name
  if (!is.null(output.file)) {
    if (!(substr(output.file, nchar(output.file) - 3, nchar(output.file)) == ".rds")) {
      stop("output.file must be an .rds file.")
    }
  }

  ## Generate a unique ID for the data set
  uID <- paste(sample(c(0:9, letters, LETTERS), 10, replace = TRUE), collapse = "")

  ## Checks for phylogenetic tree
  use_tree <- !is.null(tree) # If tree is specified, use it.
  if (lengths.phylo && !use_tree) {
    lengths.phylo <- FALSE
  }


  ## Check id.species
  if (!is.factor(id.species)) warning("Vector 'id.species' must be a factor. Transforming.")
  id.species <- as.factor(id.species)
  levels(id.species) <- 1:length(levels(id.species))

  if (use_tree) {
    ## Check package
    if (!requireNamespace("ape", quietly = TRUE)) {
      stop("Package 'ape' is needed for phylogenetic simulations.", call. = FALSE)
    }
    ## Check Tree
    if (!inherits(tree, "phylo")) {
      stop("The `tree` must be of class `phylo` from package `ape`.")
    }
    ## Check that the tree has the right number of species
    if (length(tree$tip.label) != samples.per.cond * 2) {
      stop("The tree should have as many species as `samples.per.cond` times two.")
    }
    ## Check that the tree is ultrametric
    if (!ape::is.ultrametric(tree)) {
      stop("The tree should be ultrametric.")
    }

    ## Check Conditions
    if (!is.null(id.condition)) {
      id.condition <- checkParamVector(id.condition, "id.condition", tree)
    } else {
      id.condition <- rep(c(1, 2), each = samples.per.cond)
      names(id.condition) <- tree$tip.label
    }

    ## Check id species
    id.species <- checkSpecies(id.species, "id.species", tree, tol = 1e-10, check.id.species)

    ## Check that all genes are over-dispersed
    if (fraction.non.overdispersed != 0) {
      stop("The Phylogenetic Poisson lognormal distribution is always over-dispersed.")
    }

    ## Check Proportion
    if (!is.vector(prop.var.tree)) {
      stop("`prop.var.tree` should be a vector or a scalar.")
    }
    if (length(prop.var.tree) != n.vars) {
      if (length(prop.var.tree) != 1) stop("`prop.var.tree` should be a vector of length the number of genes, or a scalar (in which case it will be recycled).")
    }
    if (any(prop.var.tree > 1.0 | prop.var.tree < 0.0)) {
      stop("All entries of `prop.var.tree` should be between 0 and 1.")
    }

  } else {
    ## Check id.condition (non phylogenetic)
    if (!is.null(id.condition)) {
      if (length(id.condition) != 2 * samples.per.cond) {
        stop("Vector of conditions `id.condition` should have length `2*samples.per.cond`.")
      }
    } else {
      id.condition <- rep(c(1, 2), each = samples.per.cond)
    }
  }

  ## Checks lengths
  if (is.null(lengths.relmeans) != is.null(lengths.dispersions)) {
    stop("For lengths to be used, both the 'lengths.relmeans' and 'lengths.dispersions' vectors must be provided.")
  }
  use_lengths <- !is.null(lengths.relmeans) # If lengths are specified, use them.
  if (use_lengths) {
    if (is.character(lengths.relmeans) || is.character(lengths.dispersions)) {  # if they are 'auto'
      ### Load mu and phi estimates from real data (Stern dataset)
      length.mu.phi.estimates <- system.file("extdata", "Stern2018.Length.Mu.Phi.Estimates.rds",
                                             package = "compcodeR")
      length.mu.phi.estimates <- readRDS(length.mu.phi.estimates)
      length.mu.estimates <- length.mu.phi.estimates$stern2018.length.mu
      length.phi.estimates <- length.mu.phi.estimates$stern2018.length.phi

      ### Sample a mu and a phi for each gene in condition S1
      to.include <- sample(seq_len(length(length.mu.estimates)), n.vars,
                           replace = ifelse(n.vars > length(length.mu.estimates), TRUE, FALSE))
      if (is.character(lengths.dispersions)) lengths.dispersions <- length.phi.estimates[to.include]
      if (is.character(lengths.relmeans)) lengths.relmeans <- length.mu.estimates[to.include]
    }
    if (length(lengths.relmeans) != n.vars) stop("The length of the 'lengths.relmeans' vector must be the same as the number of simulated genes.")
    if (length(lengths.dispersions) != n.vars) stop("The length of the 'lengths.dispersions' vector must be the same as the number of simulated genes.")
  }

	## Define conditions
  condition <- id.condition
  S1 <- which(condition == 1)
  S2 <- which(condition == 2)

	## Define sets of upregulated, downregulated and non-differentially regulated genes
  if (length(effect.size) == 1) {
    n.upregulated <- floor(fraction.upregulated * n.diffexp)
    if (fraction.upregulated != 0 & n.diffexp != 0) {
      genes.upreg <- seq_len(n.upregulated)
    } else {
      genes.upreg <- NULL
    }
    if (fraction.upregulated != 1 & n.diffexp != 0) {
      genes.downreg <- (n.upregulated + 1):n.diffexp
    } else {
      genes.downreg <- NULL
    }
    genes.nonreg <- setdiff(seq_len(n.vars), union(genes.upreg, genes.downreg))
  } else {
    if (length(effect.size) != n.vars) {
      stop("The length of the effect.size vector must be the same as the number of simulated genes.")
    } else {
      genes.upreg <- which(effect.size > 1)
      genes.downreg <- which(effect.size < 1)
      genes.nonreg <- which(effect.size == 1)
      n.upregulated <- length(genes.upreg)
      n.diffexp <- length(genes.upreg) + length(genes.downreg)
      fraction.upregulated <- n.upregulated/n.diffexp
    }
  }

	### Differentially expressed genes
	differential.expression <- rep(0, n.vars)
	differential.expression[genes.upreg] <- 1
	differential.expression[genes.downreg] <- 1
	upregulation <- rep(0, n.vars)
	upregulation[genes.upreg] <- 1
	downregulation <- rep(0, n.vars)
	downregulation[genes.downreg] <- 1

  if (is.character(relmeans) | is.character(dispersions)) {  # if they are 'auto'
    ### Load mu and phi estimates from real data (Pickrell data set and Cheung data set)
    mu.phi.estimates <- system.file("extdata", "Pickrell.Cheung.Mu.Phi.Estimates.rds",
                                    package = "compcodeR")
    mu.phi.estimates <- readRDS(mu.phi.estimates)
    mu.estimates <- mu.phi.estimates$pickrell.cheung.mu
    phi.estimates <- mu.phi.estimates$pickrell.cheung.phi

    ### Sample a mu and a phi for each gene in condition S1
    to.include <- sample(seq_len(length(mu.estimates)), n.vars,
                         replace = ifelse(n.vars > length(mu.estimates), TRUE, FALSE))
    truedispersions.S1 <- phi.estimates[to.include]
    truemeans.S1 <- mu.estimates[to.include]
  }
  if (!is.character(relmeans)) {
    if (length(relmeans) != n.vars) stop("The length of the relmeans vector must be the same as the number of simulated genes.")
    truemeans.S1 <- c(relmeans)
  }
  if (!is.character(dispersions)) {
    if (nrow(cbind(dispersions)) != n.vars) stop("The number of provided dispersions must be the same as the number of simulated genes.")
    truedispersions.S1 <- cbind(dispersions)[, 1]
    if (ncol(cbind(dispersions)) > 1) {
      truedispersions.S2 <- cbind(dispersions)[, 2]
    } else {
      truedispersions.S2 <- truedispersions.S1
    }
  }

	### Generate sequencing depths (nfacts * Nk)
	nfacts <- stats::runif(2 * samples.per.cond, min = minfact, max = maxfact)
	seq.depths <- nfacts * seqdepth

	### If not all genes are overdispersed, let some of them be Poisson distributed (dispersion = 0)
	overdispersed <- rep(1, n.vars)
	if (fraction.non.overdispersed > 0) {
		overdispersed[genes.upreg[seq_len(round(fraction.non.overdispersed * length(genes.upreg)))]] <- 0
		overdispersed[genes.downreg[seq_len(round(fraction.non.overdispersed * length(genes.downreg)))]] <- 0
		overdispersed[genes.nonreg[seq_len(round(fraction.non.overdispersed * length(genes.nonreg)))]] <- 0
	}

	### Find rates of mapping to each gene in each condition
	prob.S1 <- truemeans.S1
	prob.S2 <- rep(0, length(prob.S1))

  if (length(effect.size) == 1) {
    for (i in seq_len(n.vars)) {
      if (i %in% genes.upreg) {
        prob.S2[i] <- (effect.size + stats::rexp(1, rate = 1)) * prob.S1[i]
      } else {
        if (i %in% genes.downreg) {
          prob.S2[i] <- 1/(effect.size + stats::rexp(1, rate = 1)) * prob.S1[i]
        } else {
          prob.S2[i] <- prob.S1[i]
        }
      }
    }
  } else {
    prob.S2 <- c(effect.size) * prob.S1
  }
	true.log2foldchange <- log2(prob.S2/prob.S1)
	sum.S1 <- sum(prob.S1)
	sum.S2 <- sum(prob.S2)

	### Find new dispersions for condition S2, depending on what prob.S2 is.
  ### From the mu/phi estimates, sample a phi value from
	### the pairs where mu is similar to prob.S2.
  if (is.character(dispersions)) {
    truedispersions.S2 <- truedispersions.S1
    if (between.group.diffdisp == TRUE) {
      for (i in seq_len(length(truedispersions.S2))) {
        sample.base <- phi.estimates[abs(log10(mu.estimates) -
                                           log10(prob.S2[i])) < 0.05]
        if (length(sample.base) < 50) {
          sample.base <-
            phi.estimates[order(abs(log10(mu.estimates) -
                                      log10(prob.S2[i])))][seq_len(500)]
        }
        truedispersions.S2[i] <- sample(sample.base, 1)
      }
    }
  }

	truedispersions.S1 <- truedispersions.S1 * overdispersed
	truedispersions.S2 <- truedispersions.S2 * overdispersed

	## Generate lengths and length factors
	nfact_length.S1 <- nfact_length.S2 <- matrix(1, n.vars, length(S1) + length(S2))
	length_matrix <- matrix(NA, 0, 0)
	if (use_lengths) {
	  if (lengths.phylo) {
	    length_matrix <- generateLengthsPhylo(tree, id.species, lengths.relmeans, lengths.dispersions)
	  } else {
	    length_matrix <- generateLengths(id.species, lengths.relmeans, lengths.dispersions)
	  }
	  nfact_length.S1 <- computeFactorLengths(length_matrix, prob.S1, sum.S1)
	  nfact_length.S2 <- computeFactorLengths(length_matrix, prob.S2, sum.S2)
	}

	if (use_tree) {
	  params_simus <- getNegativeBinomialParameters(n.vars = n.vars,
	                                                S1 = S1, prob.S1 = prob.S1, sum.S1 = sum.S1, truedispersions.S1 = truedispersions.S1, nfact_length.S1 = nfact_length.S1,
	                                                S2 = S2, prob.S2 = prob.S2, sum.S2 = sum.S2, truedispersions.S2 = truedispersions.S2, nfact_length.S2 = nfact_length.S2,
	                                                seq.depths = seq.depths)
	  model.process <- match.arg(model.process)
	  Z <- simulateDataPhylo(params_simus$count_means,
	                         params_simus$count_dispersions,
	                         tree = tree, prop.var.tree = prop.var.tree,
	                         model.process = model.process, selection.strength = selection.strength)
	} else {
	  Z <- simulateData(n.vars = n.vars,
	                    S1 = S1, prob.S1 = prob.S1, sum.S1 = sum.S1, truedispersions.S1 = truedispersions.S1, nfact_length.S1 = nfact_length.S1,
	                    S2 = S2, prob.S2 = prob.S2, sum.S2 = sum.S2, truedispersions.S2 = truedispersions.S2, nfact_length.S2 = nfact_length.S2,
	                    seq.depths = seq.depths,
	                    overdispersed = overdispersed)
	}

	### Add 'random' outliers
	random.outliers <- matrix(0, nrow(Z), ncol(Z))
	random.outliers.factor <- matrix(1, nrow(Z), ncol(Z))
	if (random.outlier.high.prob != 0 | random.outlier.low.prob != 0) {
		for (i in seq_len(nrow(Z))) {
			for (j in seq_len(ncol(Z))) {
        tmp <- stats::runif(1)
				if (tmp < random.outlier.high.prob) {
          random.outliers[i, j] <- 1
          random.outliers.factor[i, j] <- stats::runif(1, min = 5, max = 10)
				} else if (tmp < random.outlier.low.prob + random.outlier.high.prob) {
          random.outliers[i, j] <- (-1)
          random.outliers.factor[i, j] <- 1/stats::runif(1, min = 5, max = 10)
				}
			}
		}
    Z <- round(random.outliers.factor * Z)
	}

	### Add 'single' outliers
	has.single.outlier <- rep(0, n.vars)
	single.outliers <- matrix(0, nrow(Z), ncol(Z))
  single.outliers.factor <- matrix(1, nrow(Z), ncol(Z))
	if (single.outlier.high.prob != 0 | single.outlier.low.prob != 0) {
	  has.single.outlier[genes.upreg[seq_len(floor((single.outlier.high.prob +
	                                                  single.outlier.low.prob) *
	                                                 length(genes.upreg)))]] <- 1
	  has.single.outlier[genes.downreg[seq_len(floor((single.outlier.high.prob +
	                                                    single.outlier.low.prob) *
	                                                   length(genes.downreg)))]] <- 1
	  has.single.outlier[genes.nonreg[seq_len(floor((single.outlier.high.prob +
	                                                   single.outlier.low.prob) *
	                                                  length(genes.nonreg)))]] <- 1

		for (i in seq_len(nrow(Z))) {
			if (has.single.outlier[i] == 1) {
				the.sample <- sample(seq_len(ncol(Z)), 1)
        if (stats::runif(1) < (single.outlier.high.prob/(single.outlier.high.prob +
                                                    single.outlier.low.prob))) {
					single.outliers[i, the.sample] <- 1
          single.outliers.factor[i, the.sample] <- stats::runif(1, min = 5, max = 10)
				} else {
					single.outliers[i, the.sample] <- (-1)
          single.outliers.factor[i, the.sample] <- 1/stats::runif(1, min = 5, max = 10)
				}
			}
		}
    Z <- round(single.outliers.factor * Z)
	}

	### Assign variable names to rows
	rownames(Z) <- seq_len(n.vars)

	### Find number of outliers (random, single, up, down) in each group
	n.random.outliers.up.S1 <- apply(random.outliers[, S1] > 0, 1, sum)
	n.random.outliers.up.S2 <- apply(random.outliers[, S2] > 0, 1, sum)
	n.random.outliers.down.S1 <- apply(random.outliers[, S1] < 0, 1, sum)
	n.random.outliers.down.S2 <- apply(random.outliers[, S2] < 0, 1, sum)
	n.single.outliers.up.S1 <- apply(single.outliers[, S1] > 0, 1, sum)
	n.single.outliers.up.S2 <- apply(single.outliers[, S2] > 0, 1, sum)
	n.single.outliers.down.S1 <- apply(single.outliers[, S1] < 0, 1, sum)
	n.single.outliers.down.S2 <- apply(single.outliers[, S2] < 0, 1, sum)

	### Normalize (TMM) and compute A and M values from the pseudocounts
	nf <- calcNormFactors(Z)
	norm.factors <- nf * colSums(Z)
	common.libsize <- exp(mean(log(colSums(Z))))
	pseudocounts <- sweep(Z + 0.5, 2, norm.factors, '/') * common.libsize
	log2.pseudocounts <- log2(pseudocounts)
	M.value <- apply(log2.pseudocounts[, S2], 1, mean) -
    apply(log2.pseudocounts[, S1], 1, mean)
	A.value <- 0.5*(apply(log2.pseudocounts[, S2], 1, mean) +
	                  apply(log2.pseudocounts[, S1], 1, mean))

	### Normalize using lengths
	if (use_lengths) {
	  ## TPM
	  nf.TPM <- calcNormFactors(Z / length_matrix)
	  norm.factors.TPM <- nf.TPM * colSums(Z / length_matrix)
	  common.libsize.TPM <- exp(mean(log(colSums(Z / length_matrix))))
	  pseudocounts.TPM <- sweep((Z + 0.5) / length_matrix, 2, norm.factors.TPM, '/') * common.libsize.TPM
	  log2.pseudocounts.TPM <- log2(pseudocounts.TPM)
	  M.value.TPM <- apply(log2.pseudocounts.TPM[, S2], 1, mean) - apply(log2.pseudocounts.TPM[, S1], 1, mean)
	  A.value.TPM <- 0.5*(apply(log2.pseudocounts.TPM[, S2], 1, mean) + apply(log2.pseudocounts.TPM[, S1], 1, mean))
	}

	### Create an annotation data frame
	variable.annotations <- data.frame(truedispersions.S1 = truedispersions.S1,
	                                   truedispersions.S2 = truedispersions.S2,
	                                   truemeans.S1 = prob.S1,
	                                   truemeans.S2 = prob.S2,
	                                   n.random.outliers.up.S1 = n.random.outliers.up.S1,
	                                   n.random.outliers.up.S2 = n.random.outliers.up.S2,
	                                   n.random.outliers.down.S1 = n.random.outliers.down.S1,
	                                   n.random.outliers.down.S2 = n.random.outliers.down.S2,
	                                   n.single.outliers.up.S1 = n.single.outliers.up.S1,
	                                   n.single.outliers.up.S2 = n.single.outliers.up.S2,
	                                   n.single.outliers.down.S1 = n.single.outliers.down.S1,
	                                   n.single.outliers.down.S2 = n.single.outliers.down.S2,
	                                   M.value = M.value,
	                                   A.value = A.value,
	                                   truelog2foldchanges = true.log2foldchange,
	                                   upregulation = upregulation,
	                                   downregulation = downregulation,
	                                   differential.expression = differential.expression)
	rownames(variable.annotations) <- rownames(Z)

	### Create a sample annotation data frame
	sample.annotations <- data.frame(condition = condition,
	                                 depth.factor = nfacts)
	if (use_tree) sample.annotations$id.species = id.species

	### Include information about the parameters
	info.parameters <- list('n.diffexp' = n.diffexp,
	                        'fraction.upregulated' = fraction.upregulated,
	                        'between.group.diffdisp' = between.group.diffdisp,
	                        'filter.threshold.total' = filter.threshold.total,
	                        'filter.threshold.mediancpm' = filter.threshold.mediancpm,
	                        'fraction.non.overdispersed' = fraction.non.overdispersed,
	                        'random.outlier.high.prob' = random.outlier.high.prob,
	                        'random.outlier.low.prob' = random.outlier.low.prob,
	                        'single.outlier.high.prob' = single.outlier.high.prob,
	                        'single.outlier.low.prob' = single.outlier.low.prob,
	                        'effect.size' = effect.size,
	                        'samples.per.cond' = samples.per.cond,
	                        'repl.id' = repl.id, 'dataset' = dataset,
	                        'uID' = uID, 'seqdepth' = seqdepth,
	                        'minfact' = minfact, 'maxfact' = maxfact,
	                        'nEff' = nEffNaive(tree, id.condition, model.process, selection.strength),
	                        'nEffRatio' = nEffRatio(tree, id.condition, model.process, selection.strength))
	if (use_tree) {
	  variable.annotations$prop.var.tree <- prop.var.tree
	  sample.annotations$id.condition <-  id.condition
	}
	if (use_lengths) {
	  variable.annotations$lengths.relmeans <- lengths.relmeans
	  variable.annotations$lengths.dispersions <- lengths.dispersions
	  variable.annotations$M.value.TPM <- M.value.TPM
	  variable.annotations$A.value.TPM <- A.value.TPM
	}

	### Filter the data with respect to total count
	s <- apply(Z, 1, sum)
	keep.T <- which(s >= filter.threshold.total)
	Z.T <- Z[keep.T, ]
	if (length(length_matrix) != 0) {
	  length_matrix.T <- length_matrix[keep.T, ]
	} else {
	  length_matrix.T <- length_matrix
	}
	variable.annotations.T <- variable.annotations[keep.T, ]
  filtering <- paste('total count >=', filter.threshold.total)

	### Filter the data with respect to median cpm
	cpm <- sweep(Z.T, 2, apply(Z.T, 2, sum), '/') * 1e6
	m <- apply(cpm, 1, stats::median)
	keep.C <- which(m >= filter.threshold.mediancpm)
	Z.TC <- Z.T[keep.C, ]
	if (length(length_matrix) != 0) {
	  length_matrix.TC <- length_matrix.T[keep.C, ]
	} else {
	  length_matrix.TC <- length_matrix.T
	}
	variable.annotations.TC <- variable.annotations.T[keep.C, ]
  filtering <- paste(filtering, "; ", paste('median cpm >=', filter.threshold.mediancpm))

  ### Generate sample and variable names
  rownames(Z.TC) <- paste("g", 1:nrow(Z.TC), sep = "")
  if (!use_tree) colnames(Z.TC) <- paste("sample", seq_len(ncol(Z.TC)), sep = "")
  rownames(sample.annotations) <- colnames(Z.TC)
  rownames(variable.annotations.TC) <- rownames(Z.TC)
  if (length(length_matrix.TC) != 0) {
    colnames(length_matrix.TC) <- colnames(Z.TC)
    rownames(length_matrix.TC) <- rownames(Z.TC)
  }

  data.object <- compData(count.matrix = Z.TC,
                          variable.annotations = variable.annotations.TC,
                          sample.annotations = sample.annotations,
                          filtering = filtering,
                          info.parameters = info.parameters)

  if (use_tree || use_lengths) {
    data.object <- phyloCompDataFromCompData(data.object,
                                             tree = tree,
                                             length.matrix = length_matrix.TC)
  }

  ## Save results
  if (!is.null(output.file)) {
    saveRDS(data.object, file = output.file)
  }

	return(invisible(data.object))
}

computeMval <- function(count.matrix, conditions) {
  if (length(unique(conditions)) != 2) stop("Must have exactly two groups to calculate M-value")
  nf <- calcNormFactors(count.matrix)
  norm.factors <- nf * colSums(count.matrix)
  common.libsize <- exp(mean(log(colSums(count.matrix))))
  pseudocounts <- sweep(count.matrix + 0.5, 2, norm.factors, '/') * common.libsize
  log2.pseudocounts <- log2(pseudocounts)
  M.value <- apply(log2.pseudocounts[, which(conditions == levels(factor(conditions))[2])],
                   1, mean) -
    apply(log2.pseudocounts[, which(conditions == levels(factor(conditions))[1])],
          1, mean)
  return(M.value)
}

computeAval <- function(count.matrix, conditions) {
  if (length(unique(conditions)) != 2) stop("Must have exactly two groups to calculate A-value")
  nf <- calcNormFactors(count.matrix)
  norm.factors <- nf * colSums(count.matrix)
  common.libsize <- exp(mean(log(colSums(count.matrix))))
  pseudocounts <- sweep(count.matrix + 0.5, 2, norm.factors, '/') * common.libsize
  log2.pseudocounts <- log2(pseudocounts)
  A.value <- 0.5*(apply(log2.pseudocounts[, which(conditions == levels(factor(conditions))[2])],
                        1, mean) +
                    apply(log2.pseudocounts[, which(conditions ==
                                                      levels(factor(conditions))[1])],
                          1, mean))
  return(A.value)
}

#' @title Simulate the Data
#'
#' @description
#' Use the Poisson or Negative Binomial model to simulate the data.
#'
#' @inheritParams generateSyntheticData
#' @param S1 Indices in condition 1.
#' @param prob.S1 Vector of means for condition 1.
#' @param sum.S1 Sum of means for condition 1.
#' @param truedispersions.S1 Vector of dispersions for condition 1.
#' @param nfact_length.S1 Matrix of length factors for condition 1.
#' @param S2 Indices in condition 2.
#' @param prob.S2 Vector of means for condition 2.
#' @param sum.S2 Sum of means for condition 2.
#' @param truedispersions.S2 Vector of dispersions for condition 2.
#' @param nfact_length.S2 Matrix of length factors for condition 2.
#' @param overdispersed Indices that are overdispersed.
#'
#' @return Z a n.var times 2*samples.per.cond matrix with the simulated data.
#'
#' @keywords internal
#'
simulateData <- function(n.vars,
                         S1, prob.S1, sum.S1, truedispersions.S1, nfact_length.S1,
                         S2, prob.S2, sum.S2, truedispersions.S2, nfact_length.S2,
                         seq.depths,
                         overdispersed) {
  ### Initialize data matrix
  Z <- matrix(0, n.vars, length(S1) + length(S2))

  ### Generate data
  for (i in seq_len(nrow(Z))) {
    for (j in seq_len(ncol(Z))) {
      if (overdispersed[i] == 1) {
        Z[i, j] <- rnbinom(n = 1, mu = getNegativeBinomialMean(i, j,
                                                               S1, prob.S1, sum.S1, nfact_length.S1,
                                                               S2, prob.S2, sum.S2, nfact_length.S2,
                                                               seq.depths),
                           size = 1 / getNegativeBinomialDispersion(i, j,
                                                                    S1, truedispersions.S1,
                                                                    S2, truedispersions.S2))
      } else {
        Z[i, j] <- rpois(n = 1, lambda = getNegativeBinomialMean(i, j,
                                                                 S1, prob.S1, sum.S1, nfact_length.S1,
                                                                 S2, prob.S2, sum.S2, nfact_length.S2,
                                                                 seq.depths))
      }
    }
  }

  return(Z)
}

#' @title Get all parameters of the NB at once
#'
#' @description
#' Get both the mean and the dispersions of the NB as matrices for all indices.
#'
#' @inheritParams simulateData
#'
#' @return A list of parameters for each entry of the count matrix:
#' \describe{
#' \item{count_means}{a matrix of mean for each gene and sample.}
#' \item{count_dispersions}{a matrix of dispersions for each gene and sample.}
#' }
#'
#' @keywords internal
#'
getNegativeBinomialParameters <- function(n.vars,
                                          S1, prob.S1, sum.S1, truedispersions.S1, nfact_length.S1,
                                          S2, prob.S2, sum.S2, truedispersions.S2, nfact_length.S2,
                                          seq.depths) {
  ### Initialize
  count_means <- matrix(0, n.vars, length(S1) + length(S2))
  count_dispersions <- matrix(0, n.vars, length(S1) + length(S2))

    ### Generate Negative Binomial Parameters
  for (i in seq_len(n.vars)) {
    for (j in seq_len(ncol(count_means))) {
      count_means[i, j] <- getNegativeBinomialMean(i, j,
                                                   S1, prob.S1, sum.S1, nfact_length.S1,
                                                   S2, prob.S2, sum.S2, nfact_length.S2,
                                                   seq.depths)
      count_dispersions[i, j] <- getNegativeBinomialDispersion(i, j,
                                                               S1, truedispersions.S1,
                                                               S2, truedispersions.S2)
    }
  }
  return(list(count_means = count_means,
              count_dispersions = count_dispersions))
}

get_factor <- function(M, i, j) {
  if (length(M) == 0) return(1.0)
  return(M[i, j])
}

#' @title Get NB mean
#'
#' @description
#' Get the NB mean for one gene in one sample
#'
#' @inheritParams simulateData
#' @param i gene index.
#' @param j sample index.
#'
#' @return The mean for gene i in sample j.
#'
#' @keywords internal
#'
getNegativeBinomialMean <- function(i, j,
                                    S1, prob.S1, sum.S1, nfact_length.S1,
                                    S2, prob.S2, sum.S2, nfact_length.S2,
                                    seq.depths) {
  if (j %in% S1) {
    return(prob.S1[i]/sum.S1 * seq.depths[j] * get_factor(nfact_length.S1, i, j))
  } else {
    return(prob.S2[i]/sum.S2 * seq.depths[j] * get_factor(nfact_length.S2, i, j))
  }
}
#' @title Get NB dispersion
#'
#' @description
#' Get the NB dispersion for one gene in one sample
#'
#' @inheritParams simulateData
#' @param i gene index.
#' @param j sample index.
#'
#' @return The dispersion for gene i in sample j.
#'
#' @keywords internal
#'
getNegativeBinomialDispersion <- function(i, j,
                                          S1, truedispersions.S1,
                                          S2, truedispersions.S2) {
  if (j %in% S1) {
    return(truedispersions.S1[i])
  } else {
    return(truedispersions.S2[i])
  }
}

#' @title Simulate a length matrix
#'
#' @description
#' Simulate a length matrix of size n.vars times n.sample, with the length of
#' each gene in each sample.
#'
#' @param id.species An n.sample vector, indicating the species of each sample.
#' @param lengths.relmeans A vector of mean values to use in the simulation of
#' lengths from the Negative Binomial distribution.
#' @param lengths.dispersions A vector or matrix of dispersions to use in the
#' simulation of data from the Negative Binomial distribution.
#'
#' @return A matrix of lengths, with as many columns as the number of species (length of id.species)
#' and as many rows as the number of parameters in lengths.relmeans.
#'
#' @keywords internal
#'
generateLengths <- function(id.species, lengths.relmeans, lengths.dispersions) {
  length_matrix <- matrix(NA, length(lengths.relmeans), length(id.species))
  for(i in 1:length(lengths.relmeans)){
    sims <- rnbinom(length(unique(id.species)),
                    mu = lengths.relmeans[i],
                    size = 1 / lengths.dispersions[i])
    if (any(sims == 0)) {
      ntry <- 1
      while (any(sims == 0) && ntry <= 100) {
        sims <- rnbinom(length(unique(id.species)),
                        mu = lengths.relmeans[i],
                        size = 1 / lengths.dispersions[i])
        ntry <- ntry + 1
      }
      if (any(sims == 0)) {
        warning(paste0("After 100 tries, could not generate non-zero lengths for gene ", i, ". Replacing zeros with the provided mean."))
        sims[sims == 0] <- lengths.relmeans[i]
      }
    }
    length_matrix[i, ] <- sims[id.species]
  }
  return(length_matrix)
}

#' @title Compute Length Normalization Factors
#'
#' @description
#' Compute the factor to be applied for length normalization.
#' Each column of the matrix (samples) is normalized by the weighted average of
#' the column, with weights corresponding to the true probabilities of each gene.
#'
#' @param length_matrix An n.vars times n.sample matrix of lengths of each gene in each sample.
#' @param prob.S1 Vector of means for condition 1.
#' @param sum.S1 Sum of means for condition 1.
#'
#' @return A matrix of the same size as 'length_matrix', with normalization factors
#' to be applied for each sample and each gene.
#'
#' @keywords internal
#'
computeFactorLengths <- function(length_matrix, prob.S1, sum.S1) {
  return(length_matrix %*% diag(1 / colSums(diag(prob.S1 / sum.S1) %*% length_matrix)))
}

#' Summarize a synthetic data set by some diagnostic plots
#'
#' Summarize a synthetic data set (generated by \code{\link{generateSyntheticData}}) by some diagnostic plots.
#'
#' @param data.set A data set, either a \code{\link{compData}} object or a path to an \code{.rds} file where such an object is stored.
#' @param output.filename The filename of the resulting html report (including the path).
#' @export
#' @author Charlotte Soneson
#' @examples
#' tmpdir <- normalizePath(tempdir(), winslash = "/")
#' mydata.obj <- generateSyntheticData(dataset = "mydata", n.vars = 1000,
#'                                     samples.per.cond = 5, n.diffexp = 100,
#'                                     output.file = file.path(tmpdir, "mydata.rds"))
#' if (interactive()) {
#'   summarizeSyntheticDataSet(data.set = file.path(tmpdir, "mydata.rds"),
#'                             output.filename = file.path(tmpdir, "mydata_check.html"))
#' }
summarizeSyntheticDataSet <- function(data.set, output.filename) {

  ## Check that the output.filename ends with .html
  if (!(substr(output.filename, nchar(output.filename) - 4,
               nchar(output.filename)) == ".html")) {
    stop("output.file must be an .html file.")
  }

  ## Generate an .Rmd file
  ##output.filename <- normalizePath(output.filename)
  Rmd.file <- sub(".html", ".Rmd", output.filename)
  md.file <- sub(".html", ".md", output.filename)
  codefile <- file(Rmd.file, open = 'w')
  output.directory <- dirname(output.filename)

  writeLines("### Summary of synthetic data set", codefile)

  opts_chunk$set(fig.path = file.path(output.directory, "compcodeR_check_figure/"))
  opts_chunk$set(fig.width = 9, fig.height = 7)

  if (is.character(data.set) &&
        substr(data.set, nchar(data.set) - 3, nchar(data.set)) == ".rds") {
    dataset.filename <- data.set
    data.set <- readRDS(data.set)
    writeLines(paste("Data set from file:", dataset.filename), codefile)
  }
  if (is(data.set, "compData")) {
    data.set <- data.set
  } else if (is.list(data.set)) {
    data.set <- convertListTocompData(data.set)
  } else {
    stop("Unknown input type")
  }

  ## Print the data parameters
  writeLines(c("```{r settings, echo = FALSE}",
               "print(noquote(lapply(info.parameters(data.set), function(x) {
if (length(x) > 25) x <- noquote(c(x[seq_len(25)], '...'))
               x})), sep = '\n')",
               "```"), codefile)

  writeLines(c("```{r modifications, echo = FALSE}",
               "  variable.annotations(data.set)$differential.expression = as.factor(variable.annotations(data.set)$differential.expression)",
               "variable.annotations(data.set)$total.nbr.outliers = as.factor(variable.annotations(data.set)$n.random.outliers.up.S1 + variable.annotations(data.set)$n.random.outliers.up.S2 + variable.annotations(data.set)$n.random.outliers.down.S1 + variable.annotations(data.set)$n.random.outliers.down.S2 + variable.annotations(data.set)$n.single.outliers.up.S1 + variable.annotations(data.set)$n.single.outliers.up.S2 + variable.annotations(data.set)$n.single.outliers.down.S1 + variable.annotations(data.set)$n.single.outliers.down.S2)",
               "```"), codefile)

  ## MA plots, colored by various annotations
  writeLines(c("```{r calcma, echo = FALSE, dev = 'png', eval = TRUE, include = TRUE}",
               "if (nrow(variable.annotations(data.set)) == 0) {",
               "variable.annotations(data.set) = data.frame(dummy = rep(NA, nrow(count.matrix(data.set))))}",
               "if (is.null(variable.annotations(data.set)$A.value)) {",
               "variable.annotations(data.set)$A.value = computeAval(count.matrix(data.set), sample.annotations(data.set)$condition)}",
               "if (is.null(variable.annotations(data.set)$M.value)) {",
               "variable.annotations(data.set)$M.value = computeMval(count.matrix(data.set), sample.annotations(data.set)$condition)}",
               "if (is.null(variable.annotations(data.set)$differential.expression)) {",
               "variable.annotations(data.set)$differential.expression = factor(rep('NA', nrow(variable.annotations(data.set))))}",
               "if (is.null(variable.annotations(data.set)$total.nbr.outliers)) {",
               "variable.annotations(data.set)$total.nbr.outliers = factor(rep('NA', nrow(variable.annotations(data.set))))}",
               "```"), codefile)

  ## Colored by true differential expression status
  writeLines("### MA plot, colored by true differential expression status", codefile)
  writeLines(c("```{r maplot-trueDEstatus, echo = FALSE, dev = 'png', eval = TRUE, include = TRUE, message = FALSE, error = TRUE, warning = TRUE}",
               "ggplot(variable.annotations(data.set), aes(x = A.value, y = M.value, color = differential.expression)) + geom_point() + scale_colour_manual(values = c('black', 'red'))",
               "```"), codefile)

  if (length(length.matrix(data.set)) != 0) { # There are lengths
    writeLines("### MA plot, log2 TPM normalized, colored by true differential expression status", codefile)
    writeLines(c("```{r maplot-trueDEstatus-logTPM, echo = FALSE, dev = 'png', eval = TRUE, include = TRUE, message = FALSE, error = TRUE, warning = TRUE}",
                 "ggplot(variable.annotations(data.set), aes(x = A.value.TPM, y = M.value.TPM, color = differential.expression)) + geom_point() + scale_colour_manual(values = c('black', 'red'))",
                 "```"), codefile)
  }

  ## Colored by number of outlier counts
  writeLines("### MA plot, colored by total number of outliers", codefile)
  writeLines(c("```{r maplot-nbroutliers, echo = FALSE, dev = 'png', eval = TRUE, include = TRUE, message = FALSE, error = TRUE, warning = TRUE}",
               "ggplot(variable.annotations(data.set), aes(x = A.value, y = M.value, color = total.nbr.outliers)) + geom_point()",
               "```"), codefile)

  ## Plots of estimated log2-fold change (M-value) vs true log2-fold change, colored by true differential expression status
  writeLines("### True log2-fold change vs estimated log2-fold change (M-value)", codefile)
  writeLines(c("```{r logfoldchanges, echo = FALSE, dev = 'png', eval = TRUE, include = TRUE, message = FALSE, error = TRUE, warning = TRUE}",
               "if (!is.null(variable.annotations(data.set)$truelog2foldchanges)) {",
               "ggplot(variable.annotations(data.set), aes(x = truelog2foldchanges, y = M.value, color = differential.expression)) + geom_point() + scale_colour_manual(values = c('black', 'red'))}",
               "```"), codefile)

  if (length(length.matrix(data.set)) != 0) { # There are lengths
    writeLines("### True log2-fold change vs estimated TPM log2-fold change (M-value)", codefile)
    writeLines(c("```{r logfoldchanges-logTPM, echo = FALSE, dev = 'png', eval = TRUE, include = TRUE, message = FALSE, error = TRUE, warning = TRUE}",
                 "if (!is.null(variable.annotations(data.set)$truelog2foldchanges)) {",
                 "ggplot(variable.annotations(data.set), aes(x = truelog2foldchanges, y = M.value.TPM, color = differential.expression)) + geom_point() + scale_colour_manual(values = c('black', 'red'))}",
                 "```"), codefile)
  }

  ## Phylogenetic heatmap
  if (length(phylo.tree(data.set)) != 0 && info.parameters(data.set)$n.diffexp > 0) { # There is a tree, and there are some DE genes
    if (!requireNamespace("ggtree", quietly = TRUE) || !requireNamespace("tidytree", quietly = TRUE)) {
      warning("Packages 'ggtree' and 'tidytree' are needed for phylogenetic heatmap plot. Skipping.", call. = FALSE)
    } else {
      writeLines("### log2 normalized counts heatmap plotted on the phylogeny", codefile)
      writeLines(c("```{r maplot-phyloHeatmap, echo = FALSE, dev = 'png', eval = TRUE, include = TRUE, message = FALSE, error = TRUE, warning = TRUE}",
                   "library(ggtree)",
                   "library(tidytree)",
                   "",
                   "tree <- phylo.tree(data.set)",
                   "conds <- data.frame(label = rownames(sample.annotations(data.set)), condition = as.factor(sample.annotations(data.set)$id.condition))",
                   "gt <- ggtree(tree)",
                   "gt <- gt %<+% conds + geom_tippoint(aes(color = condition))",
                   "",
                   "Z <- count.matrix(data.set)",
                   "nf <- edgeR::calcNormFactors(Z)",
                   "norm.factors <- nf * colSums(Z)",
                   "common.libsize <- exp(mean(log(colSums(Z))))",
                   "pseudocounts <- sweep(Z + 0.5, 2, norm.factors, '/') * common.libsize",
                   "log2.pseudocounts <- log2(pseudocounts)",
                   "",
                   "Z1 <- log2.pseudocounts[which(variable.annotations(data.set)$differential.expression == 1), ]",
                   "Z2 <- log2.pseudocounts[which(variable.annotations(data.set)$differential.expression == 0)[seq_len(nrow(Z1))], ]",
                   "",
                   "gh <- gheatmap(gt, t(Z1), offset = 0.1, width = 5, colnames = FALSE)",
                   "gh <- gheatmap(gh, t(Z2), offset = 5.2, width = 5, colnames = FALSE)",
                   "gh <- gh + scale_fill_viridis_c(option = 'plasma', direction = -1, name = 'log2 norm count')",
                   "gh",
                   "```"), codefile)
      writeLines(c("",
                   "Tips are colored by the true differential expression status.",
                   "Only a subset of `r nrow(Z1) + nrow(Z2)` genes are represented.",
                   "The first block of `r nrow(Z1)` genes are differentially expressed between condition 1 and 2.",
                   "The second block of `r nrow(Z2)` genes are not differentially expressed."),
                 codefile)
    }
  }
  ## Correlation heatmap
  if (length(phylo.tree(data.set)) != 0) { # There is a tree
    writeLines("### log2 normalized counts samples correlation heatmap", codefile)
    writeLines(c("```{r maplot-corHeatmap, echo = FALSE, dev = 'png', eval = TRUE, include = TRUE, message = FALSE, error = TRUE, warning = TRUE}",
                 "Z <- count.matrix(data.set)",
                 "nf <- edgeR::calcNormFactors(Z)",
                 "norm.factors <- nf * colSums(Z)",
                 "common.libsize <- exp(mean(log(colSums(Z))))",
                 "pseudocounts <- sweep(Z + 0.5, 2, norm.factors, '/') * common.libsize",
                 "log2.pseudocounts <- log2(pseudocounts)",
                 "",
                 "sampleDists <- dist( t( log2.pseudocounts ) )",
                 "sampleDistMatrix <- as.matrix(sampleDists)",
                 "heatmap(sampleDistMatrix, col = hcl.colors(256, palette = 'Plasma'))",
                 "```"), codefile)
    writeLines(c("",
                 "Note: the heatmap tree is the correlation tree (the phylogeny is not taken into account in this plot)."),
               codefile)
  }
  ##
  if (length(length.matrix(data.set)) != 0) { # There are lengths
    writeLines("### Lengths: mean versus variance (log2)", codefile)
    writeLines(c(
      "```{r lengths, echo = FALSE, dev = 'png', eval = TRUE, include = TRUE, message = FALSE, error = TRUE, warning = TRUE}",
      "length.matrix_species <- length.matrix(data.set)[, !duplicated(sample.annotations(data.set)$id.species)]",
      "stats_lengths <- data.frame(mean = c(variable.annotations(data.set)$lengths.relmeans,",
      "                                     rowMeans(length.matrix_species)),",
      "                            var = c(variable.annotations(data.set)$lengths.relmeans + variable.annotations(data.set)$lengths.relmeans^2 * variable.annotations(data.set)$lengths.dispersion,",
      "                                    apply(length.matrix_species, 1, var)),",
      "                            simulated = c(rep('true values', nrow(length.matrix_species)),",
      "                                          rep('simulated values', nrow(length.matrix_species))))",
      "ggplot(stats_lengths, aes(x = log2(mean), y = log2(var), color = simulated)) + geom_point() + scale_color_manual(values = c('black', 'blue')) + geom_abline(slope = 1) + coord_fixed(xlim = c(0, NA), ylim = c(0, NA))",
      "```"
    ), codefile)
  }

  close(codefile)

  out <- knitr::knit(input = Rmd.file,
                     output = md.file,
                     envir =  new.env())
  rmarkdown::pandoc_convert(input = out,
                            output = output.filename)
  # markdown::markdownToHTML(file = out,
  #                          output = output.filename,
  #                          encoding = "UTF-8",
  #                          title = "Synthetic data set summary")

  save.rmdfile <- FALSE  ## set to TRUE to save the .Rmd file (debug)
  ## Remove the .Rmd file
  if (!save.rmdfile) file.remove(Rmd.file)

  ## Remove the .md file
  file.remove(md.file)

}
csoneson/compcodeR documentation built on Dec. 23, 2024, 10:42 a.m.