In linnykos/eSVD2: eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis

knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)

Purpose

The simulations here are meant to be portable, toy-examples of eSVD2, with the intent to demonstrate that an installation of eSVD2 was successful.

A successful installation of eSVD2 is required for this. See the last section of this README of all the system/package dependencies used when creating these simulations. Both simulations should complete in less than 2 minutes.

Overview

Broadly speaking, eSVD-DE (the name of the method, which is implemented in the eSVD2 package) is a pipeline of 6 different function calls. The procedure starts with primarily 3 inputs,

dat, either matrix or dgCMatrix, with n rows (for n cells) and p columns (for p genes). We advise making sure each row/column of either matrix has non-zero variance prior to using this pipeline.
covariates, a matrix with n rows with the same rownames as dat where the columns represent the different covariates Notably, this should contain only numerical columns (i.e., all categorical variables should have already been split into numerous indicator variables), and all the columns in covariates will (strictly speaking) be included in the eSVD matrix factorization model.
metadata_individual, a factor vector with length n which denotes which cell originates from which individual.

library(eSVD2)
library(Seurat)

Simulating data

We create a simple dataset that has no differentially expressed genes via the eSVD2::generate_null() function.

set.seed(10)
res <- eSVD2::generate_null()
covariates <- res$covariates
metadata_individual <- res$metadata_individual
obs_mat <- res$obs_mat

If you have Seurat also installed, you can run the following lines as well,

set.seed(10)
meta_data <- data.frame(covariates[, c("CC", "Sex", "Age")])
meta_data$Individual <- metadata_individual
seurat_obj <- Seurat::CreateSeuratObject(counts = Matrix::t(obs_mat), meta.data = meta_data)
Seurat::VariableFeatures(seurat_obj) <- rownames(seurat_obj)
seurat_obj <- Seurat::NormalizeData(seurat_obj)
seurat_obj <- Seurat::ScaleData(seurat_obj)
seurat_obj <- Seurat::RunPCA(seurat_obj, verbose = F)
seurat_obj <- Seurat::RunUMAP(seurat_obj, dims = 1:5)

Seurat::DimPlot(seurat_obj, reduction = "umap", group.by = "CC")

The separation between case and control cells is purely determined by the individual covariates of Age and Sex, not the case-control status.

Using eSVD-DE

Step 1 (Initializing the eSVD object): First, we initialize the object of class eSVD, which we will call eSVD_obj. This is done using the eSVD2::initialize_esvd() function. All the following operations will use eSVD_obj for all calculations through the usage of this method. This is where we pass the inputs dat, covariates, and metadata_individual. Notice that after eSVD2::initialize_esvd() (and later, after eSVD2::opt_esvd()) we always call eSVD2::reparameterization_esvd_covariates() to ensure that the fit remains identifiable.

Empirically, we have found it useful to set omitted_variables as the Log_UMI and the batch variables. (Here, there are no batch variables.) This means that when eSVD2 reparameterizes the variables, it does not fiddle with the coefficient for Log_UMI. We found this beneficial since the coefficient for specifically the covariate Log_UMI bears a special meaning (as it impacts the normalization of the sequencing depth most directly), and we do not wish to "mix" covariates correlated with Log_UMI too early. (Overall, this is not a strict recommendation, however, we have found this practice to be beneficial in almost all instances we have used eSVD2, so we ourselves have not tried deviating from this convention.)

This typically looks like the following.

set.seed(10)
eSVD_obj <- eSVD2::initialize_esvd(
  dat = obs_mat,
  covariates = covariates,
  metadata_individual = metadata_individual,
  case_control_variable = "CC",
  bool_intercept = T,
  k = 5,
  lambda = 0.1,
  verbose = 0
)
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_Init",
  omitted_variables = "Log_UMI"
)

names(eSVD_obj)

Step 2 (Fitting the eSVD object): We now fit the eSVD. Typically, we recommend fitting twice (i.e., calling eSVD2::opt_esvd() twice) due to the non-convex nature of the objective function. In the first call of eSVD2::opt_esvd(), hold all the coefficients to all the covariates aside from the case-control covariate (here, denoted as "CC" so we remove as much of the case-control covariate effect as possible. Then, in the second call of eSVD2::opt_esvd(), we allow all the coefficients to be optimized.

Note that in the last fit of eSVD2::opt_esvd(), we do not set any omitted_variables. This is our typical recommendation -- it is good to allow all coefficients to be reparameterized right before the optimization is all done.

set.seed(10)
eSVD_obj <- eSVD2::opt_esvd(
  input_obj = eSVD_obj,
  l2pen = 0.1,
  max_iter = 100,
  offset_variables = setdiff(colnames(eSVD_obj$covariates), "CC"),
  tol = 1e-6,
  verbose = 0,
  fit_name = "fit_First",
  fit_previous = "fit_Init"
)
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_First",
  omitted_variables = "Log_UMI"
)

set.seed(10)
eSVD_obj <- eSVD2::opt_esvd(
  input_obj = eSVD_obj,
  l2pen = 0.1,
  max_iter = 100,
  offset_variables = NULL,
  tol = 1e-6,
  verbose = 0,
  fit_name = "fit_Second",
  fit_previous = "fit_First"
)

eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_Second",
  omitted_variables = NULL
)

names(eSVD_obj)

Step 3 (Estimating the nuisance (i.e., over-dispersion) parameters): After fitting the eSVD, we need to estimate the nuisance parameters. Since we are modeling the data via a negative binomial, this nuisance parameter is the over-dispersion parameter. This is done via the eSVD2::estimate_nuisance() function.

set.seed(10)
eSVD_obj <- eSVD2::estimate_nuisance(
  input_obj = eSVD_obj,
  bool_covariates_as_library = TRUE,
  bool_library_includes_interept = TRUE,
  bool_use_log = FALSE,
  min_val = 1e-4,
  verbose = 0
)

names(eSVD_obj)

Step 4 (Computing the posterior): With the over-disperison parameter estimated, we are now ready to compute the posterior mean and variance of each cell's gene expression value. This posterior distribution is to balance out the fitted eSVD denoised value against the covariate-adjusted library size. This step ensures that even though we pooled all the genes to denoise the single-cell expression matrix, we do not later inflate the Type-1 error. This is done via the eSVD2::compute_posterior() function.

set.seed(10)
eSVD_obj <- eSVD2::compute_posterior(
  input_obj = eSVD_obj,
  bool_adjust_covariates = FALSE,
  alpha_max = 100,
  bool_covariates_as_library = TRUE,
  bool_stabilize_underdispersion = TRUE,
  library_min = 1,
  pseudocount = 0
)

names(eSVD_obj)

Step 5 (Computing the test statistics): Now we can compute the test statistics using the eSVD2::compute_test_statistic() function.

set.seed(10)
eSVD_obj <- eSVD2::compute_test_statistic(input_obj = eSVD_obj, verbose = 0)

names(eSVD_obj)

Step 6 (Computing the p-values): Lastly, we can compute the p-values via the eSVD2::compute_pvalue() function.

set.seed(10)
eSVD_obj <- eSVD2::compute_pvalue(input_obj = eSVD_obj)

names(eSVD_obj)

To visualize the p-values, we can plot the theoretical quantiles against the observed quantiles. We see that despite the UMAP above showing strong differences between the case and control cells, the p-values are all uniformly distributed. This shows that we do not inflate the Type-1 error.

pvalue_vec <- 10 ^ (-eSVD_obj$pvalue_list$log10pvalue)
graphics::plot(
  x = sort(pvalue_vec),
  y = seq(0, 1, length.out = length(pvalue_vec)),
  asp = TRUE,
  pch = 16,
  xlab = "Observed quantile",
  ylab = "Theoretical quantile"
)
graphics::lines(
  x = c(0, 1),
  y = c(0, 1),
  col = "red",
  lty = 2,
  lwd = 2
)

Setup

The following shows the suggested package versions that the developer (GitHub username: linnykos) used when developing the eSVD2 package.

devtools::session_info()

linnykos/eSVD2 documentation built on July 17, 2024, 12:01 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

linnykos/eSVD2
eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis

In linnykos/eSVD2: eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis

Purpose

Overview

Simulating data

Using eSVD-DE

Setup

R Package Documentation

Browse R Packages

We want your feedback!

linnykos/eSVD2 eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis

In linnykos/eSVD2: eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis

Purpose

Overview

Simulating data

Using eSVD-DE

Setup

R Package Documentation

Browse R Packages

We want your feedback!

linnykos/eSVD2
eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis