knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The simulations here are meant to be portable, toy-examples of eSVD2, with the intent to demonstrate that an installation of eSVD2 was successful.
A successful installation of eSVD2
is required for this. See the last section of this README of all the system/package dependencies used when creating these simulations. Both simulations should complete in less than 2 minutes.
Broadly speaking, eSVD-DE (the name of the method, which is implemented in the eSVD2
package) is a pipeline of 6 different function calls. The procedure starts with primarily 3 inputs,
dat
, either matrix
or dgCMatrix
, with n
rows (for n
cells) and p
columns (for p
genes). We advise making sure each row/column of either matrix has non-zero variance prior to using this pipeline.covariates
, a matrix
with n
rows with the same rownames as dat
where the columns represent the different covariates Notably, this should contain only numerical columns (i.e., all categorical variables should have already been split into numerous indicator variables), and all the columns in covariates
will (strictly speaking) be included in the eSVD matrix factorization model.metadata_individual
, a factor
vector with length n
which denotes which cell originates from which individual.library(eSVD2) library(Seurat)
We create a simple dataset that has no differentially expressed genes via the eSVD2::generate_null()
function.
set.seed(10) res <- eSVD2::generate_null() covariates <- res$covariates metadata_individual <- res$metadata_individual obs_mat <- res$obs_mat
If you have Seurat also installed, you can run the following lines as well,
set.seed(10) meta_data <- data.frame(covariates[, c("CC", "Sex", "Age")]) meta_data$Individual <- metadata_individual seurat_obj <- Seurat::CreateSeuratObject(counts = Matrix::t(obs_mat), meta.data = meta_data) Seurat::VariableFeatures(seurat_obj) <- rownames(seurat_obj) seurat_obj <- Seurat::NormalizeData(seurat_obj) seurat_obj <- Seurat::ScaleData(seurat_obj) seurat_obj <- Seurat::RunPCA(seurat_obj, verbose = F) seurat_obj <- Seurat::RunUMAP(seurat_obj, dims = 1:5) Seurat::DimPlot(seurat_obj, reduction = "umap", group.by = "CC")
The separation between case and control cells is purely determined by the individual covariates of Age and Sex, not the case-control status.
eSVD
, which we will call eSVD_obj
. This is done using the eSVD2::initialize_esvd()
function. All the following operations will use
eSVD_obj
for all calculations through the usage of this method. This is where we pass the inputs dat
, covariates
, and metadata_individual
. Notice that after eSVD2::initialize_esvd()
(and later, after eSVD2::opt_esvd()
) we always call eSVD2::reparameterization_esvd_covariates()
to ensure that the fit remains identifiable.Empirically, we have found it useful to set omitted_variables
as the Log_UMI
and the batch variables. (Here, there are no batch variables.) This means that when eSVD2 reparameterizes the variables, it does not fiddle with the coefficient for Log_UMI
. We found this beneficial since the coefficient for specifically the covariate Log_UMI
bears a special meaning (as it impacts the normalization of the sequencing depth most directly), and we do not wish to "mix" covariates correlated with Log_UMI
too early. (Overall, this is not a strict recommendation, however, we have found this practice to be beneficial in almost all instances we have used eSVD2, so we ourselves have not tried deviating from this convention.)
This typically looks like the following.
set.seed(10) eSVD_obj <- eSVD2::initialize_esvd( dat = obs_mat, covariates = covariates, metadata_individual = metadata_individual, case_control_variable = "CC", bool_intercept = T, k = 5, lambda = 0.1, verbose = 0 ) eSVD_obj <- eSVD2::reparameterization_esvd_covariates( input_obj = eSVD_obj, fit_name = "fit_Init", omitted_variables = "Log_UMI" ) names(eSVD_obj)
eSVD2::opt_esvd()
twice) due to the non-convex nature of the objective function. In the first call of eSVD2::opt_esvd()
, hold all the coefficients to all the covariates aside from the case-control covariate (here, denoted as "CC"
so we remove as much of the case-control covariate effect as possible. Then, in the second call of eSVD2::opt_esvd()
, we allow all the coefficients to be optimized.Note that in the last fit of eSVD2::opt_esvd()
, we do not set any omitted_variables
. This is our typical recommendation -- it is good to allow all coefficients to be reparameterized right before the optimization is all done.
set.seed(10) eSVD_obj <- eSVD2::opt_esvd( input_obj = eSVD_obj, l2pen = 0.1, max_iter = 100, offset_variables = setdiff(colnames(eSVD_obj$covariates), "CC"), tol = 1e-6, verbose = 0, fit_name = "fit_First", fit_previous = "fit_Init" ) eSVD_obj <- eSVD2::reparameterization_esvd_covariates( input_obj = eSVD_obj, fit_name = "fit_First", omitted_variables = "Log_UMI" ) set.seed(10) eSVD_obj <- eSVD2::opt_esvd( input_obj = eSVD_obj, l2pen = 0.1, max_iter = 100, offset_variables = NULL, tol = 1e-6, verbose = 0, fit_name = "fit_Second", fit_previous = "fit_First" ) eSVD_obj <- eSVD2::reparameterization_esvd_covariates( input_obj = eSVD_obj, fit_name = "fit_Second", omitted_variables = NULL ) names(eSVD_obj)
eSVD2::estimate_nuisance()
function.set.seed(10) eSVD_obj <- eSVD2::estimate_nuisance( input_obj = eSVD_obj, bool_covariates_as_library = TRUE, bool_library_includes_interept = TRUE, bool_use_log = FALSE, min_val = 1e-4, verbose = 0 ) names(eSVD_obj)
eSVD2::compute_posterior()
function.set.seed(10) eSVD_obj <- eSVD2::compute_posterior( input_obj = eSVD_obj, bool_adjust_covariates = FALSE, alpha_max = 100, bool_covariates_as_library = TRUE, bool_stabilize_underdispersion = TRUE, library_min = 1, pseudocount = 0 ) names(eSVD_obj)
eSVD2::compute_test_statistic()
function.set.seed(10) eSVD_obj <- eSVD2::compute_test_statistic(input_obj = eSVD_obj, verbose = 0) names(eSVD_obj)
eSVD2::compute_pvalue()
function.set.seed(10) eSVD_obj <- eSVD2::compute_pvalue(input_obj = eSVD_obj) names(eSVD_obj)
To visualize the p-values, we can plot the theoretical quantiles against the observed quantiles. We see that despite the UMAP above showing strong differences between the case and control cells, the p-values are all uniformly distributed. This shows that we do not inflate the Type-1 error.
pvalue_vec <- 10 ^ (-eSVD_obj$pvalue_list$log10pvalue) graphics::plot( x = sort(pvalue_vec), y = seq(0, 1, length.out = length(pvalue_vec)), asp = TRUE, pch = 16, xlab = "Observed quantile", ylab = "Theoretical quantile" ) graphics::lines( x = c(0, 1), y = c(0, 1), col = "red", lty = 2, lwd = 2 )
The following shows the suggested package versions that the developer (GitHub username: linnykos) used when developing the eSVD2 package.
devtools::session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.