In linnykos/eSVD2: eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis

knitr::opts_chunk$set(echo = TRUE)

For this vignette, we will be using a dataset that can be downloaded from https://www.dropbox.com/scl/fi/csyz62obxx8xj5gl5sgp4/asd_seurat.RData?rlkey=d7dlhqjesunal52u8z2iomjsg&dl=0. This file is called asd_seurat.RData, and contains the following variables:

asd: The Seurat object we will be analyzing, containing 13,302 cells and 6,599 genes (all of which are deemed highly variable)
date_of_run: The date when asd was constructed
session_info: The session info when asd was constructed

Notably, this is not the version of the dataset used in the eSVD-DE paper. Instead, the intent of this vignette to show you how to perform an eSVD-DE analysis. That is, while the processed dataset in the paper is different from this vignette's processed dataset, the procedure we showcase is here is intended as the typical usage of our method.

If you wish to see how this data was preprocessed starting from publicly available data (or you cannot download the .RData file), please see https://linnykos.github.io/eSVD2/articles/asd-preprocess.html.

Relevant citation:

Velmeshev, D., Schirmer, L., Jung, D., Haeussler, M., Perez, Y., Mayer, S., ... & Kriegstein, A. R. (2019). Single-cell genomics identifies cell type–specific molecular changes in autism. Science, 364(6441), 685-689.

library(Seurat)
library(SeuratObject)
library(EnhancedVolcano)

set.seed(10)
date_of_run <- Sys.time()
session_info <- devtools::session_info()

Loading the data

We first load the data. We can use the table() function to roughly see how many cells each donor contributes and how many donors were classified to be ASD or Control.

load("asd_seurat.RData")
asd
table(asd$individual, asd$diagnosis)

An object of class Seurat 
6599 features across 13302 samples within 1 assay 
Active assay: RNA (6599 features, 6599 variable features)
 2 layers present: counts, data

            ASD Control
indiv_1823    0      72
indiv_4341    0     733
indiv_4849  428       0
indiv_4899  349       0
indiv_5144  202       0
indiv_5163    0     145
indiv_5242    0      81
indiv_5278  944       0
indiv_5294  449       0
indiv_5387    0    1129
indiv_5391    0     237
indiv_5403   69       0
indiv_5408    0     162
indiv_5419  327       0
indiv_5531  631       0
indiv_5538    0     391
indiv_5554    0     233
indiv_5565  881       0
indiv_5577    0     215
indiv_5841  451       0
indiv_5864  275       0
indiv_5879    0     278
indiv_5893    0     976
indiv_5936    0     193
indiv_5939 1016       0
indiv_5945  422       0
indiv_5958    0     826
indiv_5976    0     106
indiv_5978  264       0
indiv_6032    0     289
indiv_6033  528       0

Using eSVD-DE

We will be testing for differential expression in the mean between the 15 donors with ASD and the 16 control donors, among the 6599 genes.

Step 0 (Formatting the covariates): Before we use eSVD-DE, we want to reformat all the covariates. Typically, it is a good idea to set all the categorical covariates to be factor variables where the last level has the smallest size. This is because that last level will be the one that does not get its own indicator column during the one-hot encoding when using eSVD2::format_covariates().

gene_vec <- Seurat::VariableFeatures(asd[["RNA"]])
mat <- SeuratObject::LayerData(asd,
                               layer = "counts",
                               assay = "RNA",
                               features = gene_vec)
mat <- Matrix::t(mat)
covariate_df <- asd@meta.data[, c("individual",
                                  "region",
                                  "age",
                                  "sex",
                                  "diagnosis",
                                  "Seqbatch",
                                  "Capbatch")]

factor_variables <- c("region", "diagnosis", "sex", "Seqbatch", "Capbatch")
for (variable in factor_variables) {
  covariate_df[, variable] <- factor(covariate_df[, variable], levels = names(sort(table(covariate_df[, variable]), decreasing = T)))
}

covariate_df[, "diagnosis"] <- factor(covariate_df[, "diagnosis"], levels = c("Control", "ASD"))

covariates <- eSVD2::format_covariates(
  dat = mat,
  covariate_df = covariate_df,
  rescale_numeric_variables = "age"
)

Step 1 (Initializing the eSVD object): First, we initialize the object of class eSVD, which we will call eSVD_obj. This is done using the eSVD2::initialize_esvd() function. All the following operations will use eSVD_obj for all calculations through the usage of this method. This is where we pass the main inputs dat, covariates, and metadata_individual. Notice that after eSVD2::initialize_esvd() (and later, after eSVD2::opt_esvd()) we always call eSVD2::reparameterization_esvd_covariates() to ensure that the fit remains identifiable.

Empirically, we have found it useful to set omitted_variables as the Log_UMI and the batch variables. This means that when eSVD2 reparameterizes the variables, it does not fiddle with the coefficient for Log_UMI. We found this beneficial since the coefficient for specifically the covariate Log_UMI bears a special meaning (as it impacts the normalization of the sequencing depth most directly), and we do not wish to "mix" covariates correlated with Log_UMI too early. We have also found it useful to keep the batch variables separate since some datasets we have analyzed have donor covariates that are very correlated with the batch variable, and keeping these two aspects (i.e., the biological effect of donor covariates, and the artifical effect of the sequencing batches) seperate to be beneficial. (Overall, this is not a strict recommendation, however, we have found this practice to be beneficial in almost all instances we have used eSVD2, so we ourselves have not tried deviating from this convention.)

On a single core of a 2023 Macbook Pro (Apple M2 Max, 32 Gb RAM), this took roughly 7 minutes.

time_start1 <- Sys.time()
eSVD_obj <- eSVD2::initialize_esvd(
  dat = mat,
  covariates = covariates[, -grep("individual", colnames(covariates))],
  metadata_individual = covariate_df[, "individual"],
  case_control_variable = "diagnosis_ASD",
  bool_intercept = TRUE,
  k = 30,
  lambda = 0.1,
  metadata_case_control = covariates[, "diagnosis_ASD"],
  verbose = 1
)
time_end1 <- Sys.time()

omitted_variables <- colnames(eSVD_obj$covariates)[c(grep("Seqbatch", colnames(eSVD_obj$covariates)), grep("Capbatch", colnames(eSVD_obj$covariates)))]
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_Init",
  omitted_variables = c("Log_UMI", omitted_variables)
)

Step 2 (Fitting the eSVD object): We now fit the eSVD. Typically, we recommend fitting twice (i.e., calling eSVD2::opt_esvd() twice) due to the non-convex nature of the objective function. In the first call of eSVD2::opt_esvd(), hold all the coefficients to all the covariates aside from the case-control covariate (here, denoted as "CC" so we remove as much of the case-control covariate effect as possible. Then, in the second call of eSVD2::opt_esvd(), we allow all the coefficients to be optimized.

Note that in the last fit of eSVD2::opt_esvd(), we do not set any omitted_variables. This is our typical recommendation -- it is good to allow all coefficients to be reparameterized right before the optimization is all done.

On a single core of a 2023 Macbook Pro (Apple M2 Max, 32 Gb RAM), for this vignette's dataset, the first fit (i.e., call of eSVD2::opt_esvd()) took roughly 7 minutes (for 11 iterations). The second fit took roughly 4 minutes (for 15 iterations).

time_start2 <- Sys.time()
eSVD_obj <- eSVD2::opt_esvd(
  input_obj = eSVD_obj,
  l2pen = 0.1,
  max_iter = 100,
  offset_variables = setdiff(colnames(eSVD_obj$covariates), "diagnosis_ASD"),
  tol = 1e-6,
  verbose = 1,
  fit_name = "fit_First",
  fit_previous = "fit_Init"
)
time_end2 <- Sys.time()
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_First",
  omitted_variables = c("Log_UMI", omitted_variables)
)

time_start3 <- Sys.time()
eSVD_obj <- eSVD2::opt_esvd(
  input_obj = eSVD_obj,
  l2pen = 0.1,
  max_iter = 100,
  offset_variables = NULL,
  tol = 1e-6,
  verbose = 1,
  fit_name = "fit_Second",
  fit_previous = "fit_First"
)
time_end3 <- Sys.time()
eSVD_obj <- eSVD2::reparameterization_esvd_covariates(
  input_obj = eSVD_obj,
  fit_name = "fit_Second",
  omitted_variables = omitted_variables
)

Step 3 (Estimating the nuisance (i.e., over-dispersion) parameters): After fitting the eSVD, we need to estimate the nuisance parameters. Since we are modeling the data via a negative binomial, this nuisance parameter is the over-dispersion parameter. This is done via the eSVD2::estimate_nuisance() function.

On a single core of a 2023 Macbook Pro (Apple M2 Max, 32 Gb RAM), this took roughly 9 minutes.

time_start4 <- Sys.time()
eSVD_obj <- eSVD2::estimate_nuisance(
  input_obj = eSVD_obj,
  bool_covariates_as_library = TRUE,
  bool_library_includes_interept = TRUE,
  bool_use_log = FALSE,
  verbose = 1
)
time_end4 <- Sys.time()

Step 4 (Computing the posterior): With the over-disperison parameter estimated, we are now ready to compute the posterior mean and variance of each cell's gene expression value. This posterior distribution is to balance out the fitted eSVD denoised value against the covariate-adjusted library size. This step ensures that even though we pooled all the genes to denoise the single-cell expression matrix, we do not later inflate the Type-1 error. This is done via the eSVD2::compute_posterior() function.

eSVD_obj <- eSVD2::compute_posterior(
  input_obj = eSVD_obj,
  bool_adjust_covariates = FALSE,
  alpha_max = 2 * max(mat@x),
  bool_covariates_as_library = TRUE,
  bool_stabilize_underdispersion = TRUE,
  library_min = 0.1,
  pseudocount = 0
)

Step 5 (Computing the test statistics): Now we can compute the test statistics using the eSVD2::compute_test_statistic() function.

eSVD_obj <- eSVD2::compute_test_statistic(input_obj = eSVD_obj, verbose = 1)

Step 6 (Computing the p-values): Lastly, we can compute the p-values via the eSVD2::compute_pvalue() function.

eSVD_obj <- eSVD2::compute_pvalue(input_obj = eSVD_obj)

save(eSVD_obj, session_info, date_of_run,
     file = "asd_esvd.RData")

Plotting the p-values

To visualize the p-values, we can plot a volcano plot where the x-axis is our estimated log fold change (based on the global case donor mean, where each donor's specific mean is first computed, minus the global control donor mean) and the y-axis the negative log 10 p-value. The dotted horizontal line denotes the FDR cutoff.

To give this volcano plot some biological interpretation, we show where the genes found in Gandal et al. (see https://www.nature.com/articles/s41586-022-05377-7) on the volcano plot as well. That study was based on the bulk sequencing of brain regions, where possibly many cell types are involved in the DEG analysis. Nonetheless, we would expect a good number of DEGs from Gandal et al. to be highly significant in our analysis, which is what our volcano plot shows.

pvalue_vec <- 10 ^ (-eSVD_obj$pvalue_list$log10pvalue)
tmp_df <- data.frame(
  gene = names(eSVD_obj$pvalue_list$log10pvalue),
  lfc = log2(eSVD_obj$case_mean) - log2(eSVD_obj$control_mean),
  pvalue = pvalue_vec
)
pCutoff <- max(pvalue_vec[which(eSVD_obj$pvalue_list$fdr_vec <= 0.05)])

data("gandal_df", package = "eSVD2")
gandal_genes <- gandal_df[which(gandal_df[, "WholeCortex_ASD_FDR"] <= 0.05), "external_gene_name"]
gandal_genes <- intersect(gandal_genes, tmp_df$gene)

tmp_df2 <- tmp_df[c(which(!tmp_df$gene %in% gandal_genes),
                    which(tmp_df$gene %in% gandal_genes)), ]

colCustom <- sapply(tmp_df2$gene, function(gene) {
  ifelse(gene %in% gandal_genes, "coral", "navy")
})
names(colCustom)[colCustom == "coral"] <- "Gandal et al."
names(colCustom)[colCustom == "navy"] <- "others"

# png(
#   "volcano.png",
#   width = 1500,
#   height = 2000,
#   res = 300,
#   units = "px"
# )
EnhancedVolcano::EnhancedVolcano(
  tmp_df2,
  lab = tmp_df2$gene,
  selectLab = gandal_genes,
  colCustom = colCustom,
  x = "lfc",
  y = "pvalue",
  ylim = c(0, max(eSVD_obj$pvalue_list$log10pvalue)),
  labSize = 3,
  pCutoff = pCutoff,
  FCcutoff = 0,
  drawConnectors = TRUE
)
# graphics.off()

knitr::include_graphics("https://github.com/linnykos/eSVD2/blob/master/vignettes/volcano.png?raw=true")

Alternatively, you can make the volcano plot that specifically highlights the housekeeping genes, which can act as a rough "negative control". That is, we would not expect many of the housekeeping genes to be significantly different between case and control donors. This is what we see in our volcano plot.

data("housekeeping_df", package = "eSVD2")
hk_genes <- housekeeping_df[, "Gene"]
hk_genes <- intersect(hk_genes, tmp_df$gene)

tmp_df2 <- tmp_df[c(which(!tmp_df$gene %in% hk_genes),
                    which(tmp_df$gene %in% hk_genes)), ]

colCustom <- sapply(tmp_df2$gene, function(gene) {
  ifelse(gene %in% hk_genes, "darkolivegreen1", "navy")
})
names(colCustom)[colCustom == "darkolivegreen1"] <- "Housekeeping genes"
names(colCustom)[colCustom == "navy"] <- "others"

# png(
#   "volcano_hk.png",
#   width = 1500,
#   height = 2000,
#   res = 300,
#   units = "px"
# )
EnhancedVolcano::EnhancedVolcano(
  tmp_df2,
  lab = tmp_df2$gene,
  selectLab = hk_genes,
  colCustom = colCustom,
  x = "lfc",
  y = "pvalue",
  ylim = c(0, max(eSVD_obj$pvalue_list$log10pvalue)),
  labSize = 3,
  pCutoff = pCutoff,
  FCcutoff = 0,
  drawConnectors = TRUE
)
# graphics.off()

knitr::include_graphics("https://github.com/linnykos/eSVD2/blob/master/vignettes/volcano_hk.png?raw=true")

Setup

The following shows the suggested package versions that the developer (GitHub username: linnykos) used when developing the eSVD2 package.

devtools::session_info()

linnykos/eSVD2 documentation built on July 17, 2024, 12:01 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

linnykos/eSVD2
eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis

In linnykos/eSVD2: eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis

Loading the data

Using eSVD-DE

Plotting the p-values

Setup

R Package Documentation

Browse R Packages

We want your feedback!

linnykos/eSVD2 eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis

In linnykos/eSVD2: eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis

Loading the data

Using eSVD-DE

Plotting the p-values

Setup

R Package Documentation

Browse R Packages

We want your feedback!

linnykos/eSVD2
eSVD2 for performing the eSVD-DE for cohort-wide differential expression analysis