knitr::opts_chunk$set(echo = TRUE)
For this vignette, we will be using a dataset that can be downloaded from https://www.dropbox.com/scl/fi/csyz62obxx8xj5gl5sgp4/asd_seurat.RData?rlkey=d7dlhqjesunal52u8z2iomjsg&dl=0.
This file is called asd_seurat.RData
, and contains the following variables:
asd
: The Seurat object we will be analyzing, containing 13,302 cells and 6,599 genes (all of which are deemed highly variable)date_of_run
: The date when asd
was constructedsession_info
: The session info when asd
was constructedNotably, this is not the version of the dataset used in the eSVD-DE paper. Instead, the intent of this vignette to show you how to perform an eSVD-DE analysis. That is, while the processed dataset in the paper is different from this vignette's processed dataset, the procedure we showcase is here is intended as the typical usage of our method.
If you wish to see how this data was preprocessed starting from publicly available data (or you cannot download the .RData
file), please see https://linnykos.github.io/eSVD2/articles/asd-preprocess.html.
Relevant citation:
Velmeshev, D., Schirmer, L., Jung, D., Haeussler, M., Perez, Y., Mayer, S., ... & Kriegstein, A. R. (2019). Single-cell genomics identifies cell type–specific molecular changes in autism. Science, 364(6441), 685-689.
library(Seurat) library(SeuratObject) library(EnhancedVolcano) set.seed(10) date_of_run <- Sys.time() session_info <- devtools::session_info()
We first load the data. We can use the table()
function to roughly see how many cells each donor contributes and how many donors were classified to be ASD
or Control
.
load("asd_seurat.RData") asd table(asd$individual, asd$diagnosis)
An object of class Seurat 6599 features across 13302 samples within 1 assay Active assay: RNA (6599 features, 6599 variable features) 2 layers present: counts, data
ASD Control indiv_1823 0 72 indiv_4341 0 733 indiv_4849 428 0 indiv_4899 349 0 indiv_5144 202 0 indiv_5163 0 145 indiv_5242 0 81 indiv_5278 944 0 indiv_5294 449 0 indiv_5387 0 1129 indiv_5391 0 237 indiv_5403 69 0 indiv_5408 0 162 indiv_5419 327 0 indiv_5531 631 0 indiv_5538 0 391 indiv_5554 0 233 indiv_5565 881 0 indiv_5577 0 215 indiv_5841 451 0 indiv_5864 275 0 indiv_5879 0 278 indiv_5893 0 976 indiv_5936 0 193 indiv_5939 1016 0 indiv_5945 422 0 indiv_5958 0 826 indiv_5976 0 106 indiv_5978 264 0 indiv_6032 0 289 indiv_6033 528 0
We will be testing for differential expression in the mean between the 15 donors with ASD and the 16 control donors, among the 6599 genes.
factor
variables where the last level has the smallest size. This is because that last level will be the one that does not get its own indicator column during the one-hot encoding when using eSVD2::format_covariates()
.gene_vec <- Seurat::VariableFeatures(asd[["RNA"]]) mat <- SeuratObject::LayerData(asd, layer = "counts", assay = "RNA", features = gene_vec) mat <- Matrix::t(mat) covariate_df <- asd@meta.data[, c("individual", "region", "age", "sex", "diagnosis", "Seqbatch", "Capbatch")] factor_variables <- c("region", "diagnosis", "sex", "Seqbatch", "Capbatch") for (variable in factor_variables) { covariate_df[, variable] <- factor(covariate_df[, variable], levels = names(sort(table(covariate_df[, variable]), decreasing = T))) } covariate_df[, "diagnosis"] <- factor(covariate_df[, "diagnosis"], levels = c("Control", "ASD")) covariates <- eSVD2::format_covariates( dat = mat, covariate_df = covariate_df, rescale_numeric_variables = "age" )
eSVD
, which we will call eSVD_obj
. This is done using the eSVD2::initialize_esvd()
function. All the following operations will use
eSVD_obj
for all calculations through the usage of this method. This is where we pass the main inputs dat
, covariates
, and metadata_individual
. Notice that after eSVD2::initialize_esvd()
(and later, after eSVD2::opt_esvd()
) we always call eSVD2::reparameterization_esvd_covariates()
to ensure that the fit remains identifiable.Empirically, we have found it useful to set omitted_variables
as the Log_UMI
and the batch variables. This means that when eSVD2 reparameterizes the variables, it does not fiddle with the coefficient for Log_UMI
. We found this beneficial since the coefficient for specifically the covariate Log_UMI
bears a special meaning (as it impacts the normalization of the sequencing depth most directly), and we do not wish to "mix" covariates correlated with Log_UMI
too early. We have also found it useful to keep the batch variables separate since some datasets we have analyzed have donor covariates that are very correlated with the batch variable, and keeping these two aspects (i.e., the biological effect of donor covariates, and the artifical effect of the sequencing batches) seperate to be beneficial. (Overall, this is not a strict recommendation, however, we have found this practice to be beneficial in almost all instances we have used eSVD2, so we ourselves have not tried deviating from this convention.)
On a single core of a 2023 Macbook Pro (Apple M2 Max, 32 Gb RAM), this took roughly 7 minutes.
time_start1 <- Sys.time() eSVD_obj <- eSVD2::initialize_esvd( dat = mat, covariates = covariates[, -grep("individual", colnames(covariates))], metadata_individual = covariate_df[, "individual"], case_control_variable = "diagnosis_ASD", bool_intercept = TRUE, k = 30, lambda = 0.1, metadata_case_control = covariates[, "diagnosis_ASD"], verbose = 1 ) time_end1 <- Sys.time() omitted_variables <- colnames(eSVD_obj$covariates)[c(grep("Seqbatch", colnames(eSVD_obj$covariates)), grep("Capbatch", colnames(eSVD_obj$covariates)))] eSVD_obj <- eSVD2::reparameterization_esvd_covariates( input_obj = eSVD_obj, fit_name = "fit_Init", omitted_variables = c("Log_UMI", omitted_variables) )
eSVD2::opt_esvd()
twice) due to the non-convex nature of the objective function. In the first call of eSVD2::opt_esvd()
, hold all the coefficients to all the covariates aside from the case-control covariate (here, denoted as "CC"
so we remove as much of the case-control covariate effect as possible. Then, in the second call of eSVD2::opt_esvd()
, we allow all the coefficients to be optimized.Note that in the last fit of eSVD2::opt_esvd()
, we do not set any omitted_variables
. This is our typical recommendation -- it is good to allow all coefficients to be reparameterized right before the optimization is all done.
On a single core of a 2023 Macbook Pro (Apple M2 Max, 32 Gb RAM), for this vignette's dataset, the first fit (i.e., call of eSVD2::opt_esvd()
) took roughly 7 minutes (for 11 iterations).
The second fit took roughly 4 minutes (for 15 iterations).
time_start2 <- Sys.time() eSVD_obj <- eSVD2::opt_esvd( input_obj = eSVD_obj, l2pen = 0.1, max_iter = 100, offset_variables = setdiff(colnames(eSVD_obj$covariates), "diagnosis_ASD"), tol = 1e-6, verbose = 1, fit_name = "fit_First", fit_previous = "fit_Init" ) time_end2 <- Sys.time() eSVD_obj <- eSVD2::reparameterization_esvd_covariates( input_obj = eSVD_obj, fit_name = "fit_First", omitted_variables = c("Log_UMI", omitted_variables) ) time_start3 <- Sys.time() eSVD_obj <- eSVD2::opt_esvd( input_obj = eSVD_obj, l2pen = 0.1, max_iter = 100, offset_variables = NULL, tol = 1e-6, verbose = 1, fit_name = "fit_Second", fit_previous = "fit_First" ) time_end3 <- Sys.time() eSVD_obj <- eSVD2::reparameterization_esvd_covariates( input_obj = eSVD_obj, fit_name = "fit_Second", omitted_variables = omitted_variables )
eSVD2::estimate_nuisance()
function.On a single core of a 2023 Macbook Pro (Apple M2 Max, 32 Gb RAM), this took roughly 9 minutes.
time_start4 <- Sys.time() eSVD_obj <- eSVD2::estimate_nuisance( input_obj = eSVD_obj, bool_covariates_as_library = TRUE, bool_library_includes_interept = TRUE, bool_use_log = FALSE, verbose = 1 ) time_end4 <- Sys.time()
eSVD2::compute_posterior()
function.eSVD_obj <- eSVD2::compute_posterior( input_obj = eSVD_obj, bool_adjust_covariates = FALSE, alpha_max = 2 * max(mat@x), bool_covariates_as_library = TRUE, bool_stabilize_underdispersion = TRUE, library_min = 0.1, pseudocount = 0 )
eSVD2::compute_test_statistic()
function.eSVD_obj <- eSVD2::compute_test_statistic(input_obj = eSVD_obj, verbose = 1)
eSVD2::compute_pvalue()
function.eSVD_obj <- eSVD2::compute_pvalue(input_obj = eSVD_obj) save(eSVD_obj, session_info, date_of_run, file = "asd_esvd.RData")
To visualize the p-values, we can plot a volcano plot where the x-axis is our estimated log fold change (based on the global case donor mean, where each donor's specific mean is first computed, minus the global control donor mean) and the y-axis the negative log 10 p-value. The dotted horizontal line denotes the FDR cutoff.
To give this volcano plot some biological interpretation, we show where the genes found in Gandal et al. (see https://www.nature.com/articles/s41586-022-05377-7) on the volcano plot as well. That study was based on the bulk sequencing of brain regions, where possibly many cell types are involved in the DEG analysis. Nonetheless, we would expect a good number of DEGs from Gandal et al. to be highly significant in our analysis, which is what our volcano plot shows.
pvalue_vec <- 10 ^ (-eSVD_obj$pvalue_list$log10pvalue) tmp_df <- data.frame( gene = names(eSVD_obj$pvalue_list$log10pvalue), lfc = log2(eSVD_obj$case_mean) - log2(eSVD_obj$control_mean), pvalue = pvalue_vec ) pCutoff <- max(pvalue_vec[which(eSVD_obj$pvalue_list$fdr_vec <= 0.05)]) data("gandal_df", package = "eSVD2") gandal_genes <- gandal_df[which(gandal_df[, "WholeCortex_ASD_FDR"] <= 0.05), "external_gene_name"] gandal_genes <- intersect(gandal_genes, tmp_df$gene) tmp_df2 <- tmp_df[c(which(!tmp_df$gene %in% gandal_genes), which(tmp_df$gene %in% gandal_genes)), ] colCustom <- sapply(tmp_df2$gene, function(gene) { ifelse(gene %in% gandal_genes, "coral", "navy") }) names(colCustom)[colCustom == "coral"] <- "Gandal et al." names(colCustom)[colCustom == "navy"] <- "others" # png( # "volcano.png", # width = 1500, # height = 2000, # res = 300, # units = "px" # ) EnhancedVolcano::EnhancedVolcano( tmp_df2, lab = tmp_df2$gene, selectLab = gandal_genes, colCustom = colCustom, x = "lfc", y = "pvalue", ylim = c(0, max(eSVD_obj$pvalue_list$log10pvalue)), labSize = 3, pCutoff = pCutoff, FCcutoff = 0, drawConnectors = TRUE ) # graphics.off()
knitr::include_graphics("https://github.com/linnykos/eSVD2/blob/master/vignettes/volcano.png?raw=true")
Alternatively, you can make the volcano plot that specifically highlights the housekeeping genes, which can act as a rough "negative control". That is, we would not expect many of the housekeeping genes to be significantly different between case and control donors. This is what we see in our volcano plot.
data("housekeeping_df", package = "eSVD2") hk_genes <- housekeeping_df[, "Gene"] hk_genes <- intersect(hk_genes, tmp_df$gene) tmp_df2 <- tmp_df[c(which(!tmp_df$gene %in% hk_genes), which(tmp_df$gene %in% hk_genes)), ] colCustom <- sapply(tmp_df2$gene, function(gene) { ifelse(gene %in% hk_genes, "darkolivegreen1", "navy") }) names(colCustom)[colCustom == "darkolivegreen1"] <- "Housekeeping genes" names(colCustom)[colCustom == "navy"] <- "others" # png( # "volcano_hk.png", # width = 1500, # height = 2000, # res = 300, # units = "px" # ) EnhancedVolcano::EnhancedVolcano( tmp_df2, lab = tmp_df2$gene, selectLab = hk_genes, colCustom = colCustom, x = "lfc", y = "pvalue", ylim = c(0, max(eSVD_obj$pvalue_list$log10pvalue)), labSize = 3, pCutoff = pCutoff, FCcutoff = 0, drawConnectors = TRUE ) # graphics.off()
knitr::include_graphics("https://github.com/linnykos/eSVD2/blob/master/vignettes/volcano_hk.png?raw=true")
The following shows the suggested package versions that the developer (GitHub username: linnykos) used when developing the eSVD2 package.
devtools::session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.