calculate_markers: Calculate markers
In Core-Bioinformatics/ClustAssess: Tools for Assessing Clustering

Description

Performs the Wilcoxon rank sum test to identify differentially expressed genes between two groups of cells.
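
For orientation, the test named above can be sketched for a single gene with base R's `wilcox.test`. This is only an illustration on simulated values; the variable names `group1` and `group2` are hypothetical, and the package may implement the test differently (for example, on a precomputed rank matrix).

```r
# Simulated expression of one gene in two groups of cells (toy values, not real data)
set.seed(1)
group1 <- runif(30, min = 3, max = 4)
group2 <- runif(30, min = 0, max = 1)

# Two-sided Wilcoxon rank sum test from the stats package
res <- wilcox.test(group1, group2)

# A small p-value indicates that expression differs between the two groups
res$p.value
```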

Usage

calculate_markers(
  expression_matrix,
  cells1,
  cells2,
  logfc_threshold = 0,
  min_pct_threshold = 0.1,
  avg_expr_threshold_group1 = 0,
  min_diff_pct_threshold = -Inf,
  rank_matrix = NULL,
  feature_names = NULL,
  used_slot = "data",
  norm_method = "SCT",
  pseudocount_use = 1,
  base = 2,
  adjust_pvals = TRUE,
  check_cells_set_diff = TRUE
)

Arguments

`expression_matrix`	A matrix of gene expression values having genes in rows and cells in columns.
`cells1`	A vector of cell indices for the first group of cells.
`cells2`	A vector of cell indices for the second group of cells.
`logfc_threshold`	The minimum absolute log fold change to consider a gene as differentially expressed. Defaults to `0`, meaning all genes are taken into consideration.
`min_pct_threshold`	The minimum fraction of cells expressing a gene from each cell population to consider the gene as differentially expressed. Increasing the value will speed up the function. Defaults to `0.1`.
`avg_expr_threshold_group1`	The minimum average expression that a gene should have in the first group of cells to be considered as differentially expressed. Defaults to `0`.
`min_diff_pct_threshold`	The minimum difference in the fraction of cells expressing a gene between the two cell populations to consider the gene as differentially expressed. Defaults to `-Inf`.
`rank_matrix`	A matrix where the cells are ranked based on their expression levels with respect to each gene. Defaults to `NULL`, in which case the function will calculate the rank matrix. We recommend calculating the rank matrix beforehand and passing it to the function to speed up the computation.
`feature_names`	A vector of gene names. Defaults to `NULL`, in which case the function will use the row names of the expression matrix as gene names.
`used_slot`	Indicates whether the expression matrix was scaled or not, which determines how the fold change is calculated. If `data`, the function calculates the fold change as the ratio between the log of the average exponentiated expression of the two cell groups. If `scale.data`, the function calculates the fold change as the ratio between the average expression values of the two cell groups. Any other value defaults to calculating the fold change as the ratio between the log of the average expression values of the two cell groups. Defaults to `data`.
`norm_method`	The normalization method used to normalize the expression matrix. The value of this parameter affects how the average expression of the genes is calculated when `used_slot = "data"`. If `LogNormalize`, the log fold change is calculated as described for the `used_slot` parameter. Otherwise, the log fold change is calculated as the ratio between the log of the average expression values of the two cell groups. Defaults to `SCT`.
`pseudocount_use`	The pseudocount to add to the expression values when calculating the average expression of the genes, to avoid the 0 value for the denominator. Defaults to `1`.
`base`	The base of the logarithm. Defaults to `2`.
`adjust_pvals`	A logical value indicating whether to adjust the p-values for multiple testing using the Bonferroni method. Defaults to `TRUE`.
`check_cells_set_diff`	A logical value indicating whether to check if the two cell groups are disjoint or not. Defaults to `TRUE`.
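
The following sketch shows one possible reading of the `used_slot`, `norm_method`, `pseudocount_use`, `base` and `check_cells_set_diff` descriptions above. It is not taken from the package source. The helpers `pct_expressing` and `log_fc_sketch` are hypothetical names, and the fold change is assumed to be the log of the ratio of pseudocount-shifted group averages, which is the same as a difference of logs.

```r
# Assumed definition of "expressing": a cell has non-zero expression of the gene.
pct_expressing <- function(x) mean(x > 0)

# Assumed log fold change for one gene between two groups of cells.
# log_normalized = TRUE mimics used_slot = "data" with norm_method = "LogNormalize":
# the values are exponentiated (expm1) before averaging.
log_fc_sketch <- function(x1, x2, pseudocount_use = 1, base = 2,
                          log_normalized = TRUE) {
  if (log_normalized) {
    m1 <- mean(expm1(x1))
    m2 <- mean(expm1(x2))
  } else {
    m1 <- mean(x1)
    m2 <- mean(x2)
  }
  # The pseudocount keeps the denominator away from 0
  log(m1 + pseudocount_use, base = base) - log(m2 + pseudocount_use, base = base)
}

# Disjointness check in the spirit of check_cells_set_diff
cells1 <- 101:200
cells2 <- 1:100
groups_are_disjoint <- length(intersect(cells1, cells2)) == 0
```

Under these assumptions, a gene with an average of 3 in the first group and 1 in the second gives `log2(3 + 1) - log2(1 + 1)`, a log2 fold change of 1.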

Value

A data frame containing the following columns:

gene: The gene name.
avg_log2FC: The average log fold change between the two cell groups.
p_val: The p-value of the Wilcoxon rank sum test.
p_val_adj: The adjusted p-value of the Wilcoxon rank sum test.
pct.1: The fraction of cells expressing the gene in the first cell group.
pct.2: The fraction of cells expressing the gene in the second cell group.
avg_expr_group1: The average expression of the gene in the first cell group.
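
As a sketch of how such a table might be used downstream, the example below builds a hypothetical data frame with some of the documented columns (all values are made up), applies a Bonferroni correction with `p.adjust`, and keeps the genes that pass two illustrative cut-offs. The number of tests the package uses for the correction may differ from the number of rows assumed here.

```r
# Hypothetical result table; the values are invented for illustration only
markers <- data.frame(
  gene = c("feature 1", "feature 2", "feature 3"),
  avg_log2FC = c(2.1, 0.2, -1.5),
  p_val = c(0.001, 0.04, 0.5),
  stringsAsFactors = FALSE
)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
markers$p_val_adj <- p.adjust(markers$p_val, method = "bonferroni")

# Keep genes with adjusted p-value below 0.05 and absolute log2 fold change of at least 1
significant <- markers[markers$p_val_adj < 0.05 & abs(markers$avg_log2FC) >= 1, ]
significant$gene
```

The thresholds `0.05` and `1` are arbitrary choices for the example, not recommendations from the package.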

Examples

set.seed(2024)
# create an artificial expression matrix
expr_matrix <- matrix(
    c(runif(100 * 50), runif(100 * 50, min = 3, max = 4)),
    ncol = 200, byrow = FALSE
)
colnames(expr_matrix) <- as.character(1:200)
rownames(expr_matrix) <- paste("feature", 1:50)

calculate_markers(
    expression_matrix = expr_matrix,
    cells1 = 101:200,
    cells2 = 1:100
)
# TODO should be rewritten such that you don't create new matrix objects inside
# just

Core-Bioinformatics/ClustAssess documentation built on Feb. 18, 2025, 10:20 a.m.