Objective: Implementation idea on adding the ability to specify multiple genes next to each other to the draw_boxplot()

Original draw_boxplot()

Here we just have faceting and stratification by other colData variables.

object <- HermesData(summarized_experiment)

assay_name = "counts"
x_var = "SEX"
y_var = genes(object)[19]
facet_var = "RACE"
color_var = "AGE18"

assay_matrix <- assay(object, assay_name)
col_data <- colData(object)
df <- data.frame(
  x = col_data[, x_var],
  y = assay_matrix[y_var, ]
) 
df$color <- col_data[[color_var]]
df$facet <- col_data[[facet_var]]


ggplot(df, aes(x = .data$x, y = .data$y)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar") +
  labs(x = x_var, y = y_var) +
  geom_point(aes(color = .data$color)) +
      labs(color = color_var) +
  facet_wrap(~facet)

Update to add a way to specify multiple genes

Option 1: for loop

We could do a for loop for y_var and keep x_var and facet_var to allow the user to be able to see distribution across different genes specified in the y_var by x_var and facet_var.

y_vars = genes(object)[1:2]
facet_vars = genes(object)[1:2]

for (thisyvar in y_vars) {
  df <- data.frame(
    x = col_data[, x_var],
    y = assay_matrix[thisyvar, ]
  )
  df$color <- col_data[[color_var]]
  df$facet <- col_data[[facet_var]]

  print(ggplot(df, aes(x = .data$x, y = .data$y)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar") +
  labs(x = x_var, y = thisyvar) +
  geom_point(aes(color = .data$color)) +
      labs(color = color_var) +
  facet_wrap(~facet)
  )
}

# Test that this approach works when only one gene is specified
for (thisyvar in y_var) {
  df <- data.frame(
    x = col_data[, x_var],
    y = assay_matrix[thisyvar, ]
  )
  df$color <- col_data[[color_var]]
  df$facet <- col_data[[facet_var]]

  print(ggplot(df, aes(x = .data$x, y = .data$y)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar") +
  labs(x = x_var, y = thisyvar) +
  geom_point(aes(color = .data$color)) +
      labs(color = color_var) +
  facet_wrap(~facet)
  )
}

However, the problem here is that it will be difficult to compare within a facet the different gene distributions as it is too far away visually. Also for multiple genes there will be too many plots on the page.

Option 2: Use fill aesthetic for the genes

The idea here is that we can add an additional aesthetic on top of the existing ones. Here the fill aesthetic is a natural choice as it allows to plot the gene specific boxes next to each other.

In order to achieve this we need to use a long format representation of the different gene expression values.

object <- HermesData(summarized_experiment)

assay_name <- "counts"
x_var <- "SEX"
y_var <- genes(object)[19]
y_vars <- genes(object)[1:2]
facet_var <- "RACE"
color_var <- "AGE18"

assay_matrix <- assay(object, assay_name)
col_data <- colData(object)
df2 <- data.frame(
  x = col_data[, x_var],
  color = col_data[, color_var],
  facet = col_data[, facet_var],
  y = as.numeric(t(assay_matrix[y_vars, , drop = FALSE])),
  fill = factor(rep(y_vars, each = ncol(assay_matrix)))
) 

Note that in the data.frame() call above, the three colData variables are repeated for each gene (as specified in y_vars). Therefore we need to transpose the subset of the assay matrix such that each gene has a column and the rows represent the samples. When we then coerce to a numeric vector the columns are bound together, we have all expression values for the first gene, then for the second gene, etc. This then matches the colData repetitions correctly.

Given the long format it is almost as simple as before to plot the boxplots, as we just add the additional aesthetic.

ggplot(df2, aes(x = .data$x, y = .data$y, fill = .data$fill)) +
  geom_boxplot() +
  stat_boxplot(geom = "errorbar") +
  geom_point(
    position = position_jitterdodge(jitter.width = 0),
    aes(color = .data$color, group = .data$fill)
  ) +
  labs(x = x_var, y = assay_name, color = color_var, fill = "Gene") +
  facet_wrap(~ facet)

We just need to make sure the points position uses the position_jitterdodge() function, see https://ggplot2.tidyverse.org/reference/position_jitterdodge.html. Depending on whether we want to jitter the points or not we can set jitter_width to 0 or leave at default (which uses jittering).

For the y-axis label we can show the assay name, since we show the gene ID(s) now in the legend of the plot.



insightsengineering/hermes documentation built on Dec. 15, 2024, 8:07 a.m.