In rnabioco/djvdj: A collection of single-cell V(D)J tools

# Chunk opts
knitr::opts_chunk$set(
  collapse   = TRUE,
  comment    = "#>",
  warning    = FALSE,
  message    = FALSE,
  fig.width  = 7,
  fig.height = 4
)

This vignette provides detailed examples for plotting V(D)J data imported into a single-cell object using djvdj. For the examples shown below, we use data for splenocytes from BL6 and MD4 mice collected using the 10X Genomics scRNA-seq platform. MD4 B cells are monoclonal and specifically bind hen egg lysozyme.

library(djvdj)
library(Seurat)
library(ggplot2)

# Load GEX data
data_dir <- system.file("extdata/splen", package = "djvdj")

gex_dirs <- c(
  BL6 = file.path(data_dir, "BL6_GEX/filtered_feature_bc_matrix"),
  MD4 = file.path(data_dir, "MD4_GEX/filtered_feature_bc_matrix")
)

so <- gex_dirs |>
  Read10X() |>
  CreateSeuratObject() |>
  AddMetaData(splen_meta)

# Add V(D)J data to object
vdj_dirs <- c(
  BL6 = system.file("extdata/splen/BL6_BCR", package = "djvdj"),
  MD4 = system.file("extdata/splen/MD4_BCR", package = "djvdj")
)

so <- so |>
  import_vdj(
    vdj_dirs,
    define_clonotypes = "cdr3_gene",
    include_mutations = TRUE
  )

Histograms

The plot_histogram() function can be used to plot numerical V(D)J data present in the object. By default, this will plot the values present in the data_col column for every cell. If data_col contains per-chain data (i.e. the column contains the character specified by sep, default is ';'), the values will be summarized for each cell. By default the mean will be calculated. Cells that lack V(D)J data are removed before plotting, so for this example we have r nrow(na.omit(so@meta.data)) cells with V(D)J data. The trans argument can be used to specify an x-axis transformation.

In the example below, the mean number of UMIs per cell is plotted.

so |>
  plot_histogram(
    data_col = "umis",
    trans    = "log10"
  )

To color cells based on an additional variable present in the object, specify the column name to the cluster_col argument.

so |>
  plot_histogram(
    data_col    = "umis",
    cluster_col = "orig.ident",
    trans       = "log10"
  )

The function used to summarize per-chain values can be changed using the summary_fn argument. The per-chain values can also be plotted separately using the per_chain argument.

In the example below, the value for each chain is plotted.

so |>
  plot_histogram(
    data_col    = "umis",
    cluster_col = "orig.ident",
    per_chain   = TRUE,
    trans       = "log10"
  )

To only plot values for a specific chain, the chain argument can be used. In this example the mean number of UMIs is shown for each IGK chain.

so |>
  plot_histogram(
    data_col    = "umis",
    cluster_col = "orig.ident",
    chain       = "IGK",
    trans       = "log10"
  )

Violin plots

The plot_violin() function has similar functionality as plot_histogram() but will generate violin plots or boxplots. For all djvdj plotting functions additional arguments can be passed directly to ggplot2 to further modify plot aesthetics. By default clusters are arranged highest to lowest based on the values in data_col. This ordering can be modified using the plot_lvls argument"

vln <- so |>
  plot_violin(
    data_col    = "nCount_RNA",
    cluster_col = "sample",
    trans       = "log10",

    draw_quantiles = c("0.25", "0.75")  # parameter for ggplot2::geom_violin()
  )

bx <- so |>
  plot_violin(
    data_col    = "nFeature_RNA",
    cluster_col = "sample",
    method      = "boxplot"
  )

vln + bx

UMAP projections

To plot V(D)J information on UMAP projections, the plot_scatter() function can be used. By default this function will plot the 'UMAP_1' and 'UMAP_2' columns on the x- and y-axis. A function to use for summarizing per-chain values for each cell must be specified using the summary_fn argument, by default this is mean() for numeric data. Cells that lack V(D)J data are stored as NAs in the object and will be shown on the plot as light grey. The color used for plotting NAs can be modified with the na_color argument.

In the example below, the mean number of mismatch mutations for each cell is shown.

so |>
  plot_scatter(
    data_col = "all_mis",
    na_color = "lightgrey"
  )

Instead of summarizing the per-chain values for each cell, we can also specify a chain to use for plotting. In this example we are plotting the CDR3 length for IGK chains. If a cell does not have an IGK chain or has multiple IGK chains, it will be plotted as an NA. Like other plotting functions, values can be transformed using the trans argument.

so |>
  plot_scatter(
    data_col = "reads",
    chain    = "IGK",
    trans    = "log10"
  )

Cell clusters can be outlined by setting the outline argument, this can help visualize points that are lightly colored.

u1 <- so |>
  plot_scatter(
    data_col    = "reads",
    plot_colors = c("white", "lightblue", "red"),
    trans       = "log10",
    size        = 2,
    outline     = TRUE
  )

u2 <- so |>
  plot_scatter(
    data_col    = "orig.ident",
    plot_colors = c("orange", "lightyellow"),
    size        = 2,
    outline     = TRUE
  )

u1 + u2

Plotting top clusters

To only plot the top clusters, the number of clusters to include can be specified with the top argument. For plot_violin(), clusters are ranked based on values in the data_col column. For plot_scatter(), clusters are ranked based on number of cells. Clusters can also be specified by passing a vector of cluster names. The remaining cells are labeled using the other_label argument.

# Show the top 3 clusters with highest values for nFeature_RNA
bx <- so |>
  plot_violin(
    data_col    = "nFeature_RNA",
    cluster_col = "seurat_clusters",
    method      = "boxplot",
    top         = 3
  )

# Select top clusters by passing a vector of names
u <- so |>
  plot_scatter(
    data_col = "seurat_clusters",
    top      = c("4", "1", "3")
  )

bx + u

Splitting plots

Plots can be split into separate panels based on an additional grouping variable using the group_col argument. The arrangement of plot panels and the axis scales used for each panel can be adjusted using the panel_nrow and panel_scales arguments.

so |>
  plot_histogram(
    data_col    = "reads",
    cluster_col = "orig.ident",
    group_col   = "seurat_clusters",
    trans       = "log10",

    panel_nrow = 2,
    panel_scales = "free_x"
  )

Plot aesthetics

Plot colors can be specified using the plot_colors argument. This should be a vector of colors, to specify colors by cluster name, a named vector can be passed. By default clusters are ordered with the most abundant clusters on top. To modify this ordering, a character vector can be passed to the plot_lvls argument.

so |>
  plot_scatter(
    data_col    = "orig.ident",
    plot_colors = c(MD4 = "red", BL6 = "lightblue")
  )

To only modify the color of a few clusters, a vector only containing the clusters of interest can be passed. The same is true for the plot_lvls argument, the name of a single cluster can be passed to plot it on top.

In the example below we keep all the default colors except for clusters 3 and 4 and we specifically plot these clusters on top.

so |>
  plot_scatter(
    data_col    = "seurat_clusters",
    plot_colors = c("4" = "darkred", "3" = "red"),
    plot_lvls   = c("3", "4")
  )

By default the number of data points plotted will be shown in the top right corner, plot legend, and/or x-axis. The location of the label can be specified using the n_label argument. Label appearance can be modified by passing a named list of aesthetic parameters to the label_params argument.

so |>
  plot_scatter(
    data_col     = "seurat_clusters",
    n_label      = "corner",
    label_params = list(color = "darkred", fontface = "bold")
  )