# Chunk opts knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE, fig.width = 8, fig.height = 5 )
This vignette provides detailed examples for clustering cells based on CDR3 sequences. For the examples shown below, we use data for splenocytes from BL6 and MD4 mice collected using the 10X Genomics scRNA-seq platform. MD4 B cells are monoclonal and specifically bind hen egg lysozyme.
library(djvdj) library(Seurat) library(ggplot2) library(RColorBrewer) # Load GEX data data_dir <- system.file("extdata/splen", package = "djvdj") gex_dirs <- c( BL6 = file.path(data_dir, "BL6_GEX/filtered_feature_bc_matrix"), MD4 = file.path(data_dir, "MD4_GEX/filtered_feature_bc_matrix") ) so <- gex_dirs |> Read10X() |> CreateSeuratObject() |> AddMetaData(splen_meta) # Add V(D)J data to object vdj_dirs <- c( BL6 = system.file("extdata/splen/BL6_BCR", package = "djvdj"), MD4 = system.file("extdata/splen/MD4_BCR", package = "djvdj") ) so <- so |> import_vdj(vdj_dirs, define_clonotypes = "cdr3_gene")
The cluster_sequences()
function can be used to cluster cells based on CDR3 sequences, or any other sequences present in the object. Provide the meta.data column containing sequences to the data_col
argument. By default if a cell has multiple chains, the sequences will be concatenated.
By default, distances are calculated for amino acid sequences using the BLOSUM62 substitution matrix, which is based on observed amino acid frequencies and substitution probabilities. The calculated distances are then used to cluster cells using either the Louvain or Leiden clustering algorithms. The coarseness of the clusters an be adjusted using the resolution
argument with smaller values returning fewer clusters.
In this example we are clustering cells based on the CDR3 amino acid sequence.
so_vdj <- so |> cluster_sequences( data_col = "cdr3", method = "louvain", # clustering method dist_method = "BLOSUM62", # method for calculating sequence distances resolution = 0.5 )
Use the chain
argument to cluster using only sequences from a specific chain.
so_vdj <- so |> cluster_sequences( data_col = "cdr3", chain = "IGK" )
By default the Uniform Manifold Approximation and Projection (UMAP) dimensional reduction method will be performed and UMAP coordinates will be added to the object. To skip this step, set the run_umap
argument to FALSE
. This requires the uwot package to be installed.
so_vdj <- so |> cluster_sequences( data_col = "cdr3", chain = "IGK", run_umap = FALSE )
To return clustering results for multiple resolutions, a vector can be provided to the resolution
argument.
set.seed(42) so_vdj <- so |> cluster_sequences( data_col = "cdr3", chain = "IGK", resolution = c(0.4, 0.8, 1.6) )
Clustering results can be visualized on a UMAP projection using the generic plotting function plot_scatter()
. Colors can be adjusted using the plot_colors
argument.
This function will often generate a warning saying that rows with missing values were removed, this is expected since some cells do not have V(D)J data and will not have UMAP coordinates.
clrs <- setNames(brewer.pal(11, "Paired"), 1:11) so_vdj |> plot_scatter( x = "cdr3_UMAP_1", y = "cdr3_UMAP_2", data_col = "cdr3_cluster_0.4", plot_colors = clrs )
To visualize the proportion of BL6 and MD4 cells in each cluster we can create a stacked bargraph using the plot_frequency()
function. MD4 cells are monoclonal and as expected these cells are found almost exclusively in a single cluster.
so_vdj |> plot_frequency( data_col = "cdr3_cluster_0.4", cluster_col = "sample", plot_colors = clrs )
so_vdj |> filter_vdj(n_chains <= 2) |> plot_histogram( data_col = "cdr3_length", cluster_col = "cdr3_cluster_0.4", group_col = "cdr3_cluster_0.4", panel_scales = "free_y", chain = "IGK", plot_colors = clrs ) so_vdj |> plot_scatter( data_col = "cdr3_cluster_0.4", group_col = "cdr3_cluster_0.4", plot_colors = clrs )
so_vdj |> plot_scatter( data_col = "cdr3_length", chain = "IGK" ) so_vdj |> plot_scatter( data_col = "cdr3_cluster_0.4", chain = "IGK" )
Using the plot_motifs()
function we can generate sequence motifs for each cluster. We just need to provide the same data_col
and chain
used for clustering. To create a separate motif for each cluster, we also need to provide the column containing cluster IDs to the cluster_col
argument.
As expected we see that most cells within the MD4 cluster have the exact same IGK CDR3 sequence.
so_vdj |> plot_motifs( data_col = "cdr3", cluster_col = "cdr3_cluster_0.4", chain = "IGK" )
By default, sequences are aligned to the 5' end and trimmed based on the width
parameter. Sequences can be aligned to the 3' end using the align_end
parameter. Sequences longer than the width cutoff are trimmed, sequences shorter than the width cutoff are removed. By default the width cutoff is automatically selected for each cluster to include at least 75% of sequences.
In this example we generate motifs for the last 11 amino acids of the CDR3.
so_vdj |> plot_motifs( data_col = "cdr3", cluster_col = "cdr3_cluster_0.4", chain = "IGK", align_end = "3", width = 11 )
Plot colors can be modified using the plot_colors
argument and the number of rows used to arrange panels can be adjusted with the panel_nrow
argument. Like most djvdj plotting functions, plot_motifs()
will return a ggplot2 object that can be modified with other ggplot2 functions.
so_vdj |> plot_motifs( data_col = "cdr3", cluster_col = "cdr3_cluster_0.4", chain = "IGK", plot_colors = brewer.pal(5, "Set1"), panel_nrow = 4 ) + theme( axis.text.x = element_blank(), axis.ticks.x = element_blank() )
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.