knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 5, fig.width = 5, fig.align = "center" ) set.seed(100)
GLIPH (Grouping of Lymphocyte Interactions by Paratope Hotspots) is an algorithm developed by Glanville et al. to identify specificity groups in the T cell receptor repertoire based on local (motif sharing) and global (hamming distance) similarities.(^1)(^,)(^2)
The authors of the paper published a version of GLIPH written in Perl(^2), but its disadvantage is a long runtime, especially for larger sample sizes. Recently, with a new version of the package, GLIPH2, they introduced an improved algorithm that speeds up the analysis by substituting repeated sampling by Fisher's exact test to assess the statistical significance of a given motif.(^3)(^,)(^4)
In this R implementation of GLIPH we followed their Perl script and we tried to include all the options of the original algorithm, but with a constant look on the analysis speed. With an input size of ~ $10^4$, this R implementation is about 50 times faster and for an input size of ~ $10^5$ sequences it is about 500 times faster than the Perl script. In addition, we implemented a function for visualizing the clusters generated by GLIPH, which enables graphical processing and analysis of GLIPH results through numerous customization options. The implementation of GLIPH2 as well as the function gliph_combined, which combines the different features of GLIPH and GLIPH2 in a customizable way, complete this package and allow the user to use the different GLIPH algorithms according to his own needs.
For more details about the method we recommend reading the original papers and the GLIPH/GLIPH2 documentation on github.(^1)(^,)(^2)(^,)(^3)(^,)(^4)
turboGliph can be installed as part of the Bioconductor repository with the following code from the R terminal. In the first lines, if not already done, the basic Bioconductor packages are installed. Once this is done, the last line can be used to install the turboGliph package.
if (!requireNamespace("BiocManager", quietly = TRUE)){ install.packages("BiocManager") BiocManager::install() } BiocManager::install("turboGliph")
In this introduction we will use gliph_input_data, the dataset published with GLIPH.(^1) First, let's take a look on the required format of the input data.
library(turboGliph) data("gliph_input_data") gliph_input_data[1:3,]
To use turbo_gliph, gliph2 and gliph_combined, CDR3 beta sequences are required. These must be specified either by a character vector or in a dataframe in a column named CDR3b. Additional information is not required for clustering, but is recommended for automatic cluster scoring. This includes the following information:
Notice: For turbo_gliph, gliph2 and gliph_combined the column names are important, the order of the columns can be arbitrary. In this example, four additional columns (TRBJ, CDR3a, TRAV, TRAJ) are given but will not be used by the GLIPH algorithms. However, this additional information is stored by the functions and will be visible in the returned clusters.The additional columns do not affect the algorithms as long as all column names specified in the input and output in this tutorial are avoided.
By default, the GLIPH algorithms in this package use the naive reference repertoire from the original GLIPH publication. However, there are several other reference repertoires available on the GLIPH2 website(^4) for different T cell subtypes as well as different species, which can be used in this package as well as user-created reference repertoires. For this, the reference sequences must be passed to the functions as a data frame under the parameter refdb_beta. In any case, a column named CDR3b is required which contains the sequences. Optionally (but mandatory for some parameter settings), a column named TRBV is expected to contain the V genes of the sequences. Additional information is ignored by the functions.
The turbo_gliph function is a runtime-efficient implementation of the GLIPH algorithm provided by Glanville et al. in their publication as a Perl script.(^1)(^,)(^2) The choice of the R programming language, a restructuring of some parts of the original code, and the incorporated option of parallelization allow a significant reduction in runtime, especially with respect to larger sample input sizes. The ability to customize the algorithm using a variety of input parameters was retained and extended to include some parameters that are listed on the github website but not included in the Perl script. Details about the algorithm and the input parameters can be found in the original publication, the github website and the documentation of this package.
The sequences are simply analyzed by calling the function turbo_gliph with the input dataframe as cdr3_sequences parameter.
res_turbogliph <- turboGliph::turbo_gliph(cdr3_sequences = gliph_input_data, n_cores = 1)
The output of the function is a list containing the identified similarities, the clustering result and some intermediate results of the algorithm. In the following, the complex output is explained in more detail.
In this list element the results of the repeated random sampling are summarized as a data frame. The columns of this data frame represent the individual motifs named in the column name. The rows represent the individual simulations of the repeated random sampling with the reference repertoire, where the first row refers to the sample set. The cells of the data frame contain numerical values that represent the frequency of the motif (indicated by the column) in the respective simulation step or in the sample set (indicated by the row).
If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: kmer_resample_sim_depth_log.txt
The frequencies of six motifs in the sample set and in the first four simulations are given below as examples:
res_turbogliph$sample_log[1:5, 1:6]
This list element contains the summary of the statistical evaluation of the identified motifs and is itself a list containing, on the one hand, the summary of all motifs (all_motifs) and, on the other hand, the summary of the significantly enriched motifs (selected_motifs).
If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following files:
File name of selected_motifs : kmer_resample_sim_depthminplcminpovelcminove.txt
File name of all_motifs : kmer_resample_sim_depth_all_motifs.txt
In the following, the statistics of the first three significantly enriched motifs are presented as an example:
res_turbogliph$motif_enrichment$selected_motifs[1:3,]
The columns contain the following information:
In this data frame all connections of the sequences are summarized by local and global similarities. The first two columns contain the connected sequences, whereas the third column indicates the type of connection (local, global or singleton for sequences without connections).
If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: clone_network.txt
The following is an example of the first three connections:
res_turbogliph$connections[1:3,]
This data frame summarizes all identified clusters, their members and the individual scores of the clusters. If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: convergence_groups.txt
The following is an example of the information provided by three clusters:
res_turbogliph$cluster_properties[2:4,]
As shown, several pieces of information are provided for each cluster:
The elements of this list, named with the corresponding cluster tags, contain data frames with the members of the respective cluster and all additional information provided by the user at the beginning of the algorithm in the parameter cdr3_sequences. If the output is stored in a file via the input parameter result_folder, this part of the output can be found as data frame in the following file: cluster_member_details.txt
The following is an example of the information of the first two members of the first cluster:
res_turbogliph$cluster_list[[1]][1:2,]
This list contains all input parameters with which the run of the turbo_Gliph function was called to retrieve the settings for other methods. Also, the user can check again at a later time which settings he had chosen for the run.
If the output is stored in a file via the input parameter result_folder, this part of the output can be found as a data frame in the following file: parameter.txt
In April 2020, a second version of GLIPH, called GLIPH2, was released by Huang et al.. This version can be used firstly via a server interface and secondly via two compiled programs (on Centos and IOS). For completeness, we have also implemented this version in R and included it in this package based on the information provided in the paper.(^3)(^,)(^4)
GLIPH2 differs from GLIPH in the following ways, which are explained in more detail in the original publication and on the corresponding website:(^3)(^,)(^4)
The sequences are simply analyzed by calling the function gliph2 with the input dataframe as cdr3_sequences parameter.
res_gliph2 <- turboGliph::gliph2(cdr3_sequences = gliph_input_data, n_cores = 1)
The output of the function is a list containing the identified similarities, the result of the clustering and some intermediate results of the algorithm. In the following, the complex output is explained in more detail.
This list element contains the summary of the statistical evaluation of the identified motifs and is itself a list containing on the one hand the summary of all motifs (all_motifs) and on the other hand the summary of the significantly enriched motifs (selected_motifs).
If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following files:
File name of selected_motifs : local_similarities_minp_lcminpminovelcminovekmer_mindepthkmer_mindepth.txt
File name of all_motifs : all_motifs.txt
In the following, the statistics of the first three significantly enriched motifs are presented as an example:
res_gliph2$motif_enrichment$selected_motifs[1:3,]
The columns contain the following information:
This list element contains the summary of the statistical evaluation of the identified global structures. If the output is saved to a file via the input parameter result_folder, this part of the output can be found in the following file: global_similarities.txt
The following is an example of the statistics of the first three structures:
res_gliph2$global_enrichment[1:3,]
The columns contain the following information:
In this data frame, all connections of the sequences are clustered by local and global similarities. The first two columns contain the connected sequences, the third column the type of connection (local, global or singleton for sequences without connections) and the fourth column the tag of the cluster.
If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: clone_network.txt
The following is an example of the first three connections:
res_gliph2$connections[1:3,]
This data frame summarizes all identified clusters, their members and the individual scores of the clusters. If the output is saved to a file via the input parameter result_folder, this part of the output can be found in the following file: convergence_groups.txt
The following is an example of the information provided by the first three clusters:
res_gliph2$cluster_properties[1:3,]
As shown, several pieces of information are provided for each cluster:
The elements of this list, named with the corresponding cluster tags, contain data frames with the members of the respective cluster and all additional information provided by the user at the beginning of the algorithm in the parameter cdr3_sequences. If the output is stored in a file via the input parameter result_folder, this part of the output can be found as data frame in the following file: cluster_member_details.txt
The following is an example of the information of the first two members of the first cluster:
res_gliph2$cluster_list[[1]][1:2,]
This list contains all input parameters with which the run of the turbo_Gliph function was called to get the settings for other methods. Also, the user can check again at a later time which settings he had chosen for the run.
If the output is stored in a file via the input parameter result_folder, this part of the output can be found as a data frame in the following file: parameter.txt
With the publication of GLIPH2, Fisher's Exact Test was introduced in an attempt to compensate for the runtime disadvantage of GLIPH and to assign a significance value to the global similarities for better assessment. Additional changes, such as the altered clustering to clusters restricted to single motifs, prevent a comparability of the results of GLIPH and GLIPH2. With gliph_combined we provide the user with a function in which he can combine different features of both GLIPH versions for an individualized analysis adapted to his own goals. Additionally, we added the possibility to filter the global similarities similar to the local similarities based on thresholds for their significance for clustering.
The sequences are simply analyzed by calling the function gliph_combined with the input dataframe as cdr3_sequences parameter.
res_gliph_combined <- turboGliph::gliph_combined(cdr3_sequences = gliph_input_data, n_cores = 1)
The default settings of gliph_combined are similar to the analysis of turbogliph, except that motifs for local similarities must start within three amino acid positions. The following shows the settings that reflect the gliph2 function:
res_gliph_combined_2 <- turboGliph::gliph_combined(cdr3_sequences = gliph_input_data, min_seq_length = 0, local_method = "fisher", boost_local_significance = TRUE, global_method = "fisher", clustering_method = "GLIPH2.0", scoring_method = "GLIPH2.0", n_cores = 1)
The intention of the implementation of gliph_combined, was to combine the features of GLIPH and GLIPH2 in one function in an individualizable way. The following settings perform the search for local and global similarities with the Fisher's Exact Test, as in gliph2, but cluster and score the clusters afterwards as in turbogliph.
res_gliph_combined_3 <- turboGliph::gliph_combined(cdr3_sequences = gliph_input_data, min_seq_length = 0, local_method = "fisher", boost_local_significance = TRUE, global_method = "fisher", clustering_method = "GLIPH1.0", scoring_method = "GLIPH1.0", n_cores = 1)
The output of the function is a list containing the identified similarities, the result of the clustering and some intermediate results of the algorithm. In the following, the complex output is explained in more detail.
In this list element the results of the repeated random sampling, if performed, are summarized as a data frame. The columns of this data frame represent the individual motifs named in the column name. The rows represent the individual simulations of the repeated random sampling with the reference repertoire, where the first row refers to the sample set. The cells of the data frame contain numerical values that represent the frequency of the motif (indicated by the column) in the respective simulation step or in the sample set (indicated by the row).
If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: kmer_resample_sim_depth_log.txt
The frequencies of six motifs in the sample set and in the first four simulations are given below as examples:
res_gliph_combined$sample_log[1:5, 1:6]
This list element contains the summary of the statistical evaluation of the identified motifs and is itself a list containing on the one hand the summary of all motifs (all_motifs) and on the other hand the summary of the significantly enriched motifs (selected_motifs).
If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following files:
File name of selected_motifs : local_similarities_minp_lcminpminovelcminovekmer_mindepthkmer_mindepth.txt
File name of all_motifs : all_motifs.txt
In the following, the statistics of the first three significantly enriched motifs are presented as an example:
res_gliph_combined$motif_enrichment$selected_motifs[1:3,]
The columns contain the following information:
If Fisher's Exact Test was carried out for global simialrity search, this list element contains the summary of the statistical evaluation of the identified global structures and is itself a list containing on the one hand the summary of all global structures (all_structs) and on the other hand the summary of the significantly enriched global structures (selected_structs). If the output is saved to a file via the input parameter result_folder, this part of the output can be found in the following files:
File name of selected_structs : global_similarities_minp_gcminpminovegcminovekmer_mindepthgckmer_mindepth.txt
File name of all_structs : all_global_similarities.txt
The following is an example of the statistics of the first three structures:
res_gliph_combined_2$global_enrichment$selected_structs[1:3,]
The columns contain the following information:
In this data frame, all connections of the sequences are clustered by local and global similarities. The first two columns contain the connected sequences, the third column the type of connection (local, global or singleton for sequences without connections) and the fourth column the tag of the cluster.
If the output is stored in a file via the input parameter result_folder, this part of the output can be found in the following file: clone_network.txt
The following is an example of the first three connections:
res_gliph_combined$connections[1:3,]
This data frame summarizes all identified clusters, their members and the individual scores of the clusters. If the output is saved to a file via the input parameter result_folder, this part of the output can be found in the following file: convergence_groups.txt
The following is an example of the information provided by the first three clusters:
res_gliph_combined$cluster_properties[1:3,]
As shown, several pieces of information are provided for each cluster:
The elements of this list, named with the corresponding cluster tags, contain data frames with the members of the respective cluster and all additional information provided by the user at the beginning of the algorithm in the parameter cdr3_sequences. If the output is stored in a file via the input parameter result_folder, this part of the output can be found as data frame in the following file: cluster_member_details.txt
The following is an example of the information of the first two members of the first cluster:
res_gliph_combined$cluster_list[[1]][1:2,]
This list contains all input parameters with which the run of the turbo_Gliph function was called to get the settings for other methods. Also, the user can check again at a later time which settings he had chosen for the run.
If the output is stored in a file via the input parameter result_folder, this part of the output can be found as a data frame in the following file: parameter.txt
In this package we provide a method to visualize the results of turboGliph, gliph2 and gliph_combined. The default configuration provides the plot shown below.
plot_network(clustering_output = res_turbogliph, n_cores = 1)
This plot is interactive. Scrolling zooms, hovering over a node displays additional information about that node, and clicking on a node highlights all direct neighbours. Individual clusters can be selected by the cluster tag using the selection field in the upper left corner.
With the default configuration all nodes have the same size. The nodes are colored according to the log value of their cluster score using the viridis palette. Low scores (and thus more significant clusters) are represented by a high proportion of purple. High scores (and thus less significant clusters) are represented by a high proportion of yellow.
The provided method offers additional options to individualize the graph. For example, the nodes can be automatically colored according to the values of a particular column in the input data frame. For each sequence, individually selected colors can also be specified in a column named color in the input data frame. Also the size of the nodes can be changed automatically by values from a column in the input data frame. For more details, please look in the documentation of plot_network. Additionally the used color palette for nodes, the colors for edges and some more is customizable (read the documentation for more details).
In the following, an example plot is generated in which the nodes are colored according to the pathogen against which these TCRs are directed. Additionally, the color palette is changed and two additional properties of the sequences are shown in the info box. It can be clearly seen that turboGliph accumulates TCRs with identical pathogen specificity, characterized by identical node color, in the convergence groups with low scores.
plot_network(clustering_output = res_turbogliph, color_info = "antigen.species", color_palette = grDevices::rainbow, local_edge_color = "darkgrey", global_edge_color = "orange", show_additional_columns = c("antigen.species", "antigen"), cluster_min_size = 6, n_cores = 1)
The parameter show_additional_columns is a powerful tool to assemble the info box displayed by hovering over the nodes with all information you need to see. By default, the cluster tag and total cluster score are shown. For the function gliph2, additionally the fisher score is specified. In the following list, all further package intern values for this parameter are listed. Additional values (as in the example above) refer to columns in the input data frame cdr3_sequences.
lowest.hlas : listing of enriched HLA alleles (optional)
GLIPH2 :
This function is an automated implementation of the method used in the GLIPH paper for in silico prediction of CDR3b sequences with similar specificity based on a specificity cluster identified by GLIPH. The underlying algorithm is based on the method used in the original publication.(^1)
Briefly, depending on the CDR3b sequence length, a global positional weight matrix (PWM) is formed with the positional abundances of amino acids in the cluster. Based on the probabilities of the PWM, a huge number of random sequences are generated. To rank the generated sequences, a score is assigned to each sequence. This score is calculated from the product of the frequencies of the first ten N-terminal amino acids of a sequence in the PWM. In addition, this score can be normalized to the probability of obtaining at most such a low score in a naive reference repertoire.
This function is used by passing the results of the turbo_gliph, gliph2 or gliph_combined function and the tag of the desired specificity cluster to the function. In this case, the first cluster is used as an example. The parameter make_figure is used to specify whether a figure similar to the original publication should be automatically displayed for convergence of the scores of the generated sequences.(^1)
new_seqs <- turboGliph::de_novo_TCRs(convergence_group_tag = res_turbogliph$cluster_properties$tag[1], clustering_output = res_turbogliph, sims = 10000, make_figure = TRUE, n_cores = 1)
The generated sequences and their scores can be found as a data frame in the first element of the output list:
new_seqs$de_novo_sequences[1:3,]
1: Glanville, Jacob, et al. "Identifying specificity groups in the T cell receptor repertoire." Nature 547.7661 (2017): 94.
2: https://github.com/immunoengineer/gliph
3: Huang, Huang, et al. "Analyzing the Mycobacterium tuberculosis immune response by T-cell receptor clustering with GLIPH2 and genome-wide antigen screening." Nature Biotechnology 38.10 (2020): 1194-1202.
4: http://50.255.35.37:8080/tools
packageVersion("turboGliph") sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.