knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Overview

AbNames performs three tasks:

Antibody names are not reported in a consistent format in published data. Antibodies are often named according to the antigen they target, which may not be the same as the name of the protein (complex) the antigen is part of. Antibodies may target multi-subunit protein complexes, and this can be reflected in the name, e.g. an antibody against the T-cell receptor alpha and beta subunits might be named TCRab. Reported antibody names may also include the name of the clone the antibody is derived from or the names of fluorophores or DNA-oligos the antibody is conjugated with. For these reasons, it can be difficult to exactly match antibody names with gene or protein names. As cell surface antigens often have very similar names, searching for partial matches to names in free-text gene descriptions is challenging and error-prone.
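The difficulty with partial matching can be seen in a minimal base R sketch. The gene symbols below are real, but the short vector is only an illustration and is not drawn from the AbNames data sets.

```r
# Illustrative only: a handful of real gene symbols that contain "CD3"
symbols <- c("CD3D", "CD3E", "CD3G", "CD33", "CD34", "CD38")

# A partial (substring) match for "CD3" hits every symbol
partial_hits <- symbols[grepl("CD3", symbols, fixed = TRUE)]

# An exact match finds nothing, because "CD3" is not itself in the vector
exact_hits <- symbols[symbols == "CD3"]

partial_hits
exact_hits
```

The partial match returns every candidate and the exact match returns none, so neither search identifies a single gene on its own.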

Data

AbNames contains several curated gene name data sets for matching to antibody names.

Gene aliases

The gene_aliases data set is primarily based on the protein-coding and gene groups tables from the HUGO Gene Nomenclature Committee (HGNC). These include previous (obsolete) gene names and aliases, which in our experience are useful for matching to antibody names. Non-ambiguous gene aliases from Ensembl (fetched via the Bioconductor package biomaRt) and NCBI (fetched via the NCBI FTP site and the Bioconductor package org.Hs.eg.db), and non-ambiguous protein names from the Cell Surface Protein Atlas (CSPA) and the CellMarker protein database (http://xteam.xbio.top/CellMarker/), have been added. Mappings between HGNC, Ensembl and NCBI (Entrez) IDs are mostly based on HGNC, with some obsolete Ensembl IDs corrected using the Ensembl data.

The gene_aliases data set is in long format (one alias per row).

Load using:

library(AbNames)
library(dplyr)

data("gene_aliases", package = "AbNames") 

# Show the first entries of gene_aliases.
# glimpse prints one line per column, showing the first values of each column
dplyr::glimpse(gene_aliases) 

# (Note: it isn't necessary to use dplyr:: to call "glimpse" as dplyr is loaded
# with the library call above. This syntax is used to make it clear which
# packages functions belong to)

BioLegend antibodies

BioLegend is a major supplier of antibodies, and provides several antibody panels for CITE-seq analyses. The data set "totalseq" is a re-formatted version of the TotalSeq barcodes data sheets available from the BioLegend website, including BioLegend antibody names and Ensembl gene IDs. The isotypes of the antibodies are not included in the TotalSeq data sheets, and only human antibodies and isotype controls are included. Missing and incomplete Ensembl IDs have been manually corrected.

Load using:

data("totalseq", package = "AbNames")
dplyr::glimpse(totalseq)

Note that the "Antigen" column here refers to antibody names with prefixes such as "anti-human" removed.
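As a rough illustration of this kind of prefix removal, the base R sketch below strips a leading "anti-human" from some hypothetical antibody names. It is not the code used to build the totalseq table.

```r
# Hypothetical antibody names, for illustration only
ab_names <- c("anti-human CD3", "Anti-Human CD19", "CD45RA")

# Strip a leading "anti-human" (case-insensitive) and any following whitespace
antigen <- sub("^anti-human\\s*", "", ab_names, ignore.case = TRUE)

antigen
```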

CITE-seq antibodies

This is a table matching antibody names to gene and protein IDs from >40 data sets with publicly available CITE-seq data. These are data sets that we have worked with, and we would be happy to add others if provided. The AbNames package collects the functions that were used to create this table. The table includes manually curated matches between antibody names and gene IDs for cases where we were unable to find an exact match to the antibody name provided. Isotype controls were manually identified based on either the name, e.g. "IgG1 Isotype Ctrl", or the reactivity, e.g. "Mouse IgG2b".

Load using:

# As the citeseq data set contains raw data, it is loaded differently
# than the other data sets 

citeseq_fname <- system.file("extdata", "citeseq.csv", package = "AbNames")
citeseq <- read.csv(citeseq_fname) %>% unique()
dplyr::glimpse(citeseq)

Here the column "Antigen" refers to the name given by the authors of the study in the table of resources used. Some minimal formatting has been done, for example fixing obvious data import errors and replacing Greek characters.

Creating a query table

To illustrate the problem with matching antibody names to gene names, let's look at an example from the citeseq data set.

# The regular expression inside "grepl" searches for PD.L1, PD-L1, PDL1 or CD274

cd274 <- citeseq %>%
    dplyr::filter(grepl("PD[\\.-]?L1|CD274", Antigen)) %>%
    dplyr::pull(Antigen) %>%
    unique()

cd274

All of the antigen names above refer to the same cell surface protein, as reported in different studies.

Default query table

We will demonstrate how to create a default query table using the raw CITE-seq data set. This collects information from the reagents tables provided as supplementary material in the studies used.

# Add an ID column to the citeseq data set to allow the results table to be merged
citeseq <- AbNames::addID(citeseq)

# Remove isotype control rows, as we are only searching human proteins
controls <- dplyr::filter(citeseq, Isotype_Control)
citeseq <- dplyr::filter(citeseq, ! Isotype_Control)

# TO DO: THE CSV FILE SHOULDN'T HAVE THE SYMBOLS
#id_cols <- c("HGNC_ID", "HGNC_SYMBOL", "ENSEMBL_ID", "UNIPROT_ID")


# Select just the columns that are needed for querying:
# the antigen name and an ID column, which here combines the antigen
# and the study as there are multiple studies
citeseq_q <- citeseq %>% dplyr::select(ID, Antigen)

# Apply the default transformation sequence and make a query table in long format
query_df <- AbNames::makeQueryTable(citeseq_q, ab = "Antigen")

# Print one example antigen from the query table:
query_df %>%
    dplyr::filter(ID == "TCR alpha/beta__Hao_2021")

The example above shows that each antibody name has been reformatted in several ways. When querying a data set, we search for an exact match to any of these strings.
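The exact-matching idea can be sketched with two toy long-format tables in base R. All IDs, labels in the name column and strings below are hypothetical, and merge stands in for whatever join the package performs internally.

```r
# Toy query table in long format: several candidate strings per antibody ID.
# The IDs, the "name" labels and the strings are hypothetical.
query <- data.frame(
    ID    = c("ab1", "ab1", "ab2", "ab2"),
    name  = c("Antigen", "no_dash", "Antigen", "upper"),
    value = c("PD-L1", "PDL1", "Cd274", "CD274")
)

# Toy alias table in long format: one alias per row (hypothetical subset)
aliases <- data.frame(
    value       = c("CD274", "PD-L1", "PDL1"),
    HGNC_SYMBOL = c("CD274", "CD274", "CD274")
)

# Exact matching is an inner join on the candidate string;
# "Cd274" is dropped because it has no exact counterpart
hits <- merge(query, aliases, by = "value")

# Every ID with at least one exact hit counts as matched
matched_ids <- unique(hits$ID)
hits
```

Both toy antibodies are matched to the same symbol, although they were reported under different names.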

Querying a dataset

Our pipeline is to query the HGNC data first, and then to use other data sets and the antibody vendor information to find matches for the antibodies that remain unmatched.

By default, we require that matches must be found for all subunits of a multi-subunit protein to avoid incorrect matches.
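A minimal base R sketch of this rule, assuming a results table with one row per queried subunit. The column names and values are illustrative assumptions (GENEX and XY_complex are placeholders), not the package's internal representation.

```r
# Toy search result: one row per queried subunit. All values are hypothetical.
# HGNC_SYMBOL is NA where no alias matched.
subunit_hits <- data.frame(
    ID          = c("TCRab", "TCRab", "XY_complex", "XY_complex"),
    subunit     = c("alpha", "beta", "X", "Y"),
    HGNC_SYMBOL = c("TRA", "TRB", "GENEX", NA)
)

# For each ID, check whether every subunit found a match
all_matched <- tapply(!is.na(subunit_hits$HGNC_SYMBOL), subunit_hits$ID, all)

# Keep only the IDs where all subunits were matched
complete_ids <- names(all_matched)[all_matched]
kept <- subunit_hits[subunit_hits$ID %in% complete_ids, ]
kept
```

The placeholder complex is dropped entirely, even though one of its subunits matched, which avoids reporting a partial and possibly incorrect annotation.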

Here we will query the HGNC data set for the CITE-seq antibodies, using the query table created above.

set.seed(549867)
alias_results <- searchAliases(query_df)

# Print 10 random results:
alias_results %>%
    dplyr::select(ID, name, value, symbol_type) %>%
    dplyr::ungroup() %>%
    dplyr::sample_n(10)

The results table above contains (just) the matches between the query table and the gene aliases table. In the example above, we see in the "name" column the formatting function that generated the string(s) in the "value" column that were matched, and in the "symbol_type" column which column of the HGNC data was matched. If there are multiple matches for a given ID, the official symbol ("HGNC_SYMBOL") is preferred over aliases or previous symbols.
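The preference for the official symbol can be sketched as a simple ranking. Only the label "HGNC_SYMBOL" comes from the text above; the "ALIAS" label and the placeholder IDs are assumptions made for illustration, and this is not the package's implementation.

```r
# Toy matches for two antibody IDs. All values are hypothetical.
matches <- data.frame(
    ID          = c("ab1", "ab1", "ab2"),
    value       = c("CD4", "CD4", "LFA-1"),
    symbol_type = c("ALIAS", "HGNC_SYMBOL", "ALIAS"),
    HGNC_ID     = c("id_B", "id_A", "id_C")
)

# Rank the match types: official symbol first, everything else after
type_rank <- ifelse(matches$symbol_type == "HGNC_SYMBOL", 1L, 2L)

# Within each ID, keep only the rows with the best (lowest) rank
best_rank <- ave(type_rank, matches$ID, FUN = min)
preferred <- matches[type_rank == best_rank, ]
preferred
```

An ID that only has alias matches keeps them, whereas an ID with an official symbol match loses its alias matches.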

We may wish to review the matches before merging the results into the original table. For example, we can check for matches to different genes where the name was not guessed to be a multi-subunit protein.

alias_results <- alias_results %>%
    dplyr::group_by(ID) %>%
    dplyr::mutate(n_ids = dplyr::n_distinct(HGNC_ID)) # Count distinct IDs

# Select antigens where there are matches to multiple genes but not
# because the antibody is against a multi-gene protein 
multi_gene <- alias_results %>%
    dplyr::filter(n_ids > 1, ! all(name %in% c("TCR_long", "subunit"))) %>%
    dplyr::select(ID, name, value, HGNC_ID, ALT_ID)

# Look at the first group in multi_gene.
# Set interactive = TRUE (the default) for interactive exploration
showGroups(multi_gene, interactive = FALSE)

# This is an example where the antibody is against a heterodimeric protein.
# We can confirm this by looking up the vendor catalogue number:
citeseq %>%
    dplyr::filter(ID == "CD11a.CD18__Wu_2021") %>%
    dplyr::select(Antigen, Cat_Number, Vendor)

For now, we collapse the matches for each antibody ID into a single row before merging the results into the citeseq data.

nrow(alias_results)

id_cols <- c("HGNC_ID", "HGNC_SYMBOL", "ENSEMBL_ID", "UNIPROT_ID", "ALT_ID")

# Collapse matches to several genes into one row per ID,
# select just columns of interest

alias_results <- alias_results %>%
    dplyr::group_by(ID) %>%
    dplyr::select(matches("ID|HGNC"), name) %>% # Select ID and HGNC columns
    unique() %>% # Collapse results with same ID from different queries

    # Collapse multi-subunit entries, convert "NA" to NA
    dplyr::summarise(dplyr::across(all_of(id_cols), ~toString(unique(.x)))) %>%
    dplyr::mutate(dplyr::across(all_of(id_cols), ~na_if(.x, "NA")))   

nrow(alias_results)

# We will remove the existing ID columns from the citeseq data set
# before joining in the query results

citeseq <- citeseq %>%
    dplyr::select(-any_of(id_cols)) %>% # Saved citeseq data already includes IDs
    dplyr::left_join(alias_results, by = "ID") %>%
    dplyr::relocate(ID, Antigen, Cat_Number, HGNC_ID) %>%
    unique()

head(citeseq)

Filling missing information

Filling using the TotalSeq dataset

Another approach to annotating antibodies is to use the totalseq data. We are not confident about the assignment of antibodies to Ensembl identifiers. However, this may not be a problem if the aim is simply to standardise names between datasets or to a common reference.

By default, the function searchTotalseq matches on the catalogue number (Cat_Number), Antigen and Clone columns, in that order. The matching columns may be configured, but column names in the query data set should match those in the totalseq data set.
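The sequential strategy can be sketched in base R as follows. This is an illustration under the assumption that each column is tried only for rows that are still unmatched; it is not the implementation of searchTotalseq, and all table values are hypothetical.

```r
# Toy reference and query tables. All values are hypothetical.
reference <- data.frame(
    Cat_Number = c("100", "200", "300"),
    Antigen    = c("CD3", "CD4", "CD8"),
    Clone      = c("cloneA", "cloneB", "cloneC"),
    ENSEMBL_ID = c("ens_1", "ens_2", "ens_3")
)
query <- data.frame(
    Cat_Number = c("100", NA, NA, NA),
    Antigen    = c("CD3e", "CD4", "CD8a", "CD19"),
    Clone      = c(NA, NA, "cloneC", "cloneZ")
)
query$ENSEMBL_ID <- NA_character_
query$matched_by <- NA_character_

# Try each column in turn, only for rows that are still unmatched
for (col in c("Cat_Number", "Antigen", "Clone")) {
    rows <- which(is.na(query$ENSEMBL_ID) & !is.na(query[[col]]))
    idx <- match(query[[col]][rows], reference[[col]])
    hit <- !is.na(idx)
    query$ENSEMBL_ID[rows[hit]] <- reference$ENSEMBL_ID[idx[hit]]
    query$matched_by[rows[hit]] <- col
}
query
```

Recording which column produced each match, as in the hypothetical matched_by column, makes it easier to review matches that relied on a less specific column.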

table(is.na(citeseq$ALT_ID))
cs <- citeseq %>%
    searchTotalseq()
table(is.na(cs$ALT_ID))

We can also try to annotate antibodies that could not be annotated using the HGNC by using the TotalSeq table. This could also be done as a first matching step.

count_missing <- function(df){
    dplyr::filter(df, is.na(ALT_ID)) %>% nrow()
}

data(totalseq)
ts <- totalseq %>%
    dplyr::select(any_of(colnames(citeseq)))

missing <- count_missing(citeseq)

# Fill in IDs where Antigen, Oligo, Clone and TotalSeq category match
citeseq <- citeseq %>%
    dplyr::rows_patch(ts,
                      by = c("Antigen", "Oligo_ID", "Clone", "TotalSeq_Cat"),
                      unmatched = "ignore")

missing_after_ts <- count_missing(citeseq)

# Missing before filling:
missing

# Still missing after filling with TotalSeq:
missing_after_ts

Inspecting groups before filling

When filling in information from a reference data set, it can be useful to look at entries where, for example, annotations are inconsistent. showGroups is a function that allows users to interactively print one or more groups from a grouped data.frame. We give a (non-interactive) example below.

Filling NAs using a reference data set

The function fillByGroup is used to fill in NAs in a grouped data.frame. It differs from tidyr::fill in its treatment of inconsistent values. Whereas tidyr::fill will fill using the first value, fillByGroup offers the option to fill in the most frequent value. This can be useful when filling antibody IDs given the antibody name and clone or catalogue number.
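A minimal base R sketch of filling with the most frequent value per group. The helper most_frequent is hypothetical and the data are toy values; fillByGroup itself may handle ties and edge cases differently.

```r
# Return the most frequent non-missing value of x
# (ties are broken by the sort order used by table())
most_frequent <- function(x) {
    x <- x[!is.na(x)]
    if (length(x) == 0) return(NA_character_)
    counts <- table(x)
    names(counts)[which.max(counts)]
}

# Toy data: one catalogue number with inconsistent annotation (hypothetical)
ab <- data.frame(
    Cat_Number = c("100", "100", "100", "100", "200"),
    HGNC_ID    = c("id_B", "id_B", "id_A", NA, NA)
)

# Fill NAs within each Cat_Number group with the group's most frequent value
group_mode <- ave(ab$HGNC_ID, ab$Cat_Number, FUN = most_frequent)
ab$HGNC_ID <- ifelse(is.na(ab$HGNC_ID), group_mode, ab$HGNC_ID)
ab
```

In this toy table the missing entry for catalogue number "100" is filled with "id_B", whereas a fill that copies the value directly above it would use "id_A". The group with no annotation at all stays NA.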

To fill using a reference data set, the strategy used is either to add a temporary ID column, join the two data.frames, fill and separate again using the ID column, or use dplyr::rows_patch. We will demonstrate the latter approach below.

# Before filling, count how many antibodies have not been matched.
original_nmatched <- count_missing(citeseq)

# Select some data to demonstrate filling:
# Get entries sharing the same catalogue number,
# where not every entry has a match
fill_demo <- citeseq %>%
    dplyr::group_by(Cat_Number) %>%
    dplyr::arrange(Cat_Number) %>%
    dplyr::filter(!is.na(Cat_Number), 
                  any(is.na(HGNC_ID)),
                  ! all(is.na(HGNC_ID)))

AbNames::showGroups(fill_demo, interactive = FALSE)

In the above example, we can see that for antibodies with catalogue number "300475", a match was found if the antibody was named "CD3E" but not if it was named "CD3". We can fill in the missing information using fillByGroup:

fill_demo <- AbNames::fillByGroup(fill_demo, "Cat_Number",
                                  fill = c("HGNC_ID", "TotalSeq_Cat", "Vendor",
                                           "ENSEMBL_ID", "UNIPROT_ID"),
                                  multiple = "mode") %>%
    dplyr::group_by(Cat_Number) # Re-group as fillByGroup ungroups

# Print out the first group again
AbNames::showGroups(fill_demo, interactive = FALSE)

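In the above call to fillByGroup, we set multiple = "mode", so where values within a group are inconsistent, missing entries are filled with the most frequent value.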

Now we will fill the gene information in the citeseq data similarly, and check how many antibodies have not been matched.

# We fill by grouping by the catalogue number:
citeseq <- citeseq %>%
     AbNames::fillByGroup("Cat_Number", multiple = "mode",
                          fill = c("HGNC_ID", "TotalSeq_Cat", "Vendor",
                                   "ENSEMBL_ID", "UNIPROT_ID"))

nmatched_after_fill <- count_missing(citeseq)

print("Unmatched before filling:")
original_nmatched
print("Unmatched after filling:")
nmatched_after_fill

Custom query table

AbNames provides a few template functions that can be used to construct a pipeline for creating a query table. The default query table uses the function defaultQuery to create a list of partial functions, which are then applied sequentially to the data.frame using magrittr::freduce. To use this strategy, all functions must accept the data.frame as the only required argument. Other arguments should be filled in when creating the partial functions.
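The idea of applying a list of single-argument functions in sequence can be sketched without any package dependencies. Here base R's Reduce stands in for magrittr::freduce and plain closures stand in for partial functions; add_upper and strip_chars are hypothetical helpers, not AbNames functions.

```r
# Two toy formatting steps. Each takes the data.frame as its first argument
# and has extra arguments that must be fixed in advance (hypothetical helpers).
add_upper <- function(df, col) {
    df$upper <- toupper(df[[col]])
    df
}
strip_chars <- function(df, col, pattern) {
    df$stripped <- gsub(pattern, "", df[[col]])
    df
}

# "Partial" functions: closures that only need the data.frame
steps <- list(
    function(df) add_upper(df, col = "Antigen"),
    function(df) strip_chars(df, col = "upper", pattern = "[-. ]")
)

# Apply the steps one after another, feeding each result into the next step
df <- data.frame(Antigen = c("PD-L1", "Pd.l1"))
result <- Reduce(function(acc, f) f(acc), steps, df)
result
```

Because every step has the same one-argument interface, steps can be added, removed or reordered by editing the list alone.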

The default function list can be used as a starting point to add extra formatting functions, or to remove or modify certain steps. The individual formatting functions are introduced below (so far, only those for the T-cell receptors).

# Get the default sequence of formatting functions 
default_funs <- AbNames::defaultQuery()

# Print the first two formatting functions as an example
default_funs[1:2]

T-cell receptors

We found that querying the HGNC gene description was the easiest way to match the names of antibodies against the T-cell receptor to their gene IDs. (An alternative method is to search for the gene symbol.)

We will start with a data.frame of antibodies against the T-cell receptor from the CITE-seq data set. We have already done some formatting of these names, e.g. for "TCR Va24-Ja18 (iNKT cell)" we have removed the section in brackets. We wish to create a query data.frame where each subunit of the TCR complex appears on a separate line.

Note: this function may cause false matches if the numbering of the gene name does not match that of the antibody.

tcr <- data.frame(Antigen =
                      c("TCR alpha/beta", "TCRab", "TCR gamma/delta", "TCRgd",
                        "TCR g/d", "TCR Vgamma9", "TCR Vg9", "TCR Vd2",
                        "TCR Vdelta2", "TCR Vα24-Jα18", "TCRVa24.Ja18", 
                        "TCR Valpha24-Jalpha18", "TCR Vα7.2", "TCR Va7.2",
                        "TCRa7.2", "TCRVa7.2", "TCR Vbeta13.1", "TCR γ/δ", 
                        "TCR Vβ13.1", "TCR Vγ9", "TCR Vδ2", "TCR α/β",
                        "TCRb", "TCRg"))


# First, we convert the Greek symbols to letters.  
# Note that as we are using replaceGreekSyms in a dplyr pipeline, we don't 
# put quotes around the column name, i.e. Antigen not "Antigen".

tcr <- tcr %>%
    dplyr::mutate(query = 
                    AbNames::replaceGreekSyms(Antigen, replace = "sym2letter"))

# Print out a few rows to see the result
tcr %>% 
    dplyr::filter(Antigen %in% c("TCR Vβ13.1", "TCR Vδ2", "TCR α/β"))
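For readers curious how such a replacement might work, here is a minimal base R sketch. The mapping from symbols to single Latin letters is an assumption for illustration, and replace_greek is a hypothetical helper, not the implementation of replaceGreekSyms.

```r
# Map a few Greek symbols to Latin letters (assumed mapping). Unicode escapes
# are used so that the code does not depend on the file encoding.
greek <- setNames(c("a", "b", "g", "d"),
                  c("\u03b1", "\u03b2", "\u03b3", "\u03b4"))

# Hypothetical helper: replace each symbol in turn with its Latin letter
replace_greek <- function(x, map = greek) {
    for (sym in names(map)) {
        x <- gsub(sym, map[[sym]], x, fixed = TRUE)
    }
    x
}

converted <- replace_greek(c("TCR \u03b1/\u03b2", "TCR V\u03b42"))
converted
```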

Now we use the function formatTCR to format the antibody names for querying the gene description field of the HGNC data set.

tcr_f <- AbNames::formatTCR(tcr, tcr = "query")

# Print out the first few rows
tcr_f %>%
    head() 
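The goal of one subunit per row can be illustrated with a base R sketch that splits names on "/". This is a simplified assumption about the reshaping step and does not reproduce the output format of formatTCR.

```r
# Hypothetical, already-cleaned TCR names
tcr_toy <- data.frame(Antigen = c("TCR alpha/beta", "TCR gamma/delta",
                                  "TCR Vdelta2"))

# Drop the "TCR " prefix, then split the remainder on "/" into subunits
chains <- strsplit(sub("^TCR\\s+", "", tcr_toy$Antigen), "/", fixed = TRUE)

# Repeat each antigen once per subunit to get a long-format table
tcr_long <- data.frame(
    Antigen = rep(tcr_toy$Antigen, lengths(chains)),
    subunit = unlist(chains)
)
tcr_long
```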

Notes about the gene aliases data set

The gene aliases data set is based on the annotation from the HUGO Gene Nomenclature Committee (HGNC). Annotation databases such as Ensembl and Entrez map between gene models independently, using different criteria, so they do not always agree on mappings between identifiers. In the gene aliases data set the HGNC mappings are given, and these were used when joining data from other sources. AbNames is focused on matching names more than identifiers, so we considered that the official naming organisation should be our reference. This also means that using an Ensembl, Entrez or UniProt ID to search for gene aliases may not work.

We found that the HGNC mappings between HGNC and Ensembl gene IDs almost always agreed. The HGNC uses UniProt Swiss-Prot IDs. When joining data from other sources, we tried to use two points of agreement, e.g. agreement between one type of ID and the official symbol, or agreement between two types of ID despite disagreement about the official symbol.

The code used for creating all of the data sets is available in data-raw. At the moment, we do not have a main file for regenerating the data sets - this is on our to do list!

TO DO

An example of (interactively) deciding what to do if groups are inconsistent.
Function to check new annotations against previous
Isotype controls
CD25 (good example, multiple genes), CD270, CD279 matched in HGNC?
CD279 PD-1 - example where only some are matched
Fix Hao Clone CD25 - Is it custom? No Oligo or Cat_Number for this one
Unintuitive to use quoted for splitUnnest?
Redocument data sets
HAO CD45 should have catalogue numbers!
union join back gene ID / get info by antigen
Put RRID into CITEseq
Export and demonstrate group_by_any? e.g.
    citeseq <- citeseq %>% group_by_any(c("Antigen", "Cat_Number"))
    showGroups(citeseq, 2, interactive = FALSE)
    ungroup(citeseq)
Add tag for non-human tagging antibodies
Problems:
    + HGNC_SYMBOL WRONG FOR CD158b (KIR2DL2/L3, NKAT2)? (NKAT2 = only one gene)
    + Triana RRID same for CD235a and CD235a - match via combination of RRID and Antigen
    + Triana group 5... - matching because of NA?
Update BIOMART entrez ids to match entrez / HGNC??
Check that UNIPROT IDs are valid (not all from CSPA are up to date)
Change name of data gene_aliases to just aliases
Rename "value" to "alias" in aliases table
Final check for ambiguous aliases after adding proteins
HGNC: TRAV10 = TCRAV24S1, NCBI:TRAV24 = TCRAV24S1
Be careful of case when removing ambiguous aliases - Met SLTM / MET
Hao CD25 - same HGNC ID reported twice HGNC:6008|HGNC:6008 and CD45RA
KLG / MAFA
Keep source in ID result table?
Is TCRg clone B1 = TCRg/d?
Add to vignette, how to check results in matchToCITEseq
translit characters in query table
find replacement dataset for diamonds in test
make a pipeline for regenerating gene_aliases
From gene_aliases, remove entries where everything is identical except symbol_type - prefer alias to alias name, previous_symbol to previous_name etc - cellmarker %>% unique()
cellmarker Alkaline phosphatase maps to 2 IDs
symbol_type is NA
biomart gives different entrez ids?
CD1b has two different HGNC_IDs SIGLEC5
To do: what is the point of "name" in query table?
why two rows for CD124|IL4RA in alias_results - aggregate source
export left_join_any
make a wrapper for group_by_any which ignores clone
duplicates when study includes same antigen multiple times, give an underscore # suffix?
To do: indicate cocktail membership
Should the ALT_ID be the symbol not the ID to allow easier matching?
Add checks in makeQueryTable, e.g. for brackets?
Fix errors, remove "ignore" for filling clone etc
Decide whether antibodies should be listed separately for each subexperiment
In SingleCellExperiment vignette, isotype control problem no longer happens, why?
rationale for not joining totalseq if there is already an id?
Check CD11a.CD18 Wu_2021, HGNC_SYMBOL not updated
CD112 Nectin 2 is unannotated
Is it actually necessary to have an ID column?
CD11a.CD18__Wu_2021 matches CD1111 and CD11D
showGroups should print ALT_ID as well as HGNC_ID
working with other species - we haven't done this, but the functions are included in the package
CD158b should be removed from data??

NOTE: I haven't written a convenience function for merging back into the original table yet.
TODO: QUERY FOR CD11/CD18 in protein ontology
Remove redundant results because of Antigen/greek_letter e.g. "KLRG1 (MAFA)-Qian_2020"

Examples to discuss?

"KIR2DL5" matches two genes HGNC:16345 HGNC:16346
"DR3 (TRAMP)__Liu_2021" - tramp alias is ambiguous

Data for using in tests? Some non-standard names, including duplicates

adt_nms <- c("KLRG1 (MAFA)", "KLRG1", "CD3 (CD3E)", "HLA.A.B.C", "HLA-A/B/C",
             "NKAT2", "CD66a.c.e", "CD66a_c_e", "CD11a/CD18 (LFA-1)")
# NKAT2 = CD158b

Examples from sharedSubstr

Why no match for CD158f (KIR2DL5)__Su_2020 CD158f? because KIR2DL5 matches A and B
CD16?? = FCGR3A (but also FCGR3B??), cat number 302061 correction comes from totalseq
Check for PD-1 / HGNC = PDCD1 - some filled by symbol?
CHECK CD3, some are CD3D, some CD3E - 300475 filled to CD3D?? - not filled for Stuart, no TotalSeq_Cat or vendor?
Careful with CD3.2 - Mimitou, not equal to CD32!
HLA-DR / HLA-DRA HGNC:4947 good example!

HLA_ABC__PomboAntunes_2021 maps to wrong symbol

Session Info

sessionInfo()


HelenLindsay/AbNames documentation built on June 6, 2023, 1:18 p.m.