separateSubunits: Separate multi-subunit protein names
In HelenLindsay/AbNames: Standardize Antibody Names

separateSubunits

R Documentation

Separate multi-subunit protein names

Description

Separate names of antibodies against multi-subunit proteins, e.g. CD235ab or CD66ace, into one subunit per row.

Two subunit patterns are considered. In the first, subunits are lower case letters and the gene name has no separator, e.g. CD66ace is composed of subunits CD66a, CD66c and CD66e. In the second, subunits are uppercase letters and are separated from the gene name with a "-", e.g. HLA-A/C/E is composed of subunits HLA-A, HLA-C and HLA-E. Both patterns require at least 2 capital letters or numbers followed by at least 2 possible subunits. There may be a separator between the two groups and/or between the individual subunit letters. At present, the between-group separators are -, . and space, and the between-subunit separators are / and .
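The two patterns can be approximated with a single regular expression in base R. The sketch below is an illustration only and is not the pattern used inside the package: the helper name split_subunits_sketch is hypothetical, and the upper limit of 5 subunit letters is an assumption chosen so that longer words such as "alpha/beta" are not treated as subunit lists.

```r
# Hypothetical helper, not part of AbNames: a rough approximation of the
# two subunit patterns described above.
split_subunits_sketch <- function(x) {
  # Group 1: at least 2 capital letters or digits (the gene name stem).
  # Group 2: optional between-group separator (-, . or space).
  # Group 3: 2 to 5 single-letter subunits, all lower case or all upper
  #          case, each optionally followed by / or .
  #          (the upper limit of 5 is an assumption of this sketch).
  pattern <- paste0("^([A-Z0-9]{2,})([-. ]?)",
                    "((?:[a-z][/.]?){2,5}|(?:[A-Z][/.]?){2,5})$")
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]

  # No match: return the name unchanged.
  if (length(m) == 0) return(x)

  # Drop the between-subunit separators and split into single letters.
  subunit_letters <- strsplit(gsub("[/.]", "", m[4]), "")[[1]]

  # Re-attach the stem (and its separator) to every subunit letter.
  paste0(m[2], m[3], subunit_letters)
}

split_subunits_sketch("CD66ace")
split_subunits_sketch("HLA-A/C/E")
split_subunits_sketch("TCR alpha/beta")
```

Like the function documented here, this sketch does not check whether the guesses are real subunits, so a name such as "HLA-DR" is also split.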

Subunits should be converted from Greek symbols before applying this function.

At present, user-supplied regex patterns are not supported.

Usage

separateSubunits(df, ab = "Antigen", new_col = "subunit")

Arguments

`df`	A data.frame or tibble
`ab`	(character(1), default "Antigen") Name of the column containing antibody names
`new_col`	(default "subunit") Name of the new column containing guesses for single subunit names

Value

df, with a new column (named by new_col, default "subunit") containing potential individual subunits. Original rows of df are replicated once for each subunit, i.e. the returned data.frame is in long format.
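As an illustration of the long format, the base R sketch below replicates each row of a small data.frame once per subunit guess. The per-row guesses are written by hand here, and using NA for a name with no subunits is an assumption of this sketch, not a statement about what the function returns for such rows.

```r
df <- data.frame(ID = c("A", "B"),
                 Antigen = c("CD235ab", "CD19"),
                 stringsAsFactors = FALSE)

# Hypothetical subunit guesses, one list element per row of df.
# NA for the unsplit name is an assumption made for this sketch.
subunits <- list(c("CD235a", "CD235b"), NA_character_)

# Repeat each row index once per guess, then attach the guesses.
long_df <- df[rep(seq_len(nrow(df)), lengths(subunits)), , drop = FALSE]
long_df$subunit <- unlist(subunits)
rownames(long_df) <- NULL

long_df
```

Row "A" appears twice in long_df, once for each of its two subunit guesses, while row "B" appears once.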

Author(s)

Helen Lindsay

Examples

df <- data.frame(ID = LETTERS[1:5],
                Antigen = c("CD235a/b", "CD235ab",
                            "HLA-ABC", "HLA-DR", "TCR alpha/beta"))

#Note that in this example, the TCR is not split as "alpha/beta" is too long
#to match the splitting pattern.  Also note that HLA-DR is split - this
#function doesn't check whether the results are real protein subunits.
separateSubunits(df)

HelenLindsay/AbNames documentation built on June 6, 2023, 1:18 p.m.