anchor_pair_enrich: Determine enriched motifs in anchors
In jhammelman/spatzie: Identification of enriched motif pairs from chromatin interaction data

anchor_pair_enrich

R Documentation

Determine enriched motifs in anchors

Description

Determine whether motifs between paired bed regions have a statistically significant relationship. Options for significance are motif score correlation, motif count correlation, or hypergeometric motif co-occurrence.

Usage

anchor_pair_enrich(interaction_data, method = c("count", "score", "match"))

Arguments

interaction_data

an interactionData object of paired genomic regions

method

method for co-occurrence, valid options include:

`count`:	correlation between counts (for each anchor, tally positions where motif score > `5 * 10^{-5}`)
`score`:	correlation between motif scores (for each anchor, use the maximum score over all positions)
`match`:	association between motif matches (for each anchor, a match is defined if the is at least one position with a motif score > `5 * 10^{-5}`)

Value

an interactionData object where obj$pair_motif_enrich contains the p-values for significance of seeing a higher co-occurrence than what we get by chance.

Score-based correlation

We assume motif scores follow a normal distribution and are independent between enhancers and promoters. We can therefore compute how correlated scores of any two transcription factor motifs are between enhancer and promoter regions using Pearson's product-moment correlation coefficient:

r = \frac{\sum (x^{\prime}_i - \bar{x}^{\prime})(y^{\prime}_i - \bar{y}^{\prime})}{\sqrt{\sum(x^{\prime}_i - \bar{x}^{\prime})^2\sum(y^{\prime}_i - \bar{y}^{\prime})^2}}

, where the input vectors \boldsymbol{x} and \boldsymbol{y} from above are transformed to vectors \boldsymbol{x^{\prime}} and \boldsymbol{y^{\prime}} by replacing the set of scores with the maximum score for each region:

x^{\prime}_i = \max x_i

x^{\prime}_i is then the maximum motif score of motif a in the promoter region of interaction i, y^{\prime}_i is the maximum motif score of motif b in the enhancer region of interaction i, and \bar{x}^{\prime} and \bar{y}^{\prime} are the sample means.

Significance is then computed by transforming the correlation coefficient r to test statistic t, which is Student t-distributed with n - 2 degrees of freedom.

t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}

All p-values are calculated as one-tailed p-values of the probability that scores are greater than or equal to r.

Count-based correlation

Instead of calculating the correlation of motif scores directly, the count-based correlation metric first tallies the number of instances of a given motif within an enhancer or a promoter region, which are defined as all positions in those regions with motif score p-values of less than 5 * 10^{-5}. Formally, the input vectors \boldsymbol{x} and \boldsymbol{y} are transformed to vectors \boldsymbol{x^{\prime\prime}} and \boldsymbol{y^{\prime\prime}} by replacing the set of scores with the cardinality of the set:

x^{\prime\prime}_i = |x_i|

And analogous for y^{\prime\prime}_i. Finally, the correlation coefficient r between \boldsymbol{x^{\prime\prime}} and \boldsymbol{y^{\prime\prime}} and its associated significance are calculated as described above.

Match-based association

Instance co-occurrence uses the presence or absence of a motif within an enhancer or promoter to determine a statistically significant association, thus \boldsymbol{x^{\prime\prime\prime}} and \boldsymbol{y^{\prime\prime\prime}} are defined by:

x^{\prime\prime\prime}_i = \boldsymbol{1}_{x^{\prime\prime}_i > 0}

Instance co-occurrence is computed using the hypergeometric test:

p = \sum_{k=I_{ab}}^{P_a} \frac{binom(P_a, k) binom(n - P_a, E_b - k)}{binom(n, E_b)},

where I_{ab} is the number of interactions that contain a match for motif a in the promoter and motif b in the enhancer, P_a is the number of promoters that contain motif a (P_a = \sum^n_i x^{\prime\prime\prime}_i), E_b is the number of enhancers that contain motif b (E_b = \sum^n_i y^{\prime\prime\prime}_i), and n is the total number of interactions, which is equal to the number of promoters and to the number of enhancers.

Author(s)

Jennifer Hammelman

Konstantin Krismer

Examples

## Not run: 
genome_id <- "BSgenome.Mmusculus.UCSC.mm9"
if (!(genome_id %in% rownames(utils::installed.packages()))) {
  BiocManager::install(genome_id, update = FALSE, ask = FALSE)
}
genome <- BSgenome::getBSgenome(genome_id)

motifs_file <- system.file("extdata/motifs_subset.txt.gz",
                           package = "spatzie")
motifs <- TFBSTools::readJASPARMatrix(motifs_file, matrixClass = "PFM")

yy1_pd_interaction <- scan_motifs(spatzie::interactions_yy1, motifs, genome)
yy1_pd_interaction <- filter_motifs(yy1_pd_interaction, 0.4)
yy1_pd_count_corr <- anchor_pair_enrich(yy1_pd_interaction, method = "count")

## End(Not run)

res <- anchor_pair_enrich(spatzie::scan_interactions_example_filtered,
                          method = "score")

jhammelman/spatzie documentation built on Feb. 8, 2024, 8:50 a.m.