run_matrix_spma: Matrix-based Spectrum Motif Analysis

View source: R/main.R

run_matrix_spmaR Documentation

Matrix-based Spectrum Motif Analysis

Description

SPMA helps to illuminate the relationship between RBP binding evidence and the transcript sorting criterion, e.g., fold change between treatment and control samples.

Usage

run_matrix_spma(
  sorted_transcript_sequences,
  sorted_transcript_values = NULL,
  transcript_values_label = "transcript value",
  motifs = NULL,
  n_bins = 40,
  midpoint = 0,
  x_value_limits = NULL,
  max_model_degree = 1,
  max_cs_permutations = 1e+07,
  min_cs_permutations = 5000,
  max_hits = 5,
  threshold_method = "p_value",
  threshold_value = 0.25^6,
  max_fg_permutations = 1e+06,
  min_fg_permutations = 1000,
  e = 5,
  p_adjust_method = "BH",
  n_cores = 1,
  cache = paste0(tempdir(), "/sc/")
)

Arguments

sorted_transcript_sequences

named character vector of ranked sequences (only containing upper case characters A, C, G, T), where the names are RefSeq identifiers and sequence type qualifiers ("3UTR", "5UTR" or "mRNA"), separated by "|", e.g. "NM_010356|3UTR". Names are only used to cache results. The sequences in sorted_transcript_sequences must be ranked (i.e., sorted). Commonly used sorting criteria are measures of differential expression, such as fold change or signal-to-noise ratio (e.g., between treatment and control samples in gene expression profiling experiments).

sorted_transcript_values

vector of sorted transcript values, i.e., the fold change or signal-to-noise ratio or any other quantity that was used to sort the transcripts that were passed to run_matrix_spma or run_kmer_spma (default value is NULL). These values are displayed as a semi-transparent area over the enrichment value heatmaps of spectrum plots.

transcript_values_label

label of transcript sorting criterion (e.g., "log fold change", default value is "transcript value"), only shown if !is.null(sorted_transcript_values)

motifs

a list of motifs that is used to score the specified sequences. If is.null(motifs) then all Transite motifs are used.

n_bins

specifies the number of bins in which the sequences will be divided, valid values are between 7 and 100

midpoint

for enrichment values the midpoint should be 1, for log enrichment values 0 (defaults to 0)

x_value_limits

sets limits of the x-value color scale (used to harmonize color scales of different spectrum plots), see limits argument of continuous_scale (defaults to NULL, i.e., the data-dependent default scale range)

max_model_degree

maximum degree of polynomial

max_cs_permutations

maximum number of permutations performed in Monte Carlo test for consistency score

min_cs_permutations

minimum number of permutations performed in Monte Carlo test for consistency score

max_hits

maximum number of putative binding sites per mRNA that are counted

threshold_method

either "p_value" (default) or "relative". If threshold_method equals "p_value", the default threshold_value is 0.25^6, which is lowest p-value that can be achieved by hexamer motifs, the shortest supported motifs. If threshold_method equals "relative", the default threshold_value is 0.9, which is 90% of the maximum PWM score.

threshold_value

semantics of the threshold_value depend on threshold_method (default is 0.25^6)

max_fg_permutations

maximum number of foreground permutations performed in Monte Carlo test for enrichment score

min_fg_permutations

minimum number of foreground permutations performed in Monte Carlo test for enrichment score

e

integer-valued stop criterion for enrichment score Monte Carlo test: aborting permutation process after observing e random enrichment values with more extreme values than the actual enrichment value

p_adjust_method

adjustment of p-values from Monte Carlo tests to avoid alpha error accumulation, see p.adjust

n_cores

the number of cores that are used

cache

either logical or path to a directory where scores are cached. The scores of each motif are stored in a separate file that contains a hash table with RefSeq identifiers and sequence type qualifiers as keys and the number of putative binding sites as values. If cache is FALSE, scores will not be cached.

Details

In order to investigate how motif targets are distributed across a spectrum of transcripts (e.g., all transcripts of a platform, ordered by fold change), Spectrum Motif Analysis visualizes the gradient of RBP binding evidence across all transcripts.

The matrix-based approach skips the k-merization step of the k-mer-based approach and instead scores the transcript sequence as a whole with a position specific scoring matrix.

For each sequence in foreground and background sets and each sequence motif, the scoring algorithm evaluates the score for each sequence position. Positions with a relative score greater than a certain threshold are considered hits, i.e., putative binding sites.

By scoring all sequences in foreground and background sets, a hit count for each motif and each set is obtained, which is used to calculate enrichment values and associated p-values in the same way in which motif-compatible hexamer enrichment values are calculated in the k-mer-based approach. P-values are adjusted with one of the available adjustment methods.

An advantage of the matrix-based approach is the possibility of detecting clusters of binding sites. This can be done by counting regions with many hits using positional hit information or by simply applying a hit count threshold per sequence, e.g., only sequences with more than some number of hits are considered. Homotypic clusters of RBP binding sites may play a similar role as clusters of transcription factors.

Value

A list with the following components:

foreground_scores the result of score_transcripts for the foreground sets (the bins)
background_scores the result of score_transcripts for the background set
enrichment_dfs a list of data frames, returned by calculate_motif_enrichment
spectrum_info_df a data frame with the SPMA results
spectrum_plots a list of spectrum plots, as generated by score_spectrum
classifier_scores a list of classifier scores, as returned by classify_spectrum

See Also

Other SPMA functions: classify_spectrum(), run_kmer_spma(), score_spectrum(), subdivide_data()

Other matrix functions: calculate_motif_enrichment(), run_matrix_tsma(), score_transcripts(), score_transcripts_single_motif()

Examples

# example data set
background_df <- transite:::ge$background_df
# sort sequences by signal-to-noise ratio
background_df <- dplyr::arrange(background_df, value)
# character vector of named and ranked (by signal-to-noise ratio) sequences
background_seqs <- gsub("T", "U", background_df$seq)
names(background_seqs) <- paste0(background_df$refseq, "|",
  background_df$seq_type)

results <- run_matrix_spma(background_seqs,
                           sorted_transcript_values = background_df$value,
                           transcript_values_label = "signal-to-noise ratio",
                           motifs = get_motif_by_id("M178_0.6"),
                           n_bins = 20,
                           max_fg_permutations = 10000)

## Not run: 
results <- run_matrix_spma(background_seqs,
                           sorted_transcript_values = background_df$value,
                           transcript_values_label = "SNR") 
## End(Not run)


kkrismer/transite documentation built on July 13, 2024, 8:01 a.m.