score_spectrum | R Documentation |
Spectrum scores are a means to evaluate whether a spectrum has a meaningful (i.e., biologically relevant) or a random pattern.
score_spectrum(
x,
p_values = array(1, length(x)),
x_label = "log enrichment",
sorted_transcript_values = NULL,
transcript_values_label = "transcript value",
midpoint = 0,
x_value_limits = NULL,
max_model_degree = 3,
max_cs_permutations = 1e+07,
min_cs_permutations = 5000,
e = 5
)
x | vector of values (e.g., enrichment values, normalized RBP scores) per bin |
p_values | vector of p-values (e.g., significance of enrichment values) per bin |
x_label | label of the values (e.g., "log enrichment") |
sorted_transcript_values | vector of sorted transcript values, i.e., the fold change or signal-to-noise ratio or any other quantity that was used to sort the transcripts that were passed to subdivide_data() |
transcript_values_label | label of the transcript sorting criterion (e.g., "log fold change") |
midpoint | midpoint of the value scale; for enrichment values the midpoint should be 1, for log enrichment values 0 (the default) |
x_value_limits | sets the limits of the x-value color scale (used to harmonize the color scales of different spectrum plots) |
max_model_degree | maximum degree of polynomial |
max_cs_permutations | maximum number of permutations performed in the Monte Carlo test for the consistency score |
min_cs_permutations | minimum number of permutations performed in the Monte Carlo test for the consistency score |
e | integer-valued stop criterion for the consistency score Monte Carlo test: the permutation process is aborted after observing e permuted scores that are at least as extreme as the observed consistency score |
One way to quantify the meaningfulness of a spectrum is to calculate the deviance between the linear interpolation of the scores of two adjoining bins and the score of the middle bin, for each position in the spectrum. The lower the score, the more consistent the trend in the spectrum plot. Formally, the local consistency score x_c is defined as

x_c = \frac{1}{n} \sum_{i = 1}^{n - 2}{\left|\frac{s_i + s_{i + 2}}{2} - s_{i + 1}\right|}.
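A minimal sketch of this calculation in R (an illustration, not the internal implementation of score_spectrum), assuming s is a numeric vector of bin scores:

# local consistency score: mean absolute deviance between the linear
# interpolation of two adjoining bins and the score of the middle bin
local_consistency_score <- function(s) {
  n <- length(s)
  i <- seq_len(n - 2)
  sum(abs((s[i] + s[i + 2]) / 2 - s[i + 1])) / n
}
local_consistency_score(runif(n = 40, min = -1, max = 1))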
In order to obtain an estimate of the significance of a particular score x_c', Monte Carlo sampling is performed by randomly permuting the coordinates of the scores vector s and recomputing x_c. The probability estimate \hat{p} is given by the lower tail version of the cumulative distribution function

\hat{Pr}(T(x)) = \frac{\sum_{i = 1}^n 1(T(y_i) \le T(x)) + 1}{n + 1},

where 1 is the indicator function, n is the sample size, i.e., the number of performed permutations, and T equals x_c in the above equation.
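A sketch of this permutation test, reusing the local_consistency_score() helper defined above (the helper and the fixed number of permutations are illustrative assumptions; score_spectrum itself adapts the number of permutations via min_cs_permutations, max_cs_permutations, and e):

# lower tail Monte Carlo p-value estimate for the observed consistency score
consistency_score_p_value <- function(s, n_perm = 1000) {
  observed <- local_consistency_score(s)
  permuted <- replicate(n_perm, local_consistency_score(sample(s)))
  (sum(permuted <= observed) + 1) / (n_perm + 1)
}
consistency_score_p_value(runif(n = 40, min = -1, max = 1))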
An alternative approach to assess the consistency of a spectrum plot is via polynomial regression. In a first step, polynomial regression models of various degrees are fitted to the data, i.e., the dependent variable s (vector of scores) and orthogonal polynomials of the independent variable b (vector of bin numbers). Secondly, the model that best reflects the true nature of the data is selected by means of the F-test. Lastly, the adjusted R^2 and the sum of squared residuals are calculated to indicate how well the model fits the data. These statistics are used as scores to rank the spectrum plots.
In general, the polynomial regression equation is

y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_m x_i^m + \epsilon_i,

where m is the degree of the polynomial (usually m \le 5) and \epsilon_i is the error term. The dependent variable y is the vector of scores s, and x to x^m are the orthogonal polynomials of the vector of bin numbers b.
Orthogonal polynomials are used in order to reduce the correlation between the different powers of b and therefore avoid multicollinearity in the model. This is important, because correlated predictors lead to unstable coefficients, i.e., the coefficients of a polynomial regression model of degree m can be greatly different from those of a model of degree m + 1. The orthogonal polynomials of vector b are obtained by centering (subtracting the mean), QR decomposition, and subsequent normalization.
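The sketch below illustrates this construction directly; in practice, R's poly() performs an equivalent computation:

# orthogonal polynomials of bin numbers b up to degree m:
# centering, QR decomposition, normalization
orthogonal_polynomials <- function(b, m) {
  centered <- b - mean(b)            # centering
  raw <- outer(centered, 0:m, `^`)   # columns 1, b, b^2, ..., b^m
  q <- qr.Q(qr(raw))                 # QR decomposition; columns of Q are already unit-normalized
  q[, -1, drop = FALSE]              # drop the constant column
}
head(orthogonal_polynomials(1:40, 3))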
Given the dependent variable y and the orthogonal polynomials x to x^m of b, the model coefficients \beta are chosen in a way that minimizes the deviance between the actual and the predicted values, characterized by

M(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_m x^m

M = argmin_{M}{\left(\sum_{i = 1}^n{L(y_i, M(x_i))}\right)},

where L(actual value, predicted value) denotes the loss function.

Ordinary least squares is used as the estimation method for the model coefficients \beta. The loss function of ordinary least squares is the sum of squared residuals (SSR) and is defined as follows:

SSR(y, \hat{y}) = \sum_{i = 1}^n{(y_i - \hat{y}_i)^2},

where y are the observed data and \hat{y} the model predictions.

Thus the ordinary least squares estimate of the coefficients \hat{\beta} (including the intercept \hat{\beta}_0) of the model M is defined by

\hat{\beta} = argmin_{\beta}{\left(\sum_{i = 1}^n{\left(y_i - \beta_0 - \sum_{j = 1}^m{\beta_j x_i^j}\right)^2}\right)}.
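Continuing the illustrative sketch above, the SSR and the adjusted R^2 of a fitted polynomial model can be obtained as follows:

# sum of squared residuals and adjusted R^2 of a degree-2 polynomial model
fit <- lm(s ~ poly(b, degree = 2))
ssr <- sum(residuals(fit)^2)
adj_r_squared <- summary(fit)$adj.r.squared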
After polynomial models of various degrees have been fitted to the data, the F-test is used to select the model that best fits the data. Since the SSR monotonically decreases with increasing model degree (model complexity), the relative decrease of the SSR between the simpler model and the more complex model must outweigh the increase in model complexity between the two models. The F-test gives the probability that a relative decrease of the SSR between the simpler and the more complex model given their respective degrees of freedom is due to chance. A low p-value indicates that the additional degrees of freedom of the more complex model lead to a better fit of the data than would be expected after a mere increase of degrees of freedom.
The F-statistic is calculated as follows:

F = \frac{(SSR_1 - SSR_2) / (p_2 - p_1)}{SSR_2 / (n - p_2)},

where SSR_i is the sum of squared residuals and p_i is the number of parameters of model i. The number of data points, i.e., bins, is denoted as n. F is distributed according to the F-distribution with df_1 = p_2 - p_1 and df_2 = n - p_2.
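In R, this nested-model comparison corresponds to what anova() computes for two lm() fits, for example (an illustrative sketch, not the selection routine of score_spectrum):

# F-test comparing a simpler (degree 1) with a more complex (degree 2) model;
# a low p-value favors the more complex model
fit_1 <- lm(s ~ poly(b, degree = 1))
fit_2 <- lm(s ~ poly(b, degree = 2))
anova(fit_1, fit_2)  # reports the F-statistic and its p-value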
A list object of class SpectrumScore with the following components:
adj_r_squared | adjusted R^2 of polynomial model |
degree | maximum degree of polynomial |
residuals | residuals of polynomial model |
slope | coefficient of the linear term of the polynomial model (spectrum "direction") |
f_statistic | statistic of the F-test |
f_statistic_p_value | p-value of F-test |
consistency_score | normalized sum of deviance between the linear interpolation of the scores of two adjoining bins and the score of the middle bin, for each position in the spectrum |
consistency_score_p_value | obtained by Monte Carlo sampling (randomly permuting the coordinates of the scores vector) |
consistency_score_n | number of permutations |
plot | spectrum plot |
Other SPMA functions:
classify_spectrum(), run_kmer_spma(), run_matrix_spma(), subdivide_data()
library(transite)  # package providing score_spectrum() and the other SPMA functions

# random spectrum
score_spectrum(runif(n = 40, min = -1, max = 1), max_model_degree = 1)
# two random spectrums with harmonized color scales
plot(score_spectrum(runif(n = 40, min = -1, max = 1), max_model_degree = 1,
x_value_limits = c(-2.0, 2.0)))
plot(score_spectrum(runif(n = 40, min = -2, max = 2), max_model_degree = 1,
x_value_limits = c(-2.0, 2.0)))
# random spectrum with p-values
score_spectrum(runif(n = 40, min = -1, max = 1),
p_values = runif(n = 40, min = 0, max = 1),
max_model_degree = 1)
# random spectrum with sorted transcript values
log_fold_change <- log(runif(n = 1000, min = 0, max = 1) /
runif(n = 1000, min = 0, max = 1))
score_spectrum(runif(n = 40, min = -1, max = 1),
sorted_transcript_values = sort(log_fold_change),
max_model_degree = 1)
# non-random linear spectrum
signal <- seq(-1, 0.99, 2 / 40)
noise <- rnorm(n = 40, mean = 0, sd = 0.5)
score_spectrum(signal + noise, max_model_degree = 1,
max_cs_permutations = 100000)
# non-random quadratic spectrum
signal <- seq(-1, 0.99, 2 / 40)^2 - 0.5
noise <- rnorm(n = 40, mean = 0, sd = 0.2)
score_spectrum(signal + noise, max_model_degree = 2,
max_cs_permutations = 100000)