suppressPackageStartupMessages({ suppressMessages({ library(multtest) library(bayesm) library(forestplot) library(ggplot2) library(CSHstats) library(dplyr) library(knitr) }) })
Null hypothesis testing framework: we typically aim to assess whether an intervention affects a parameter value for a certain population
Type I error: reject the null when it is actually true
Type II error: fail to reject null when it is actually false
The p-value for a test is the probability of observing the statistic seen in the experiment or any more extreme value of the statistic given that the null hypothesis is true.
binom.test(27, 100, 1/4)
This two-sided p-value is obtained via
sum(dbinom(27:100, 100, .25))+sum(dbinom(1:22,100, .25))
because all the results of 27 or more hearts seen, or 22 or fewer hearts seen are "as or more extreme" as what we have observed.
Because the p-value is a probability, its value reflects uncertainty in our assessment of the relation of the data to null hypothesis. Large p-values suggest that there is not much reason to use the data to reject the null hypothesis; small p-values suggest that either the null hypothesis is false or a very rare event has occurred.
If we observed 30 hearts in 100 top-card draws, our test report would be:
binom.test(30, 100, 1/4)
The confidence interval is a random interval derived from the data that has the property that it will include the true value of the population parameter of interest with a specified probability.
stats = c(27:42) tests = lapply(stats, function(x) binom.test(x, 100, .25)) low = sapply(tests, function(x) x$conf.int[1]) hi = sapply(tests, function(x) x$conf.int[2]) ests = sapply(tests, "[[", "statistic") plot(stats, ests, ylim=c(0, .6), main="95% Confidence intervals", xlab="# hearts seen in 100 draws", ylab="estimated proportion of hearts") segments( stats, low, stats, hi) abline(h=.25, lty=2)
Exercise: For what value of the statistic would you reject the null hypothesis of probability 1/4 for heart?
Exercise: Use 99% as the coverage probability and produce the display.
Everything we've seen thus far looks at a single test in various ways.
A hallmark of work in genomics is the necessity of performing many tests, because of the large number of features and hypotheses in play.
Fact: When many tests are conducted on true null hypotheses, the distribution of the collection of p-values thereby obtained is uniform on (0,1].
To illustrate this we will produce 10000 samples from N(100,1) and conduct the t test with null mean 100, yielding 10000 p-values.
many_x = replicate(10000, rnorm(10, 100)) many_p = apply(many_x, 2, function(x) t.test(x, mu=100)$p.value) plot(density(many_p, from=0, to=1))
This display shows a limitation of density estimation on a fixed interval -- a "boundary effect", because data "up against" the boundary get sparser as you get closer to the boundary. But it is consistent with the property of a uniform distribution on [0,1]: the true density is 1.
The concept of Type I error is very general, and could be deployed to define operating characteristics for any kind of inference procedure.
This table is frequently seen in discussions of multiple testing, it is from Holmes and Huber's Modern Statistics for Modern Biology.
Suppose we have $m$ hypotheses to test and we wish the overall probability of our family of tests to have Type I error rate $\alpha$. Using the table symbols, this is $Pr(V>0) < \alpha$.
Bonferroni's method is to declare significance only for those tests with $p$-value less than $\alpha/m$.
This procedure (and related improvements for FWER control) also implies a transformation of $p$-values, that we will see shortly.
Referring to the table above, the false discovery rate (FDR) is the expected value of $V/max(R,1)$. If $R = 0$, $V = 0$ and FDR is zero. Otherwise, we can think of FDR as a kind of budget: we will reject $R$ hypotheses and our expection is that $V$ will be false. If we are "following up" with confirmatory experiments on $R$ genes and can afford "wasting" $V$ tests, we would accept an FDR of $V/R$.
Of course these are probabilistic characteristics of procedures that hold with stated probabilities when assumptions are satisfied.
Since we are so accustomed to the single-number summary $p$, it is typical to base rejection decisions on transformed $p$-values. The multtest package takes care of this.
library(multtest) options(digits=3) tab = mt.rawp2adjp(many_p) head(tab[[1]],10)
By default, mt.rawp2adp
produces a number of adjustments -- we are concerned
at present with the columns rawp (the p-values sorted from lowest on), Bonferroni
(which gives the transformation $p$ to min(1, mp)$ for $m$ tests),
and BH (the Benjamini-Hochberg FDR transformation). The implication
of this table is that with FWER control at 0.05, our data
would not support rejection of any of our
hypotheses (all of which assert that the mean of a sample of 10 random
normals is equal to 100). Likewise with FDR control at any value up to
0.77. However, we see that had we accepted an FDR control at 0.8,
we could reject 8 hypotheses.
Exercise: replace the first p-value in many_p
with 1e-6
and obtain the
associated adjusted p-values.
A useful review is available. This section addresses only one of the methods in the review, which is found to have good performance, though in some contexts the authors' alternatives appear superior.
We need the IHW package and will do some visualization as well.
library(CSHstats) library(ggplot2) library(plotly) library(IHW) library(multtest)
Here's the key resource with p-values that we want to interpret.
data(gtex_exc_chr20_b38) gtdat = gtex_exc_chr20_b38 head(gtdat,3)
What we have here are variant-expression association tests conducted in the GTEx project. Thus this is a small excerpt from a collection of "GWAS", in which the outcome is tissue-specific expression of a gene, and predictors are formed by SNP genotypes and adjustments carried out in the GTEx analysis pipeline.
Here's a composite manhattan plot:
pl1 = ggplot(gtdat[gtdat$p<.1,], aes(x=pos_b38, y=-log10(p), colour=Gene)) + geom_point() #ggplotly(pl1) pl1
The eaf
field is the frequency with which the "effect allele" was
reported by SNP-Nexus.
covariate1 = cut(gtdat$eaf, c(0,.01,.05,.1,.3,.6)) table(covariate1)
We run the IHW algorithm using the variant frequency strata, with a conjecture that very rare or very common variants may be downweighted in a way that is advantageous to discovery.
demo1 = ihw(gtdat$p, covariate1, alpha=0.05) basic_fdr = mt.rawp2adjp(gtdat$p)
The weighting procedure involves random resampling from the data, to reduce risk of overfitting, and to allow assessment of consistency across multiple "folds" or random subsets.
plot(demo1)
Compare the weighted, unweighted, and conventional rejection numbers:
newp = adj_pvalues(demo1) sum(newp < 0.05) sum(basic_fdr$adjp[,"BH"] < 0.05) sum(gtdat$p < 5e-8)
dist2gene = abs(gtdat$pos_b38-gtdat$start) covariate2 = cut(dist2gene, 5) ok = which(!is.na(covariate2)) demo2 = ihw(gtdat$p[ok], covariate2[ok], alpha=0.05) plot(demo2) newp2 = adj_pvalues(demo2) sum(newp2 < 0.05)
Various comments and references in this stackexchange address the relationship between inferential framework and approach to multiple testing. A paper by Gelman and Loken surveys the concept in an illuminating way.
A paper in the European Journal of Epidemiology presents considerations on multiple comparisons from a Bayesian perspective, arguing that the issue is "surrounded by confusion and controversy".
Here is a simple approach to visualizing 20 tests, assumed independent, in a frequenist framework, focusing on 95% confidence intervals.
# from https://link.springer.com/content/pdf/10.1007/s10654-019-00517-2.pdf # Sjolander and vanSteenlandt 2019 Eur J Epi library(forestplot) library(bayesm) set.seed(1) n = 10 J = 20 beta = 0 Y = matrix(rnorm(n*J, mean=beta), nrow=n, ncol=J) est = colMeans(Y) se = apply(X=Y, MARGIN=2, FUN=sd)/sqrt(n) q = qt(p=0.975, df=n-1) ci.l = est-q*se ci.u = est+q*se forestplot( labeltext=rep("", J), mean=est, lower=ci.l, upper=ci.u, ci.vertices=TRUE, boxsize=0.3, txt_gp=fpTxtGp(cex=2), xticks=seq(-1.5,1.5,.5), title="Frequentist 95% confidence intervals")
One hypothesis is rejected.
The authors also use the bayesm
package to produce
Bayesian credible intervals under the assumption that all tests are independent.
# from https://link.springer.com/content/pdf/10.1007/s10654-019-00517-2.pdf # Sjolander and vanSteenlandt 2019 Eur J Epi library(forestplot) library(bayesm) set.seed(1) n = 10 J = 20 beta = 0 Y = matrix(rnorm(n*J, mean=beta), nrow=n, ncol=J) X = matrix(0, nrow=J*n, ncol=J) for (j in 1:J) { X[(n*(j-1)+1):(n*j), j] = 1 } Ainv = diag(J) # cor(betaj, betak) = 0 fit = runireg( Data = list(y=as.vector(Y), X=X), Prior=list(A=solve(Ainv)), Mcmc = list(R=10000, nprint=0)) q = apply(X=fit$betadraw, MARGIN=2, FUN=quantile, probs=c(.025, .5, .975)) est = q[2,] ci.l = q[1,] ci.u = q[3,]
forestplot( labeltext=rep("", J), mean=est, lower=ci.l, upper=ci.u, ci.vertices=TRUE, boxsize=0.3, txt_gp=fpTxtGp(cex=2), xticks=seq(-1.5,1.5,.5), title="Bayesian credible intervals, assume ind.")
Finally, allowing high correlatedness among test statistics, the Bayesian approach is found to avoid any false rejections.
# from https://link.springer.com/content/pdf/10.1007/s10654-019-00517-2.pdf # Sjolander and vanSteenlandt 2019 Eur J Epi library(forestplot) library(bayesm) set.seed(1) n = 10 J = 20 beta = 0 Y = matrix(rnorm(n*J, mean=beta), nrow=n, ncol=J) X = matrix(0, nrow=J*n, ncol=J) for (j in 1:J) { X[(n*(j-1)+1):(n*j), j] = 1 } Ainv = diag(J)*(1-.95)+.95 # cor(betaj, betak) = .95 fit = runireg( Data = list(y=as.vector(Y), X=X), Prior=list(A=solve(Ainv)), Mcmc = list(R=10000)) q = apply(X=fit$betadraw, MARGIN=2, FUN=quantile, probs=c(.025, .5, .975)) est = q[2,] ci.l = q[1,] ci.u = q[3,]
forestplot( labeltext=rep("", J), mean=est, lower=ci.l, upper=ci.u, ci.vertices=TRUE, boxsize=0.3, txt_gp=fpTxtGp(cex=2), xticks=seq(-1.5,1.5,.5), title="Bayesian, assume correlated tests")
The Bayesian computations involve Monte Carlo Markov Chain simulation. Background can be obtained from Statistical Rethinking by R. McElreath.
Let's return to the TCGA ACC FOS and EGR1 data that we used to study correlation.
data(fos_ex) data(egr1_ex) fedf = data.frame(FOS=fos_ex, EGR1=egr1_ex) library(ggplot2) ggplot(fedf, aes(x=EGR1, y=FOS)) + geom_point()
We found that log transformation was useful for ameliorating the overplotting near the origin:
ggplot(fedf, aes(x=EGR1, y=FOS)) + geom_point() + scale_x_log10() + scale_y_log10()
The ACC data are annotated in a MultiAssayExperiment, and a mutation summary called OncoSign is available.
library(MultiAssayExperiment) library(CSHstats) data(accex) accex table(accex$OncoSign, exclude=NULL)
We will focus on two OncoSign classes, asking whether average log EGR1 expression is different between them.
hasonc = which(!is.na(accex$OncoSign)) ndf = data.frame(logEGR1 = log(egr1_ex[hasonc]), oncosign=accex$OncoSign[hasonc]) tpvtert = ndf |> dplyr::filter(oncosign %in% c("TP53/NF1", "TERT/ZNRF3")) ggplot(tpvtert, aes(y=logEGR1, x=factor(oncosign))) + geom_boxplot()
options(digits=6) t.test(logEGR1~oncosign, data=tpvtert, var.equal=TRUE) t.test(logEGR1~oncosign, data=tpvtert, var.equal=FALSE) # more general
The general form of linear model in statistics has "dependent variable" $y$ $$ y = \alpha + x' \beta + e $$ where $x$ is a p-vector of "covariates" -- measurements that are believed to affect the mean value of $y$, also known as "independent variables", and $e$ is a zero-mean disturbance or residual with a fixed dispersion. The parameters of interest are $\alpha$ (often called intercept) and $\beta$. Parameter interpretation in terms of units of the response relative to differences between samples with different values of $x$ should always be explicit.
mm = model.matrix(~oncosign, data=tpvtert) head(mm) tail(mm)
lm
tpvtert |> lm(logEGR1~oncosign, data=_) |> summary()
Exercise: Which t test is recovered in this analysis?
Exercise: Conduct the associated Wilcoxon test.
Exercise: Conduct the Gaussian rank test mentioned in the Gelman blog "Don't do the Wilcoxon".
Extra goody: To produce the coefficients "by hand":
solve(t(mm)%*%mm)%*%t(mm)%*%tpvtert$logEGR1
programs $(X'X)^{-1}X'y$
The analysis of variance is focused on comparison of multiple groups. The EGR1-OncoSign relationship is a reasonable example where this could be used.
ggplot(ndf, aes(y=logEGR1, x=factor(oncosign))) + geom_boxplot() nda = aov(logEGR1~factor(oncosign), data=ndf) nda summary(nda)
More common applications of linear modeling take advantage of continuity of a predictor. For a simple linear regression (one continuous covariate), the interpretation of $\beta$ is: the difference in mean of $y$ associated with a unit difference in $x$.
Regarding log EGR1 as a predictor of log FOS, we have
m1 = lm(log(fos_ex)~log(egr1_ex)) summary(m1) plot(m1)
The family of generalized linear models (GLM) embraces a number of approaches to statistical inference on associations between non-Gaussian responses and general configurations of covariates.
We'll work with FOS and EGR1 in an artificial way. We'll suppose that we have criteria for "activation":
library(CSHstats) data(fos_ex) data(egr1_ex) fos_act = 1*(fos_ex > 1000) egr1_act = 1*(egr1_ex > 2000)
We form a 2 x 2 cross classification of tumors using these definitions:
acttab = table(fos_act=fos_act, egr1_act=egr1_act) acttab
A basic question is whether the two events "EGR1 activated" and "FOS activated" are statistically associated in ACC tumors.
Fisher's exact test evaluates the hypergeometric distribution of the (1,1) cell of the table, given the marginal totals.
fisher.test(acttab, alternative="greater") # one-sided sum(dhyper(26:31, 51, 28, 31))
The odds of an event that occurs with probability $p$ is given by $p/(1-p)$. Abbreviate to $odds(p)$
The odds ratio for two events with probabilities $p_1$ and $p_2$
is $odds(p_1)/odds(p_2)$. With a 2x2 table this can be estimated with
the cross ratio "ad/bc". But the odds ratio estimated in fisher.test
is different, a conditional maximum likelihood estimate. The difference
is seldom meaningful.
odds = function(x) {p=mean(x); p/(1-p)} odds(fos_act[egr1_act==TRUE])/odds(fos_act[egr1_act==FALSE]) acttab[1,1]*acttab[2,2]/(acttab[1,2]*acttab[2,1])
We model the conditional probability that FOS is activated, depending on whether or note EGR1 is activated, using the definitions given above. The model has the form
$$ \mbox{logit} Pr(FOS > 1000|EGR1>2000) = \alpha + \beta x $$
Here logit(p) = log(p/(1-p)) is called a "link function". It transforms the outcome of interest (a probability) from the interval (0,1) to the whole real line. The interpretation of $\beta$ is: "log odds ratio". If we exponentiate beta, we obtain the effect of a change in x of one unit on the odds of the response event.
g1 = glm(fos_act ~ egr1_act, family=binomial()) summary(g1)
Parameter interpretation. The intercept parameter has value
r round(g1$coef[1],4)
, which is the logit of the probability
that FOS is activated, given that EGR1 is NOT activated.
The "beta" parameter is the log odds ratio (comparing probability of FOS activation between the EGR1 conditions -- not activated and activated).
log(4.784)
Fisher's exact test and simple logistic regression are seen to be consistent with one another in this example. The advantage of the logistic regression framework is that more potential sources of variation in probability of response can be incorporated.
We have data on YY1 expression in ACC and will declare it activated when expressed at a level greater than 1500.
data(yy1_ex) yy1_act = yy1_ex > 1500 g2 = glm(fos_act ~ egr1_act+yy1_act, family=binomial()) summary(g2)
The inclusion of information on YY1 does not seem to be informative on FOS activation.
The family of generalized linear models can accommodate many forms of variation in a response of interest, to which are associated a number of "predictor" variables.
glmtab = structure(list(dists = c("gaussian", "binomial", "poisson", "gamma", "negbin"), links = c("identity", "logit", "log", "reciprocal", "log"), varfuns = c("const", "np(1-p)", "mu", "mu^2", "mu+mu^2/theta" )), class = "data.frame", row.names = c(NA, -5L))
kable(glmtab)
DESeq2 and edgeR use the negative binomial GLM for inference on differential
expression in RNA-seq. You will hear much more about this. The glm.nb
function
in the MASS package allows you to fit negative binomial regressions with
data in data.frames.
A very strong assumption in the use of linear and generalized linear models is the mutual independence of reponses. When clustering of responses is present (for example, when analyzing data collected on human families, or pairs of eyes, or mice in litters), additional model structure should be introduced to accommodate lack of independence.
A framework approach developed in 1982 (Laird and Ware) expands the linear model to a model known as "mixed effects":
$$ y_{ij} = \alpha + \beta x_{ij} + a_i + e_{ij} $$
with $i$ indexing clusters and $j$ responses within clusters. The "random effect" $a_i$ is assumed to follow $N(0, \sigma^2_b)$ (between-cluster variance), and is independent of $e_{ij} \sim N(0, \sigma^2_w)$.
The intraclass correlation coefficient measures the departure from independence, and has form $\rho = \sigma^2_b/(\sigma^2_b + \sigma^2_w)$.
Approximate mixed effects models for the GLM families can be fit using glmmPQL in MASS.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.