In vjcitn/CSHstats: discussions and exercises on classical statistics for CSHL 2022

suppressPackageStartupMessages({
 suppressMessages({
  library(multtest)
  library(bayesm)
  library(forestplot)
  library(ggplot2)
  library(CSHstats)
  library(dplyr)
  library(knitr)
})
})

Review of some inference concepts

Error rates: size and power

Null hypothesis testing framework: we typically aim to assess whether an intervention affects a parameter value for a certain population
- does the average value of some quantity change after treatment?
- does changing the composition of a deck of cards affect the probability of seeing a given suit when drawing a card?
Type I error: reject the null when it is actually true
Type II error: fail to reject null when it is actually false

p-value: a condensation of the data and assumptions

The p-value for a test is the probability of observing the statistic seen in the experiment or any more extreme value of the statistic given that the null hypothesis is true.

binom.test(27, 100, 1/4)

This two-sided p-value is obtained via

sum(dbinom(27:100, 100, .25))+sum(dbinom(1:22,100, .25))

because all the results of 27 or more hearts seen, or 22 or fewer hearts seen are "as or more extreme" as what we have observed.

Because the p-value is a probability, its value reflects uncertainty in our assessment of the relation of the data to null hypothesis. Large p-values suggest that there is not much reason to use the data to reject the null hypothesis; small p-values suggest that either the null hypothesis is false or a very rare event has occurred.

Confidence interval: an alternative expression of uncertainty

If we observed 30 hearts in 100 top-card draws, our test report would be:

binom.test(30, 100, 1/4)

The confidence interval is a random interval derived from the data that has the property that it will include the true value of the population parameter of interest with a specified probability.

stats = c(27:42)
tests = lapply(stats, function(x) binom.test(x, 100, .25))
low = sapply(tests, function(x) x$conf.int[1])
hi = sapply(tests, function(x) x$conf.int[2])
ests = sapply(tests, "[[", "statistic")
plot(stats, ests, ylim=c(0, .6), main="95% Confidence intervals", xlab="# hearts seen in 100 draws",
  ylab="estimated proportion of hearts")
segments( stats, low, stats, hi)
abline(h=.25, lty=2)

Exercise: For what value of the statistic would you reject the null hypothesis of probability 1/4 for heart?

Exercise: Use 99% as the coverage probability and produce the display.

Multiple comparisons

Everything we've seen thus far looks at a single test in various ways.

A hallmark of work in genomics is the necessity of performing many tests, because of the large number of features and hypotheses in play.

Fact: When many tests are conducted on true null hypotheses, the distribution of the collection of p-values thereby obtained is uniform on (0,1].

To illustrate this we will produce 10000 samples from N(100,1) and conduct the t test with null mean 100, yielding 10000 p-values.

many_x = replicate(10000, rnorm(10, 100))
many_p = apply(many_x, 2, function(x) t.test(x, mu=100)$p.value)
plot(density(many_p, from=0, to=1))

This display shows a limitation of density estimation on a fixed interval -- a "boundary effect", because data "up against" the boundary get sparser as you get closer to the boundary. But it is consistent with the property of a uniform distribution on [0,1]: the true density is 1.

Bonferroni's correction to achieve Family-Wise Error Rate (FWER) control

The concept of Type I error is very general, and could be deployed to define operating characteristics for any kind of inference procedure.

This table is frequently seen in discussions of multiple testing, it is from Holmes and Huber's Modern Statistics for Modern Biology.

decisiontab

Suppose we have $m$ hypotheses to test and we wish the overall probability of our family of tests to have Type I error rate $\alpha$. Using the table symbols, this is $Pr(V>0) < \alpha$.

Bonferroni's method is to declare significance only for those tests with $p$-value less than $\alpha/m$.

This procedure (and related improvements for FWER control) also implies a transformation of $p$-values, that we will see shortly.

False discovery rate

Referring to the table above, the false discovery rate (FDR) is the expected value of $V/max(R,1)$. If $R = 0$, $V = 0$ and FDR is zero. Otherwise, we can think of FDR as a kind of budget: we will reject $R$ hypotheses and our expection is that $V$ will be false. If we are "following up" with confirmatory experiments on $R$ genes and can afford "wasting" $V$ tests, we would accept an FDR of $V/R$.

Of course these are probabilistic characteristics of procedures that hold with stated probabilities when assumptions are satisfied.

Adjusted p-values

Since we are so accustomed to the single-number summary $p$, it is typical to base rejection decisions on transformed $p$-values. The multtest package takes care of this.

library(multtest)
options(digits=3)
tab = mt.rawp2adjp(many_p)
head(tab[[1]],10)

By default, mt.rawp2adp produces a number of adjustments -- we are concerned at present with the columns rawp (the p-values sorted from lowest on), Bonferroni (which gives the transformation $p$ to min(1, mp)$ for $m$ tests), and BH (the Benjamini-Hochberg FDR transformation). The implication of this table is that with FWER control at 0.05, our data would not support rejection of any of our hypotheses (all of which assert that the mean of a sample of 10 random normals is equal to 100). Likewise with FDR control at any value up to 0.77. However, we see that had we accepted an FDR control at 0.8, we could reject 8 hypotheses.

Exercise: replace the first p-value in many_p with 1e-6 and obtain the associated adjusted p-values.

Adjustments via weighted FDR

A useful review is available. This section addresses only one of the methods in the review, which is found to have good performance, though in some contexts the authors' alternatives appear superior.

Setup

We need the IHW package and will do some visualization as well.

 library(CSHstats)
 library(ggplot2)
 library(plotly)
 library(IHW)
 library(multtest)

Here's the key resource with p-values that we want to interpret.

data(gtex_exc_chr20_b38)
gtdat = gtex_exc_chr20_b38
head(gtdat,3)

What we have here are variant-expression association tests conducted in the GTEx project. Thus this is a small excerpt from a collection of "GWAS", in which the outcome is tissue-specific expression of a gene, and predictors are formed by SNP genotypes and adjustments carried out in the GTEx analysis pipeline.

Here's a composite manhattan plot:

pl1 = ggplot(gtdat[gtdat$p<.1,], aes(x=pos_b38, y=-log10(p), colour=Gene)) + geom_point()
#ggplotly(pl1)
pl1

Demonstration 1: Use allele frequency to form strata for FDR weighting

The eaf field is the frequency with which the "effect allele" was reported by SNP-Nexus.

covariate1 = cut(gtdat$eaf, c(0,.01,.05,.1,.3,.6))
table(covariate1)

We run the IHW algorithm using the variant frequency strata, with a conjecture that very rare or very common variants may be downweighted in a way that is advantageous to discovery.

demo1 = ihw(gtdat$p, covariate1, alpha=0.05)
basic_fdr = mt.rawp2adjp(gtdat$p)

The weighting procedure involves random resampling from the data, to reduce risk of overfitting, and to allow assessment of consistency across multiple "folds" or random subsets.

plot(demo1)

Compare the weighted, unweighted, and conventional rejection numbers:

newp = adj_pvalues(demo1)
sum(newp < 0.05)
sum(basic_fdr$adjp[,"BH"] < 0.05)
sum(gtdat$p < 5e-8)

Demonstration 2: Use distance to gene to weight

dist2gene = abs(gtdat$pos_b38-gtdat$start)
covariate2 = cut(dist2gene, 5)
ok = which(!is.na(covariate2))
demo2 = ihw(gtdat$p[ok], covariate2[ok], alpha=0.05)
plot(demo2)
newp2 = adj_pvalues(demo2)
sum(newp2 < 0.05)

Optional: Considerations on Bayesian multiple testing

Various comments and references in this stackexchange address the relationship between inferential framework and approach to multiple testing. A paper by Gelman and Loken surveys the concept in an illuminating way.

A paper in the European Journal of Epidemiology presents considerations on multiple comparisons from a Bayesian perspective, arguing that the issue is "surrounded by confusion and controversy".

Here is a simple approach to visualizing 20 tests, assumed independent, in a frequenist framework, focusing on 95% confidence intervals.

# from https://link.springer.com/content/pdf/10.1007/s10654-019-00517-2.pdf
# Sjolander and vanSteenlandt 2019 Eur J Epi

library(forestplot)
library(bayesm)
set.seed(1)

n = 10
J = 20
beta = 0
Y = matrix(rnorm(n*J, mean=beta), nrow=n, ncol=J)
est = colMeans(Y)
se = apply(X=Y, MARGIN=2, FUN=sd)/sqrt(n)
q = qt(p=0.975, df=n-1)
ci.l = est-q*se
ci.u = est+q*se

forestplot(
 labeltext=rep("", J),
 mean=est, lower=ci.l,
 upper=ci.u, ci.vertices=TRUE,
 boxsize=0.3, txt_gp=fpTxtGp(cex=2),
 xticks=seq(-1.5,1.5,.5), title="Frequentist 95% confidence intervals")

One hypothesis is rejected.

The authors also use the bayesm package to produce Bayesian credible intervals under the assumption that all tests are independent.

# from https://link.springer.com/content/pdf/10.1007/s10654-019-00517-2.pdf
# Sjolander and vanSteenlandt 2019 Eur J Epi

library(forestplot)
library(bayesm)
set.seed(1)

n = 10
J = 20
beta = 0

Y = matrix(rnorm(n*J, mean=beta), nrow=n, ncol=J)

X = matrix(0, nrow=J*n, ncol=J)
for (j in 1:J) {
  X[(n*(j-1)+1):(n*j), j] = 1
  }
Ainv = diag(J) # cor(betaj, betak) = 0

fit = runireg( Data = list(y=as.vector(Y),
   X=X), Prior=list(A=solve(Ainv)),
   Mcmc = list(R=10000, nprint=0))

q = apply(X=fit$betadraw,
  MARGIN=2, FUN=quantile, probs=c(.025, .5, .975))
est = q[2,]
ci.l = q[1,]
ci.u = q[3,]

forestplot(
 labeltext=rep("", J),
 mean=est, lower=ci.l,
 upper=ci.u, ci.vertices=TRUE,
 boxsize=0.3, txt_gp=fpTxtGp(cex=2),
 xticks=seq(-1.5,1.5,.5), title="Bayesian credible intervals, assume ind.")

Finally, allowing high correlatedness among test statistics, the Bayesian approach is found to avoid any false rejections.

# from https://link.springer.com/content/pdf/10.1007/s10654-019-00517-2.pdf
# Sjolander and vanSteenlandt 2019 Eur J Epi

library(forestplot)
library(bayesm)
set.seed(1)

n = 10
J = 20
beta = 0

Y = matrix(rnorm(n*J, mean=beta), nrow=n, ncol=J)

X = matrix(0, nrow=J*n, ncol=J)
for (j in 1:J) {
  X[(n*(j-1)+1):(n*j), j] = 1
  }
Ainv = diag(J)*(1-.95)+.95 # cor(betaj, betak) = .95

fit = runireg( Data = list(y=as.vector(Y),
   X=X), Prior=list(A=solve(Ainv)),
   Mcmc = list(R=10000))

q = apply(X=fit$betadraw,
  MARGIN=2, FUN=quantile, probs=c(.025, .5, .975))
est = q[2,]
ci.l = q[1,]
ci.u = q[3,]

forestplot(
 labeltext=rep("", J),
 mean=est, lower=ci.l,
 upper=ci.u, ci.vertices=TRUE,
 boxsize=0.3, txt_gp=fpTxtGp(cex=2),
 xticks=seq(-1.5,1.5,.5), title="Bayesian, assume correlated tests")

The Bayesian computations involve Monte Carlo Markov Chain simulation. Background can be obtained from Statistical Rethinking by R. McElreath.

Linear models

Data setup

Let's return to the TCGA ACC FOS and EGR1 data that we used to study correlation.

data(fos_ex)
data(egr1_ex)
fedf = data.frame(FOS=fos_ex, EGR1=egr1_ex)
library(ggplot2)
ggplot(fedf, aes(x=EGR1, y=FOS)) + geom_point()

We found that log transformation was useful for ameliorating the overplotting near the origin:

ggplot(fedf, aes(x=EGR1, y=FOS)) + geom_point() + scale_x_log10() + scale_y_log10()

The ACC data are annotated in a MultiAssayExperiment, and a mutation summary called OncoSign is available.

library(MultiAssayExperiment)
library(CSHstats)
data(accex)
accex
table(accex$OncoSign, exclude=NULL)

Re-expressing the two-sample test

We will focus on two OncoSign classes, asking whether average log EGR1 expression is different between them.

hasonc = which(!is.na(accex$OncoSign))
ndf = data.frame(logEGR1 = log(egr1_ex[hasonc]), oncosign=accex$OncoSign[hasonc])
tpvtert = ndf |> dplyr::filter(oncosign %in% c("TP53/NF1", "TERT/ZNRF3"))
ggplot(tpvtert, aes(y=logEGR1, x=factor(oncosign))) + geom_boxplot()

There are two t tests

options(digits=6)
t.test(logEGR1~oncosign, data=tpvtert, var.equal=TRUE)
t.test(logEGR1~oncosign, data=tpvtert, var.equal=FALSE) # more general

The linear model has one of these as a special case

The general form of linear model in statistics has "dependent variable" $y$ $$ y = \alpha + x' \beta + e $$ where $x$ is a p-vector of "covariates" -- measurements that are believed to affect the mean value of $y$, also known as "independent variables", and $e$ is a zero-mean disturbance or residual with a fixed dispersion. The parameters of interest are $\alpha$ (often called intercept) and $\beta$. Parameter interpretation in terms of units of the response relative to differences between samples with different values of $x$ should always be explicit.

A dummy variable design matrix

mm = model.matrix(~oncosign, data=tpvtert)
head(mm)
tail(mm)

Recovering the t test with `lm`

tpvtert |> lm(logEGR1~oncosign, data=_) |> summary()

Exercise: Which t test is recovered in this analysis?

Exercise: Conduct the associated Wilcoxon test.

Exercise: Conduct the Gaussian rank test mentioned in the Gelman blog "Don't do the Wilcoxon".

Extra goody: To produce the coefficients "by hand":

solve(t(mm)%*%mm)%*%t(mm)%*%tpvtert$logEGR1

programs $(X'X)^{-1}X'y$

Analysis of variance: F test

The analysis of variance is focused on comparison of multiple groups. The EGR1-OncoSign relationship is a reasonable example where this could be used.

ggplot(ndf, aes(y=logEGR1, x=factor(oncosign))) + geom_boxplot()
nda = aov(logEGR1~factor(oncosign), data=ndf)
nda
summary(nda)

Linear regression, parameter estimation, goodness of fit

More common applications of linear modeling take advantage of continuity of a predictor. For a simple linear regression (one continuous covariate), the interpretation of $\beta$ is: the difference in mean of $y$ associated with a unit difference in $x$.

Regarding log EGR1 as a predictor of log FOS, we have

m1 = lm(log(fos_ex)~log(egr1_ex))
summary(m1)
plot(m1)

Confounding

GLMs: binary, counted, non-Gaussian responses

The family of generalized linear models (GLM) embraces a number of approaches to statistical inference on associations between non-Gaussian responses and general configurations of covariates.

Prologue: Fisher's exact test for 2 x 2 table

We'll work with FOS and EGR1 in an artificial way. We'll suppose that we have criteria for "activation":

FOS is activated if our RNA-seq measurement in ACC exceeds 1000
EGR1 is activated if our RNA-seq measurement in ACC exceeds 2000

library(CSHstats)
data(fos_ex)
data(egr1_ex)
fos_act = 1*(fos_ex > 1000)
egr1_act = 1*(egr1_ex > 2000)

We form a 2 x 2 cross classification of tumors using these definitions:

acttab = table(fos_act=fos_act, egr1_act=egr1_act)
acttab

A basic question is whether the two events "EGR1 activated" and "FOS activated" are statistically associated in ACC tumors.

Fisher's exact test evaluates the hypergeometric distribution of the (1,1) cell of the table, given the marginal totals.

fisher.test(acttab, alternative="greater") # one-sided
sum(dhyper(26:31, 51, 28, 31))

The odds of an event that occurs with probability $p$ is given by $p/(1-p)$. Abbreviate to $odds(p)$

The odds ratio for two events with probabilities $p_1$ and $p_2$ is $odds(p_1)/odds(p_2)$. With a 2x2 table this can be estimated with the cross ratio "ad/bc". But the odds ratio estimated in fisher.test is different, a conditional maximum likelihood estimate. The difference is seldom meaningful.

odds = function(x) {p=mean(x); p/(1-p)}
odds(fos_act[egr1_act==TRUE])/odds(fos_act[egr1_act==FALSE])
acttab[1,1]*acttab[2,2]/(acttab[1,2]*acttab[2,1])

Binary response: logistic regression

We model the conditional probability that FOS is activated, depending on whether or note EGR1 is activated, using the definitions given above. The model has the form

$$ \mbox{logit} Pr(FOS > 1000|EGR1>2000) = \alpha + \beta x $$

Here logit(p) = log(p/(1-p)) is called a "link function". It transforms the outcome of interest (a probability) from the interval (0,1) to the whole real line. The interpretation of $\beta$ is: "log odds ratio". If we exponentiate beta, we obtain the effect of a change in x of one unit on the odds of the response event.

g1 = glm(fos_act ~ egr1_act, family=binomial())
summary(g1)

Parameter interpretation. The intercept parameter has value r round(g1$coef[1],4), which is the logit of the probability that FOS is activated, given that EGR1 is NOT activated.

The "beta" parameter is the log odds ratio (comparing probability of FOS activation between the EGR1 conditions -- not activated and activated).

log(4.784)

Multiple logistic regression

Fisher's exact test and simple logistic regression are seen to be consistent with one another in this example. The advantage of the logistic regression framework is that more potential sources of variation in probability of response can be incorporated.

We have data on YY1 expression in ACC and will declare it activated when expressed at a level greater than 1500.

data(yy1_ex)
yy1_act = yy1_ex > 1500
g2 = glm(fos_act ~ egr1_act+yy1_act, family=binomial())
summary(g2)

The inclusion of information on YY1 does not seem to be informative on FOS activation.

General case: link and variance functions

The family of generalized linear models can accommodate many forms of variation in a response of interest, to which are associated a number of "predictor" variables.

glmtab = structure(list(dists = c("gaussian", "binomial", "poisson", "gamma", 
"negbin"), links = c("identity", "logit", "log", "reciprocal", 
"log"), varfuns = c("const", "np(1-p)", "mu", "mu^2", "mu+mu^2/theta"
)), class = "data.frame", row.names = c(NA, -5L))

kable(glmtab)

DESeq2 and edgeR use the negative binomial GLM for inference on differential expression in RNA-seq. You will hear much more about this. The glm.nb function in the MASS package allows you to fit negative binomial regressions with data in data.frames.

Hierarchical models: alternatives to independence

A very strong assumption in the use of linear and generalized linear models is the mutual independence of reponses. When clustering of responses is present (for example, when analyzing data collected on human families, or pairs of eyes, or mice in litters), additional model structure should be introduced to accommodate lack of independence.

Random effects models for clustered observations

A framework approach developed in 1982 (Laird and Ware) expands the linear model to a model known as "mixed effects":

$$ y_{ij} = \alpha + \beta x_{ij} + a_i + e_{ij} $$

with $i$ indexing clusters and $j$ responses within clusters. The "random effect" $a_i$ is assumed to follow $N(0, \sigma^2_b)$ (between-cluster variance), and is independent of $e_{ij} \sim N(0, \sigma^2_w)$.

The intraclass correlation coefficient measures the departure from independence, and has form $\rho = \sigma^2_b/(\sigma^2_b + \sigma^2_w)$.

Approximate mixed effects models for the GLM families can be fit using glmmPQL in MASS.

vjcitn/CSHstats documentation built on Jan. 9, 2025, 8:37 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

vjcitn/CSHstats
discussions and exercises on classical statistics for CSHL 2022

In vjcitn/CSHstats: discussions and exercises on classical statistics for CSHL 2022

Review of some inference concepts

Error rates: size and power

p-value: a condensation of the data and assumptions

Confidence interval: an alternative expression of uncertainty

Multiple comparisons

Bonferroni's correction to achieve Family-Wise Error Rate (FWER) control

False discovery rate

Adjusted p-values

Adjustments via weighted FDR

Setup

Demonstration 1: Use allele frequency to form strata for FDR weighting

Demonstration 2: Use distance to gene to weight

Optional: Considerations on Bayesian multiple testing

Linear models

Data setup

Re-expressing the two-sample test

There are two t tests

The linear model has one of these as a special case

A dummy variable design matrix

Recovering the t test with `lm`

Analysis of variance: F test

Linear regression, parameter estimation, goodness of fit

Confounding

GLMs: binary, counted, non-Gaussian responses

Prologue: Fisher's exact test for 2 x 2 table

Binary response: logistic regression

Multiple logistic regression

General case: link and variance functions

Hierarchical models: alternatives to independence

Random effects models for clustered observations

R Package Documentation

Browse R Packages

We want your feedback!

vjcitn/CSHstats discussions and exercises on classical statistics for CSHL 2022

In vjcitn/CSHstats: discussions and exercises on classical statistics for CSHL 2022

Review of some inference concepts

Error rates: size and power

p-value: a condensation of the data and assumptions

Confidence interval: an alternative expression of uncertainty

Multiple comparisons

Bonferroni's correction to achieve Family-Wise Error Rate (FWER) control

False discovery rate

Adjusted p-values

Adjustments via weighted FDR

Setup

Demonstration 1: Use allele frequency to form strata for FDR weighting

Demonstration 2: Use distance to gene to weight

Optional: Considerations on Bayesian multiple testing

Linear models

Data setup

Re-expressing the two-sample test

There are two t tests

The linear model has one of these as a special case

A dummy variable design matrix

Recovering the t test with lm

Analysis of variance: F test

Linear regression, parameter estimation, goodness of fit

Confounding

GLMs: binary, counted, non-Gaussian responses

Prologue: Fisher's exact test for 2 x 2 table

Binary response: logistic regression

Multiple logistic regression

General case: link and variance functions

Hierarchical models: alternatives to independence

Random effects models for clustered observations

R Package Documentation

Browse R Packages

We want your feedback!

vjcitn/CSHstats
discussions and exercises on classical statistics for CSHL 2022

Recovering the t test with `lm`