We are going to focus on the framework known "null hypothesis significance testing". A "null hypothesis" is typically denoted $H_0$ and expresses a claim that often involves a lack of effect of an experimental intervention on an outcome of interest. An "alternative hypothesis" is denoted $H_1$, and it should express a specific measurable impact of an intervention.
In the case of our card deck from chapter S1, it would be typical to postulate
Note that in this case, $H_1$ is fairly vague -- it could be less specific, referring only to lack of fairness. $H_1$ specifies which suit has an excess representation in the deck, but it is not specific about how severe the tampering has been.
In experimental design, it is important to have a clear view of
Only when both of these are in focus can we establish the operating characteristics of our proposed experiment.
In the significance testing framework, we formalize operating characteristics as probabilities of inferential error, described below.
Before we plunge in to more formalities, let's recall how to work with the deck of cards.
Here we create a fair deck:
library(CSHstats) d = build_deck() table(suits(d), faces(d))
Now let's create a severely biased deck:
bd = d # just a copy bd[18] = bd[3] bd[19] = bd[3] bd[20] = bd[4] bd[21] = bd[5] table(suits(bd), faces(bd))
Here are our tools for shuffling and drawing the top card:
shuffle_deck = function(d) sample(d, size=length(d), replace=FALSE) top_draw = function(d) shuffle_deck(d)[1]
Let's make a reproducible random draw:
set.seed(1234) t1 = top_draw(bd) t1
Exercise: create a vector holding 100 draws from the biased deck. Estimate the probability that the suit of the top card is diamond. Use
diamond_sign = function() "\U2662"
to test the suit of each draw. For example:
t1 == diamond_sign()
Note: starting with set.seed(2345), the estimated probability of diamond as suit of top draw that I found was 0.16.
Let $C_1$ denote the suit of the top card revealed after a fair shuffle. We can state a hypothesis about the fairness of the deck under repeated draws of top card after shuffling as
spade_sign = function() "\U2660" diamond_sign = function() "\U2662" club_sign = function() "\U2663"
$$ H_0: Pr(C_1 = \heartsuit) = Pr(C_1 = \diamondsuit) = Pr(C_1 = \clubsuit) = Pr(C_1 = \spadesuit) = 1/4 $$
In a frequentist framework for statistical inference, we define procedures for testing (null) hypotheses with specified error probabilities.
A Type I error occurs when the null hypothesis is true but the test results in the assertion that it is false. Traditionally we try to keep the probability of Type I errors below 5\%.
A Type II error occurs when the null hypothesis is false but the test does not result in an assertion that it is false. Traditionally we try to keep the probability of Type II errors below 20\%.
Synonyms: We often use size as synonymous with Type I error probability, and power as one minus the Type II error probability
1: Propose a procedure for testing $H_0: Pr(C_1 = \heartsuit) = 1/4$. Assume you have the results of top card draws from N shuffles. What would such a procedure look like?
suppressPackageStartupMessages({ suppressMessages({ library(CSHstats) library(EnvStats) }) }) d = build_deck() shuffle_deck = function(d) sample(d, size=length(d), replace=FALSE) heart_sign = function() "\U2661" set.seed(4141) top_draw = function(d) shuffle_deck(d)[1] N = 100 mydat = replicate(N, suits(top_draw(d))==heart_sign()) phat = function(dat) sum(dat)/length(dat) phat(mydat)
With 100 shuffles we have an estimate of the probability of a heart. Is our experimental result consistent with $H_0$?
With a larger sample size:
N = 1000 mydat = replicate(N, suits(top_draw(d))==heart_sign()) sum(mydat)/length(mydat)
Let's use the procedure |phat(dat)-.25|>.01
as our criterion for
rejecting
$H_0: Pr(C_1 = \heartsuit) = 1/4$
What is the Type I error rate
for a fair deck for the experiment based on 100 shuffles?
N = 100 NSIM = 1000 tsts = replicate(NSIM, abs(mean(replicate(N, suits(top_draw(d))==heart_sign()))-.25)>.01) mean(tsts)
Parameterize the "delta", which was 0.01 in the previous run. Let's get the rejection rate estimates for two sample sizes:
prej = function(delta=.01, nullval=.25, NSIM, N) { mean(replicate(NSIM, abs(mean(replicate(N, suits(top_draw(d))==heart_sign()))-nullval)>delta)) } prej(delta=.02, NSIM=1000, N=100) prej(delta=.02, NSIM=1000, N=500)
Our home-cooked test has a Type I error rate that is much too high for standard practice, and depends on the sample size.
Let's instead use the built in procedure for testing hypotheses with binomial outcomes.
The brej
function defined below has two layers:
- inner: given N draws, test that the sum of the number of
hearts is consistent with a specified null value (1/4 by
default), with a specified Type I error rate (argument alpha)
- outer: replicate the inner process NSIM times to obtain
NSIM indicator variables taking value 1 if the test rejected
and 0 otherwise
We then estimate rejection probabilities by taking the means of the indicator variables for different experimental and replication setups.
brej = function(deck, nullval=.25, alpha=0.05, NSIM, N) { replicate(NSIM, { dat = replicate(N, suits(top_draw(deck))==heart_sign()) binom.test(sum(dat), N, nullval)$p.value < alpha }) } mean(brej(d,NSIM=1000, N=100)) mean(brej(d,NSIM=1000, N=500)) mean(brej(d,NSIM=1000, N=100, alpha=0.01))
We have an indication here that the procedure stabilizes the Type I error rate for different designs (sample sizes) and can accommodate different significance levels.
It is often useful to sketch the power of a test procedure over a series of values of an element of the experimental/testing setup. We'll consider how the power to reject the null varies with the sample size, for two alternatives to the null.
We make a biased deck by switching one diamond to a heart. For such a deck, the probability of a heart is 14/52, greater than 1/4.
bd = d bd[18] = bd[4] table(suits(bd), faces(bd))
What kind of sample size do we need to get good power to detect this exception?
ssizes = c(10,50, 100, 300,750) rejps = sapply(ssizes, function(x) mean(brej(bd, NSIM=100, N=x))) plot(ssizes, rejps, main="detect one extra heart")
With two switches, the probability of a heart increases to 15/52.
bd[19] = bd[5] table(suits(bd), faces(bd)) rejps = sapply(ssizes, function(x) mean(brej(bd, NSIM=100, N=x))) plot(ssizes, rejps, main="detect two extra hearts")
Finally, we show that under the null hypothesis (deck is fair) the rejection rate is approximately 0.05 for a range of sample sizes.
rejps_n = sapply(ssizes, function(x) mean(brej(d, NSIM=100, N=x))) plot(ssizes, rejps_n, main="preserve Type I error rate", ylim=c(0,1)) abline(h=0.05, lty=2)
Classical frequentist inference aims to control probabilities of errors of Type I (reject a null hypothesis that is true) and Type II (fail to reject a false null). The power of a test procedure is one minus its Type II error probability.
Typically the null hypothesis specifies the value of a parameter of a population distribution
R's binom.test
is given
and produces a critical region using the binomial distribution
When the sum falls in the critical region, the null hypothesis is rejected
Verification of achievement of the desired Type I and Type II error probabilities for a testing procedure can be pursued via simulation
The power curve for a sequence of experimental conditions can be produced via simulation
Analytical tools for power computation are also available. For example, in the EnvStats package, we find
library(EnvStats) propTestPower(750, p.or.p1=13/52, p0=.25 ) propTestPower(750, p.or.p1=14/52, p0=.25 ) propTestPower(750, p.or.p1=15/52, p0=.25 )
An exercise is to reconcile these results with the simulation-based results given above.
Another is to find the sample size needed to achieve 80% power to detect a single switch from diamond to heart in an otherwise fair deck.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.