Inference

Basic concepts

We are going to focus on the framework known "null hypothesis significance testing". A "null hypothesis" is typically denoted $H_0$ and expresses a claim that often involves a lack of effect of an experimental intervention on an outcome of interest. An "alternative hypothesis" is denoted $H_1$, and it should express a specific measurable impact of an intervention.

In the case of our card deck from chapter S1, it would be typical to postulate

Note that in this case, $H_1$ is fairly vague -- it could be less specific, referring only to lack of fairness. $H_1$ specifies which suit has an excess representation in the deck, but it is not specific about how severe the tampering has been.

In experimental design, it is important to have a clear view of

Only when both of these are in focus can we establish the operating characteristics of our proposed experiment.

In the significance testing framework, we formalize operating characteristics as probabilities of inferential error, described below.

Before we plunge in to more formalities, let's recall how to work with the deck of cards.

Here we create a fair deck:

library(CSHstats)
d = build_deck()
table(suits(d), faces(d))

Now let's create a severely biased deck:

bd = d  # just a copy
bd[18] = bd[3]
bd[19] = bd[3]
bd[20] = bd[4]
bd[21] = bd[5]
table(suits(bd), faces(bd))

Here are our tools for shuffling and drawing the top card:

shuffle_deck = function(d) sample(d, size=length(d), replace=FALSE)
top_draw = function(d) shuffle_deck(d)[1]

Let's make a reproducible random draw:

set.seed(1234)
t1 = top_draw(bd)
t1

Exercise: create a vector holding 100 draws from the biased deck. Estimate the probability that the suit of the top card is diamond. Use

diamond_sign = function() "\U2662"

to test the suit of each draw. For example:

t1 == diamond_sign()

Note: starting with set.seed(2345), the estimated probability of diamond as suit of top draw that I found was 0.16.

The fairness hypothesis

Let $C_1$ denote the suit of the top card revealed after a fair shuffle. We can state a hypothesis about the fairness of the deck under repeated draws of top card after shuffling as

spade_sign = function() "\U2660"
diamond_sign = function() "\U2662"
club_sign = function() "\U2663"

$$ H_0: Pr(C_1 = \heartsuit) = Pr(C_1 = \diamondsuit) = Pr(C_1 = \clubsuit) = Pr(C_1 = \spadesuit) = 1/4 $$

Operating characteristics: Type I and Type II errors

In a frequentist framework for statistical inference, we define procedures for testing (null) hypotheses with specified error probabilities.

Exercise

1: Propose a procedure for testing $H_0: Pr(C_1 = \heartsuit) = 1/4$. Assume you have the results of top card draws from N shuffles. What would such a procedure look like?

Simulating a series of experiments; parameter estimate

suppressPackageStartupMessages({
suppressMessages({
library(CSHstats)
library(EnvStats)
})
})
d = build_deck()
shuffle_deck = function(d) sample(d, size=length(d), replace=FALSE)
heart_sign = function() "\U2661"
set.seed(4141)
top_draw = function(d) shuffle_deck(d)[1]
N = 100
mydat = replicate(N, suits(top_draw(d))==heart_sign())
phat = function(dat) sum(dat)/length(dat)
phat(mydat)

With 100 shuffles we have an estimate of the probability of a heart. Is our experimental result consistent with $H_0$?

With a larger sample size:

N = 1000
mydat = replicate(N, suits(top_draw(d))==heart_sign())
sum(mydat)/length(mydat)

An improvised test

Let's use the procedure |phat(dat)-.25|>.01 as our criterion for rejecting $H_0: Pr(C_1 = \heartsuit) = 1/4$ What is the Type I error rate for a fair deck for the experiment based on 100 shuffles?

N = 100
NSIM = 1000
tsts = replicate(NSIM,
   abs(mean(replicate(N, suits(top_draw(d))==heart_sign()))-.25)>.01)
mean(tsts)

Parameterize the "delta", which was 0.01 in the previous run. Let's get the rejection rate estimates for two sample sizes:

prej = function(delta=.01, nullval=.25, NSIM, N) {
 mean(replicate(NSIM,
   abs(mean(replicate(N, suits(top_draw(d))==heart_sign()))-nullval)>delta))
}
prej(delta=.02, NSIM=1000, N=100)
prej(delta=.02, NSIM=1000, N=500)

Our home-cooked test has a Type I error rate that is much too high for standard practice, and depends on the sample size.

A properly calibrated procedure

Let's instead use the built in procedure for testing hypotheses with binomial outcomes.

The brej function defined below has two layers: - inner: given N draws, test that the sum of the number of hearts is consistent with a specified null value (1/4 by default), with a specified Type I error rate (argument alpha) - outer: replicate the inner process NSIM times to obtain NSIM indicator variables taking value 1 if the test rejected and 0 otherwise

We then estimate rejection probabilities by taking the means of the indicator variables for different experimental and replication setups.

brej = function(deck, nullval=.25, alpha=0.05, NSIM, N) {
 replicate(NSIM, {
   dat = replicate(N, suits(top_draw(deck))==heart_sign())
   binom.test(sum(dat), N, nullval)$p.value < alpha
   })
}
mean(brej(d,NSIM=1000, N=100))
mean(brej(d,NSIM=1000, N=500))
mean(brej(d,NSIM=1000, N=100, alpha=0.01))

We have an indication here that the procedure stabilizes the Type I error rate for different designs (sample sizes) and can accommodate different significance levels.

Power curve

It is often useful to sketch the power of a test procedure over a series of values of an element of the experimental/testing setup. We'll consider how the power to reject the null varies with the sample size, for two alternatives to the null.

We make a biased deck by switching one diamond to a heart. For such a deck, the probability of a heart is 14/52, greater than 1/4.

bd = d
bd[18] = bd[4]
table(suits(bd), faces(bd))

What kind of sample size do we need to get good power to detect this exception?

ssizes = c(10,50, 100, 300,750)
rejps = sapply(ssizes, function(x) mean(brej(bd, NSIM=100, N=x)))
plot(ssizes, rejps, main="detect one extra heart")

With two switches, the probability of a heart increases to 15/52.

bd[19] = bd[5]
table(suits(bd), faces(bd))
rejps = sapply(ssizes, function(x) mean(brej(bd, NSIM=100, N=x)))
plot(ssizes, rejps, main="detect two extra hearts")

Finally, we show that under the null hypothesis (deck is fair) the rejection rate is approximately 0.05 for a range of sample sizes.

rejps_n = sapply(ssizes, function(x) mean(brej(d, NSIM=100, N=x)))
plot(ssizes, rejps_n, main="preserve Type I error rate", ylim=c(0,1))
abline(h=0.05, lty=2)

Conclusions

and produces a critical region using the binomial distribution

When the sum falls in the critical region, the null hypothesis is rejected

library(EnvStats)
propTestPower(750, p.or.p1=13/52, p0=.25 )
propTestPower(750, p.or.p1=14/52, p0=.25 )
propTestPower(750, p.or.p1=15/52, p0=.25 )

An exercise is to reconcile these results with the simulation-based results given above.

Another is to find the sample size needed to achieve 80% power to detect a single switch from diamond to heart in an otherwise fair deck.



vjcitn/CSHstats documentation built on Jan. 9, 2025, 8:37 p.m.