BIO 202, Spring 2026, draft v2. One regression, one shuffle, one tail. Used four times.
Four scenarios. In each, fit the regression, shuffle the predictor, look at where the observed coefficient sits in the cloud of shuffled ones. The tail fraction is what you'll report.
Generate group A (g = 0) and group B (g = 1). Fit y ~ α + δ·g. Shuffle g, refit, build the null histogram of shuffled δ̂'s. Read the tail fraction.
Two groups: A (g = 0, mean 0) and B (g = 1, mean δ). Both with within-group SD σ and n points each. Stack them and fit y ~ α + δ·g.
yi ~ Normal(α + δ·gi, σ)
To ask "is δ̂ clearly nonzero?", shuffle g, refit 1000 times, build the bottom histogram. The empirical P is the tail.
set.seed(42)n <- 1000sigma <- 0.2delta <- 0.01y <- c(rnorm(n, 0, sigma), rnorm(n, delta, sigma))g <- c(rep(0, n), rep(1, n))# the model: y ~ N(alpha + delta*g, sigma). delta_hat = group difference.d_obs <- coef(lm(y ~ g))[2]# shuffle g to break the relationship; refit; collect null delta_hat'sd_null <- replicate(1000, coef(lm(y ~ sample(g)))[2])# empirical two-sided P -- fraction at least as extreme as observedmean(abs(d_null) >= abs(d_obs))
A 0.5σ effect, moderate by any measure. New defaults: n = 8 per group, σ large. Rerun the same shuffle. Decide for yourself where δ̂ sits relative to the null.
Same regression. Same shuffle. New inputs: n = 8 per group, σ large, δ moderate (0.5·σ).
Watch the null histogram in the bottom panel and decide for yourself whether the observed δ̂ sits inside or outside it.
set.seed(42)n <- 8sigma <- 2.0delta <- 1.0y <- c(rnorm(n, 0, sigma), rnorm(n, delta, sigma))g <- c(rep(0, n), rep(1, n))d_obs <- coef(lm(y ~ g))[2]d_null <- replicate(1000, coef(lm(y ~ sample(g)))[2])mean(abs(d_null) >= abs(d_obs))
Five real datasets. For each one, fit y ~ α + β·year, shuffle the y values 1000 times, refit. Guess the 95% envelope of the resulting null histogram before revealing it.
Each round: a short real time series. Fit y ~ α + β·x. Shuffle the y values to build the null histogram of shuffled β̂'s.
Guess the 95% interval of that null first. Then reveal.
Rounds 1–2 are the Grant lab's Galápagos Geospiza fortis — Darwin's same archipelago, beaks measured in actual millimeters across the 1977 drought. The shuffle null is the test Darwin's argument never had.
# Per-round: x = year (or generation), y = trait mean / fitness.b_obs <- coef(lm(y ~ x))[2]b_null <- replicate(1000, { coef(lm(sample(y) ~ x))[2]})quantile(b_null, c(0.025, 0.975)) # 95% null envelopemean(abs(b_null) >= abs(b_obs)) # empirical two-sided P
You just got an empirical P. The slope β̂(beak depth ~ year) for the 1973–1977 Grant fortis is far outside the shuffle null. The test says "year predicts beak depth." But "year" doesn't do anything to a finch — it's a label on the x-axis.
Build a quick causal model below. Add arrows. Watch the simulated data on the right change. Then ask: which model would produce the slope you just observed, and which would not?
The "year → beak" arrow is what the shuffle test directly evaluates. The "drought → year, drought → beak" pair is the story Darwin would have written. The test cannot distinguish them — it's not the test's job. The job of the test is to tell you the apparent slope isn't from random reshuffling; the job of the DAG is to tell you what to do next.
Three sliders: δ, n, σ. Set them to produce three target scenarios.
Three challenges, in order:
set.seed(42)n <- 30sigma <- 1.0delta <- 0.5y <- c(rnorm(n, 0, sigma), rnorm(n, delta, sigma))g <- c(rep(0, n), rep(1, n))d_obs <- coef(lm(y ~ g))[2]d_null <- replicate(1000, coef(lm(y ~ sample(g)))[2])mean(abs(d_null) >= abs(d_obs))
Take the "huge δ, small n, P > 0.05" scenario and ask: how many more samples per group would I need to make this difference clearly distinguishable from the shuffled-label null? Re-run the shuffle null at several n values and find the smallest n where the empirical P stays below 0.05 across, say, 10 different seeds. Report the required n. Hit "I tried it" after you have a number.