Lesson 6 — Watching the same biology give four different verdicts

BIO 202, Spring 2026, draft v2. Four scenarios. The same shuffle-the-predictor test from Lesson 5 — and four times when its verdict reflects how the test was set up rather than what the world is.

What you'll do

Five stages, A through E. Commit to a specific prediction before you run anything. Each stage gives the test a chance to disagree with you; pay attention to which input is doing the disagreeing.

The same trap lecture warned you about. Lecture pointed out that a penguin and a dolphin look like they should cluster together: both are stiff-bodied, fast-swimming, dark-on-top-light-on-bottom marine animals. A penguin and a hummingbird, by ecology, look like they have nothing in common. Yet anatomically, penguin and hummingbird are siblings (three-digit wing, feathers, beak) and dolphin is several rooms away. Surface similarity is a test that gets fooled by which features you let it weigh. Today's four scenarios are the statistical version: same shuffle-the-predictor test, four ways for it to come back with a verdict that's about the test's setup rather than the underlying biology.

A — Three groups, two tests run, one not

Three groups. Two pairwise tests are already on the screen. The third one isn't. Predict it before the simulator does.

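One frame for the whole lesson. Every test in Stages A–E is the same machinery: a regression slope plus an empirical P obtained by shuffling the predictor. In the two-group stages the regression is y ~ β · indicator, fit to two groups at a time, so the slope β is the mean difference between the two groups (the on-screen readouts show its estimate as δ̂). The empirical P is the fraction of label-shuffled slopes that land at least as far from zero as the observed slope. Every "trap" below is one way that exact pairing of β and empirical P gives a verdict that doesn't say what it looks like it says.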
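The frame above can be checked directly. The following is a minimal sketch, not the simulator's own code: the group sizes, means, and variable names (a, b, beta, mean_diff, emp_p) are assumptions chosen for illustration. It confirms that the slope on a 0/1 indicator is the difference in group means, then computes the empirical P by shuffling the indicator.

```r
set.seed(1)

# Two illustrative groups (sizes and means are assumptions for this sketch)
a <- rnorm(20, mean = 0,   sd = 1)   # group coded 0
b <- rnorm(20, mean = 0.5, sd = 1)   # group coded 1

y <- c(a, b)
g <- c(rep(0, length(a)), rep(1, length(b)))

# Slope of y on the 0/1 indicator
beta <- unname(coef(lm(y ~ g))[2])

# Difference in group means, computed directly
mean_diff <- mean(b) - mean(a)

# The two agree up to floating-point error
all.equal(beta, mean_diff)

# Empirical P: share of shuffled-label slopes at least as far from zero
beta_null <- replicate(1000, coef(lm(y ~ sample(g)))[2])
emp_p <- mean(abs(beta_null) >= abs(beta))
emp_p
```

Because the slope is just a mean difference, anything that changes the spread of shuffled mean differences (sample size, within-group noise) changes the verdict without changing the biology.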

What to do

Three groups: Low, Middle, High. Defaults: Low vs. Middle and Middle vs. High both return P > 0.20. Commit to a prediction for P_LH (the Low vs. High comparison) before unlocking the simulator and finding out.

The three groups

Pairwise empirical P-values (live readout of δ̂ and P for each pair)

Low vs. Middle
Middle vs. High
Low vs. High

Prediction

  1. Q1. Pick the range you think P_LH will fall into:
Try at least 5 slider combinations to unlock Stage B.

Controls (defaults)

μ_L = -0.40
μ_H = 0.40
σ = 1.00
n per group = 30
seed = 42

R code — the three pairwise comparisons

set.seed(42)
n <- 30
sigma <- 1.00
muL <- -0.40
muH <- 0.40
L <- rnorm(n, muL, sigma)
M <- rnorm(n, 0, sigma)
H <- rnorm(n, muH, sigma)
empP <- function(a, b) {
  y <- c(a, b); g <- c(rep(0, length(a)), rep(1, length(b)))
  d  <- coef(lm(y ~ g))[2]
  dN <- replicate(1000, coef(lm(y ~ sample(g)))[2])
  mean(abs(dN) >= abs(d))
}
empP(L, M); empP(M, H); empP(L, H)
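One way to check your prediction against arithmetic rather than a single random draw: compare each true mean difference to the standard error of a two-group mean difference, σ · sqrt(2 / n). This is a back-of-envelope sketch under a normal approximation, using the default slider values; the names se_diff and z_true are introduced here for illustration and are not part of the simulator.

```r
# Default Stage A settings
n     <- 30
sigma <- 1.00
muL   <- -0.40
muM   <- 0
muH   <- 0.40

# Standard error of a difference between two group means of size n each
se_diff <- sigma * sqrt(2 / n)

# True mean differences, expressed in standard-error units
z_true <- c(LM = muM - muL, MH = muH - muM, LH = muH - muL) / se_diff
round(z_true, 2)   # roughly 1.55, 1.55, 3.10
```

The Low vs. High gap is twice as many standard errors from zero as either adjacent gap, which is why two unremarkable neighboring comparisons do not make the outer comparison unremarkable.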

B — Two experiments, same true effect

Identical biology. Two sample sizes (n = 30 and n = 300). Run both. Read each verdict.

What to do

Two experiments are running side-by-side with the same true mean difference. The left one has 30 individuals per group; the right has 300. Move the effect-size slider and watch what happens to each verdict.

Small experiment (n = 30 per group): live readout of δ̂ and P

Large experiment (n = 300 per group): live readout of δ̂ and P

Prediction

  1. Q1. Same true δ = 0.3 in both. The two empirical P-values will:
Try at least 4 effect-size values to unlock Stage C.

Controls (defaults)

δ = 0.30
σ = 1.00
seed = 42

R code

set.seed(42)
delta <- 0.30
sigma <- 1.00
runP <- function(n) {
  y <- c(rnorm(n, 0, sigma), rnorm(n, delta, sigma))
  g <- c(rep(0, n), rep(1, n))
  d  <- coef(lm(y ~ g))[2]
  dN <- replicate(1000, coef(lm(y ~ sample(g)))[2])
  mean(abs(dN) >= abs(d))
}
runP(30); runP(300)
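A single seed gives one verdict per sample size. To see how often each design would clear P < 0.05, the experiment can be repeated many times. The sketch below is an illustration under stated assumptions: it swaps in a Welch t-test P-value (t.test) as a fast stand-in for the shuffle-based empirical P, and the helper name rejects and the choice of 500 repeats are introduced here, not taken from the simulator.

```r
set.seed(42)
delta <- 0.30
sigma <- 1.00

# Fraction of repeated experiments with P < 0.05 at a given per-group n.
# Assumption: Welch t-test P used in place of the shuffle P, for speed.
rejects <- function(n, reps = 500) {
  mean(replicate(reps, {
    a <- rnorm(n, 0, sigma)
    b <- rnorm(n, delta, sigma)
    t.test(a, b)$p.value < 0.05
  }))
}

power_30  <- rejects(30)
power_300 <- rejects(300)
c(n30 = power_30, n300 = power_300)
```

The true effect is identical in both calls; only n differs, so any gap between the two fractions belongs to the design rather than to the biology.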

C — Which one is "clear"?

Two experiments side by side. Decide which has the bigger underlying effect, and which one the test calls "clearly different from zero."

What to do

Look at the two experiments below. Two labs ran them, on the same trait in two different species. Decide which one has the bigger underlying effect, and which one the test signals as "clearly different from zero." See if they agree.

Experiment A (n per group: 200, within-group σ: 0.5): live readout of δ̂ and P

Experiment B (n per group: 12, within-group σ: 4.0): live readout of δ̂ and P

Prediction

  1. Q1. Two experiments. A: small effect (δ = −0.3) but tight noise (σ = 0.5) and lots of data (n = 200). B: huge effect (δ = −1.5) but big noise (σ = 4.0) and almost no data (n = 12). Which one will clear P < 0.05?
Run at least 2 reseeds to unlock Stage D.

Controls (defaults)

seed = 42

R code

set.seed(42)
# Experiment A: small effect, clean data, lots of samples
A0 <- rnorm(200, 0, 0.5); A1 <- rnorm(200, -0.3, 0.5)
# Experiment B: huge effect, messy data, tiny samples
B0 <- rnorm(12, 0, 4.0);  B1 <- rnorm(12, -1.5, 4.0)
empP(A0, A1); empP(B0, B1)   # empP() defined in Stage A code
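The two experiments can also be compared on paper. The sketch below uses the true parameters from the question (not the sampled data) and a normal approximation; the data frame expts and its column names are assumptions introduced for illustration. It puts each effect on three scales: raw units, within-group SD units, and standard-error units.

```r
# True settings for the two experiments, taken from the Stage C question
expts <- data.frame(
  name  = c("A", "B"),
  delta = c(-0.3, -1.5),
  sigma = c(0.5, 4.0),
  n     = c(200, 12)
)

# Effect size in units of within-group SD
expts$d  <- abs(expts$delta) / expts$sigma

# Standard error of the mean difference, and the effect in SE units
expts$se <- expts$sigma * sqrt(2 / expts$n)
expts$z  <- abs(expts$delta) / expts$se

expts
```

The raw difference is five times larger in B, but the test never sees raw units on their own. It sees the difference relative to how much shuffled differences wobble, which is set by σ and n.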

D — Equal means, very different distributions

Two groups with the same mean. One slider controls the spread of the second. Run the test.

What to do

Two groups, same mean. One slider controls how spread out the second group is. Watch the histograms. Watch the P-value.

Two groups (top) and the null distribution of δ̂ under shuffled labels (bottom)

Readout: σ_A = 1.0, σ_B = 1.0 (defaults), observed δ̂, empirical P

Prediction

  1. Q1. Both groups have true mean 0; only σ_B varies. As σ_B grows from 1 to 5, the empirical P will:
Try at least 4 σ_B values to wrap up.

Controls (defaults)

σ_B = 1.00
n per group = 50
seed = 42

R code

set.seed(42)
n <- 50
sigma_B <- 1.00
A <- rnorm(n, 0, 1)
B <- rnorm(n, 0, sigma_B)
empP(A, B)   # empP() defined in Stage A code
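The Stage D test shuffles labels and asks about one statistic, the difference in means. As a hedged illustration of how much the choice of statistic matters, the sketch below runs the same shuffle twice on the same data, once on means and once on standard deviations. The helper shuffleP, its stat argument, and the choice σ_B = 5 are assumptions made for this example; the simulator itself only reports the mean-difference version.

```r
set.seed(42)
n <- 50
A <- rnorm(n, 0, 1)
B <- rnorm(n, 0, 5)   # same true mean as A, five times the spread

# Shuffle test on an arbitrary summary statistic (hypothetical helper)
shuffleP <- function(a, b, stat, reps = 1000) {
  y <- c(a, b)
  g <- c(rep(0, length(a)), rep(1, length(b)))
  obs <- stat(y[g == 1]) - stat(y[g == 0])
  null <- replicate(reps, {
    gs <- sample(g)                          # shuffle the group labels
    stat(y[gs == 1]) - stat(y[gs == 0])
  })
  mean(abs(null) >= abs(obs))
}

p_mean <- shuffleP(A, B, mean)   # asks: do the means differ?
p_sd   <- shuffleP(A, B, sd)     # asks: do the spreads differ?
c(mean = p_mean, sd = p_sd)
```

Both calls use identical data and identical shuffling. Whatever the two numbers turn out to be, they answer different questions, so a verdict on the means cannot be read as a statement that the two distributions are the same.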

E — A two-group test whose verdict is the wrong reading of the world

Same machine — empirical-P of a regression slope — applied to weekend-level data on movie attendance and crime. The slope is clearly negative. The interpretation is wrong.

Scenario

Each point is one weekend (synthetic, faithful to the structure of Dahl & DellaVigna 2009). x = total movie attendance that weekend, in millions. y = change in violent crime rate that weekend, %. Color = whether the dominant new release was a "violent" movie (think action / superhero / horror) or not.

Fit the slope on all the points. Then color by movie type. Then ask: is "violent movies cause less crime" what the data actually said?

Weekend movie attendance vs change in violent crime rate

Readout: pooled slope, within-violent slope, within-non-violent slope

Prediction

  1. Q1. The pooled slope is clearly negative: more movie attendance → less violent crime. The within-violent-movie slope and the within-non-violent-movie slope, looked at separately, are:
  2. Q2. The real story (per Dahl & DellaVigna 2009): movies that draw young men into theaters reduce crime, because those are the people who'd otherwise be out committing it. "Violent" isn't doing the causal work; "attended by young men" is. Which causal claim does the pooled slope support, on its own?
Toggle the color-by-type view at least once.

R code — proxy / mediator, not collider

# Assumes a data frame `weekends` with columns crime, attendance, violent
# Pooled slope: more attendance → fewer crimes
pooled <- lm(crime ~ attendance, data = weekends)
# Stratify by movie type: within-stratum slopes near zero
strat  <- lm(crime ~ attendance * violent, data = weekends)
summary(pooled)$coefficients[2, ]
summary(strat)$coefficients

The actual DAG behind the data

Dahl & DellaVigna's reading: "young men in theaters" is the unmeasured cause. "Violent movie" is a noisy proxy for it (violent releases skew toward young-male audiences). Build it below: add type → attendance (violent movies draw bigger weekends) and type → crime (those audiences are off the streets). Watch the pooled scatter mimic the empirical pattern even though "type" never directly affects "crime" through violence per se.
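The two arrows described above can be simulated in a few lines. This is a sketch under loudly stated assumptions: every number (200 weekends, attendance means of 20 and 30 million, a 2-point crime shift) is hypothetical and chosen only to make the pattern visible, and the variable names are introduced here. Attendance has no effect on crime anywhere in the generating code.

```r
set.seed(42)
n_weekends <- 200   # hypothetical sample size

# Movie type: 1 = dominant release is "violent", 0 = not
violent <- rbinom(n_weekends, 1, 0.5)

# type -> attendance: violent releases draw bigger weekends (made-up numbers)
attendance <- rnorm(n_weekends, mean = 20 + 10 * violent, sd = 3)

# type -> crime: crime shifts with type only; attendance is NOT in this line
crime <- rnorm(n_weekends, mean = -2 * violent, sd = 1)

weekends <- data.frame(crime, attendance, violent)

# Pooled slope vs. slopes within each movie type
pooled_slope <- unname(coef(lm(crime ~ attendance, data = weekends))[2])
slope_violent <- unname(coef(lm(crime ~ attendance,
                                data = subset(weekends, violent == 1)))[2])
slope_nonviolent <- unname(coef(lm(crime ~ attendance,
                                   data = subset(weekends, violent == 0)))[2])

c(pooled = pooled_slope,
  within_violent = slope_violent,
  within_nonviolent = slope_nonviolent)
```

The pooled fit picks up the gap between the two clouds of points, while each within-type fit only sees noise around a flat line. The same shuffle test would call the pooled slope clearly non-zero, and it would be right about the slope and silent about the cause.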