BIO 202 — Lesson 6: Watching the same biology give four different verdicts

What you'll do

Four stages. Commit to a specific prediction before you run anything. Each stage gives the test a chance to disagree with you — pay attention to which input is doing the disagreeing.

The same trap lecture warned you about. Lecture pointed out that a penguin and a dolphin look like they should cluster together: both are stiff-bodied, fast-swimming, dark-on-top-light-on-bottom marine animals. A penguin and a hummingbird, by ecology, look like they have nothing in common. Yet anatomically, penguin and hummingbird are siblings (feathers, beak, three-digit wing) and dolphin is several rooms away. Surface similarity is a test that gets fooled by which features you let it weigh. Today's four scenarios are the statistical version: same shuffle-the-predictor test, four ways for it to come back with a verdict that's about the test's setup rather than the underlying biology.

A — Three groups, two tests run, one not

Three groups. Two pairwise tests are already on the screen. The third one isn't. Predict it before the simulator does.

One frame for the whole lesson. Every test in Stages A–E is the same regression, y ~ β · indicator, fit to two groups at a time. The slope β is the difference in means between the two groups; the readouts report its estimate as δ̂. The empirical P is the fraction of label-shuffled slopes that land at least as far from zero as the observed slope. Every "trap" below is one way that exact slope + empirical-P pair gives a verdict that doesn't say what it looks like it says.
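A minimal sketch of that frame in R, assuming simulated normal data. The seed, the 0.5 effect, and the helper name slope_and_p are illustrative choices, not taken from the simulator. It checks that the slope on a 0/1 indicator equals the difference in group means, then computes the empirical P from shuffled labels.

```r
set.seed(1)                          # illustrative seed
a <- rnorm(30, mean = 0,   sd = 1)   # group coded 0
b <- rnorm(30, mean = 0.5, sd = 1)   # group coded 1 (assumed effect)

# Hypothetical helper: observed slope, raw mean difference, empirical P
slope_and_p <- function(a, b, n_shuffles = 500) {
  y <- c(a, b)
  g <- c(rep(0, length(a)), rep(1, length(b)))
  beta <- unname(coef(lm(y ~ g))[2])            # observed slope
  # Null distribution: mean difference after shuffling the labels,
  # which is the same number lm() would report as the slope
  beta_null <- replicate(n_shuffles, {
    gs <- sample(g)
    mean(y[gs == 1]) - mean(y[gs == 0])
  })
  list(beta = beta,
       mean_diff = mean(b) - mean(a),
       p = mean(abs(beta_null) >= abs(beta)))
}

res <- slope_and_p(a, b)
res$beta - res$mean_diff   # zero up to floating-point error
res$p                      # empirical P, a value between 0 and 1
```

Because the slope and the mean difference coincide, the shuffles can skip lm() entirely, which makes repeated runs much faster.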

What to do

Three groups: Low, Middle, High. Defaults: Low vs. Middle and Middle vs. High both return P > 0.20. Commit to a prediction for P_LH (the Low vs. High comparison) before unlocking the simulator and finding out.

Panel: the three groups, with pairwise empirical P-values and δ̂ for Low vs. Middle, Middle vs. High, and Low vs. High.

Prediction

Q1. Pick the range you think P_LH will fall into:
P ≥ 0.20
P < 0.05
0.05 ≤ P < 0.20

Try at least 5 slider combinations to unlock Stage B.

Controls

Low mean: -0.40 | High mean: 0.40 | within σ: 1.00 | n / group: 30 | seed: 42

R code — the three pairwise comparisons

    set.seed(42)
    n <- 30
    sigma <- 1.00
    muL <- -0.40
    muH <- 0.40
    L <- rnorm(n, muL, sigma)
    M <- rnorm(n, 0, sigma)
    H <- rnorm(n, muH, sigma)
    # Empirical P: share of shuffled-label slopes at least as far from zero as the observed slope
    empP <- function(a, b) {
      y <- c(a, b)
      g <- c(rep(0, length(a)), rep(1, length(b)))        # 0/1 group indicator
      d  <- coef(lm(y ~ g))[2]                             # observed slope
      dN <- replicate(1000, coef(lm(y ~ sample(g)))[2])    # shuffled-label slopes
      mean(abs(dN) >= abs(d))
    }
    empP(L, M); empP(M, H); empP(L, H)
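A single seed gives a single verdict. The sketch below repeats the three pairwise comparisons over 20 seeds at the default slider values, to show how much each P moves from run to run. The helper names emp_p_fast and one_run, the 20 seeds, and the 300 shuffles are assumptions made for speed. The simulator may count things differently.

```r
# Empirical P for a mean difference (equal to the lm slope on a 0/1 indicator)
emp_p_fast <- function(a, b, n_shuffles = 300) {
  y <- c(a, b)
  g <- c(rep(0, length(a)), rep(1, length(b)))
  d <- mean(b) - mean(a)
  d_null <- replicate(n_shuffles, {
    gs <- sample(g)
    mean(y[gs == 1]) - mean(y[gs == 0])
  })
  mean(abs(d_null) >= abs(d))
}

# One simulated data set and its three pairwise empirical P-values
one_run <- function(seed, n = 30, sigma = 1, muL = -0.4, muH = 0.4) {
  set.seed(seed)
  L <- rnorm(n, muL, sigma)
  M <- rnorm(n, 0, sigma)
  H <- rnorm(n, muH, sigma)
  c(LM = emp_p_fast(L, M), MH = emp_p_fast(M, H), LH = emp_p_fast(L, H))
}

# One column per seed, one row per pairwise comparison
p_by_seed <- sapply(1:20, one_run)
round(p_by_seed[, 1:5], 3)

# Share of seeds in which each comparison falls below 0.05
rowMeans(p_by_seed < 0.05)
```

Compare the three rows only after you have committed to your prediction for P_LH.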

B — Two experiments, same true effect

Identical biology. Two sample sizes (n = 30 and n = 300). Run both. Read each verdict.

What to do

Two experiments are running side-by-side with the same true mean difference. The left one has 30 individuals per group; the right has 300. Move the effect-size slider and watch what happens to each verdict.

Panels: Small experiment (n = 30 per group) and Large experiment (n = 300 per group). Each panel reports δ̂ and the empirical P.

Prediction

Q1. Same true δ = 0.3 in both. The two empirical P-values will:
Be roughly equal
Differ: the n = 300 P will be smaller
Differ, but in no consistent direction

Try at least 4 effect-size values to unlock Stage C.

Controls

true δ: 0.30 | within σ: 1.00 | seed: 42

R code

    set.seed(42)
    delta <- 0.30
    sigma <- 1.00
    runP <- function(n) {
      y <- c(rnorm(n, 0, sigma), rnorm(n, delta, sigma))
      g <- c(rep(0, n), rep(1, n))
      d  <- coef(lm(y ~ g))[2]
      dN <- replicate(1000, coef(lm(y ~ sample(g)))[2])
      mean(abs(dN) >= abs(d))
    }
    runP(30); runP(300)
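What changes between the two experiments is the width of the shuffled-label distribution, not the effect. As a rough guide, the standard deviation of a difference between two group means of size n is about σ·√(2/n), which is about 0.258 for n = 30 and about 0.082 for n = 300 when σ = 1. The sketch below compares that approximation with the spread of shuffled slopes. The helper names null_sd and theory_sd and the 500 shuffles are illustrative. The approximation also ignores the slight widening that comes from pooling two groups with different means.

```r
set.seed(42)
delta <- 0.30
sigma <- 1.00

# Spread of the shuffled-label mean differences for one simulated experiment
null_sd <- function(n, n_shuffles = 500) {
  y <- c(rnorm(n, 0, sigma), rnorm(n, delta, sigma))
  g <- c(rep(0, n), rep(1, n))
  d_null <- replicate(n_shuffles, {
    gs <- sample(g)
    mean(y[gs == 1]) - mean(y[gs == 0])
  })
  sd(d_null)
}

# Textbook approximation for the same quantity
theory_sd <- function(n) sigma * sqrt(2 / n)

rbind(n30  = c(simulated = null_sd(30),  theory = theory_sd(30)),
      n300 = c(simulated = null_sd(300), theory = theory_sd(300)))
```

The same δ̂ is then measured against a much narrower yardstick in the larger experiment.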

C — Which one is "clear"?

Two experiments side by side. Decide which has the bigger underlying effect, and which one the test calls "clearly different from zero."

What to do

Look at the two experiments below. Two labs ran them, on the same trait in two different species. Decide which one has the bigger underlying effect, and which one the test signals as "clearly different from zero." See if they agree.

Experiment A: n / group: 200 | within σ: 0.5

Experiment B: n / group: 12 | within σ: 4.0

Each panel reports δ̂ and the empirical P.

Prediction

Q1. Two experiments. A: small effect (δ = −0.3) but tight noise (σ = 0.5) and lots of data (n = 200). B: huge effect (δ = −1.5) but big noise (σ = 4.0) and almost no data (n = 12). Which one will clear P < 0.05?
B: the bigger underlying effect always wins
A: what determines P is δ̂ relative to the shuffled-label noise, and A has far less noise (lower σ and higher n) even though its effect is smaller
Either, depending on the random seed

Run at least 2 reseeds to unlock Stage D.

Controls

seed: 42

R code

    set.seed(42)
    # Experiment A: small effect, clean data, lots of samples
    A0 <- rnorm(200, 0, 0.5); A1 <- rnorm(200, -0.3, 0.5)
    # Experiment B: huge effect, messy data, tiny samples
    B0 <- rnorm(12, 0, 4.0);  B1 <- rnorm(12, -1.5, 4.0)
    empP(A0, A1); empP(B0, B1)   # empP() defined in Stage A code
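One way to put the two experiments on the same scale is a rough signal-to-noise ratio, |δ| / (σ·√(2/n)), which compares the true effect with the approximate spread of a difference in group means. The function name snr is hypothetical, and the ratio uses the true δ and σ from the question rather than estimates from one noisy sample.

```r
# Rough signal-to-noise ratio: |true delta| / (sigma * sqrt(2 / n))
snr <- function(delta, sigma, n) abs(delta) / (sigma * sqrt(2 / n))

snr_A <- snr(delta = -0.3, sigma = 0.5, n = 200)   # Experiment A
snr_B <- snr(delta = -1.5, sigma = 4.0, n = 12)    # Experiment B
c(A = snr_A, B = snr_B)                            # about 6 and about 0.92
```

A ratio well above 2 means the effect stands far outside the shuffled-label noise. A ratio below 1 means the effect is smaller than the typical shuffle-to-shuffle wobble.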

D — Equal means, very different distributions

Two groups with the same mean. One slider controls the spread of the second. Run the test.

What to do

Two groups, same mean. One slider controls how spread out the second group is. Watch the histograms. Watch the P-value.

Two groups (top) and the null distribution of δ̂ under shuffled labels (bottom)

σ_A: 1.0 | σ_B: 1.0 | observed δ̂: — | empirical P: —

Prediction

Q1. Both groups have true mean 0; only σ_B varies. As σ_B grows from 1 to 5, the empirical P will:
Shrink toward 0
Hover near 0.5
Climb toward 1

Try at least 4 σ_B values to wrap up.

Controls

σ_B: 1.00 | n / group: 50 | seed: 42

R code

    set.seed(42)
    n <- 50
    sigma_B <- 1.00
    A <- rnorm(n, 0, 1)
    B <- rnorm(n, 0, sigma_B)
    empP(A, B)   # empP() defined in Stage A code
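The slider sweep can also be scripted. The sketch below runs the same mean-difference test at four σ_B values. The particular values, the helper name emp_p_fast, and the 300 shuffles are assumptions. Each σ_B also draws fresh data here, which may differ from how the simulator reuses its seed.

```r
# Empirical P for a mean difference (equal to the lm slope on a 0/1 indicator)
emp_p_fast <- function(a, b, n_shuffles = 300) {
  y <- c(a, b)
  g <- c(rep(0, length(a)), rep(1, length(b)))
  d <- mean(b) - mean(a)
  d_null <- replicate(n_shuffles, {
    gs <- sample(g)
    mean(y[gs == 1]) - mean(y[gs == 0])
  })
  mean(abs(d_null) >= abs(d))
}

set.seed(42)
n <- 50
sigma_B_values <- c(1, 2, 3, 5)          # illustrative slider positions

# Both groups have true mean 0; only the spread of B changes
p_by_sigma <- sapply(sigma_B_values, function(sB) {
  A <- rnorm(n, 0, 1)
  B <- rnorm(n, 0, sB)
  emp_p_fast(A, B)
})
names(p_by_sigma) <- paste0("sigma_B=", sigma_B_values)
p_by_sigma
```

Rerun with a few different seeds before reading a trend into four numbers.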

E — A two-group test whose verdict is the wrong reading of the world

Same machine — empirical-P of a regression slope — applied to weekend-level data on movie attendance and crime. The slope is clearly negative. The interpretation is wrong.

Scenario

Each point is one weekend (synthetic, faithful to the structure of Dahl & DellaVigna 2009). x = total movie attendance that weekend, in millions. y = change in violent crime rate that weekend, %. Color = whether the dominant new release was a "violent" movie (think action / superhero / horror) or not.

Fit the slope on all the points. Then color by movie type. Then ask: is "violent movies cause less crime" what the data actually said?

Weekend movie attendance vs change in violent crime rate

pooled slope: — | within-violent slope: — | within-non-violent slope: —

Prediction

Q1. The pooled slope is clearly negative: more movie attendance → less violent crime. The within-violent-movie slope and the within-non-violent-movie slope, looked at separately, are:
Both still strongly negative: violence-inhibiting effect is robust
Both close to zero: the pooled negative slope is an artifact of which weekends have which type of movie released
Opposite signs: paradox in the direct sense
Q2. The real story (per Dahl & DellaVigna 2009): movies that draw young men into theaters reduce crime, because those are the people who'd otherwise be out committing it. "Violent" isn't doing the causal work; "attended by young men" is. Which causal claim does the pooled slope support, on its own?
Violent movies cause less crime
Weekends where young men are in theaters have less crime; "violent" is a noisy proxy for that.
Both: the data are consistent with either.

Toggle the color-by-type view at least once.

Controls

view

R code — proxy / mediator, not collider

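    # Assumes a data frame `weekends` with columns crime, attendance, violent (0/1)
    # Pooled slope: more total attendance, less violent crime
    pooled <- lm(crime ~ attendance, data = weekends)
    # Stratify by movie type: within-stratum slopes near zero.
    # `attendance` is the non-violent slope; `attendance:violent` is the violent-minus-non-violent difference.
    strat  <- lm(crime ~ attendance * violent, data = weekends)
    summary(pooled)$coefficients[2, ]
    summary(strat)$coefficients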

The actual DAG behind the data

Dahl & DellaVigna's reading: "young men in theaters" is the unmeasured cause. "Violent movie" is a noisy proxy for it (violent releases skew toward young-male audiences). Build it below: add type → attendance (violent movies draw bigger weekends) and type → crime (those audiences are off the streets). Watch the pooled scatter mimic the empirical pattern even though "type" never directly affects "crime" through violence per se.
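A minimal synthetic version of that DAG, for readers who want to see the pattern outside the builder. Every number here (200 weekends, the attendance and crime shifts, the noise levels) is an assumption chosen for illustration and is not drawn from the published data. Crime depends on movie type only, never on attendance itself.

```r
set.seed(42)
n_weekends <- 200                                   # assumed sample size
violent <- rbinom(n_weekends, 1, 0.5)               # 1 = violent release dominates

# type -> attendance: violent releases draw bigger weekends (assumed sizes)
attendance <- rnorm(n_weekends, mean = 10 + 6 * violent, sd = 1.5)

# type -> crime: crime shifts with type, with no direct attendance effect
crime <- rnorm(n_weekends, mean = -3 * violent, sd = 1)

weekends <- data.frame(crime, attendance, violent)

# Slope of crime on attendance for any subset of weekends
att_slope <- function(d) unname(coef(lm(crime ~ attendance, data = d))["attendance"])

pooled            <- att_slope(weekends)
within_violent    <- att_slope(subset(weekends, violent == 1))
within_nonviolent <- att_slope(subset(weekends, violent == 0))

round(c(pooled = pooled,
        within_violent = within_violent,
        within_nonviolent = within_nonviolent), 3)
```

Under these assumed settings the pooled slope comes out clearly negative while both within-type slopes sit near zero, so the pooled line is describing the gap between the two clouds rather than any trend inside them.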